Mixed Attention Network for Cross-domain Sequential Recommendation

In modern recommender systems, sequential recommendation leverages chronological user behaviors to make effective next-item suggestions, but it suffers from data sparsity issues, especially for new users. One promising line of work is cross-domain recommendation, which trains models with data across multiple domains to improve performance in data-scarce domains. Recently proposed cross-domain sequential recommendation models such as PiNet and DASL share a common drawback: they rely heavily on users overlapping across domains, which limits their use in practical recommender systems. In this paper, we propose a Mixed Attention Network (MAN) with local and global attention modules to extract domain-specific and cross-domain information. First, we propose a local/global encoding layer to capture the domain-specific/cross-domain sequential patterns. Then we propose a mixed attention layer with item similarity attention, sequence-fusion attention, and group-prototype attention to capture the local/global item similarity, fuse the local/global item sequences, and extract user groups across different domains, respectively. Finally, we propose a local/global prediction layer to further evolve and combine the domain-specific and cross-domain interests. Experimental results on two real-world datasets (each with two domains) demonstrate the superiority of our proposed model. Further study also illustrates that our proposed method and components are model-agnostic and effective, respectively. The code and data are available at https://github.com/Guanyu-Lin/MAN.


INTRODUCTION
Recommender systems are widespread in online platforms such as news, video, and e-commerce, and by vastly improving the efficiency of information distribution and diffusion they are of great importance in today's Web. Sequential recommendation [37] is one of the most important research problems in recommender systems, aiming to predict a user's next interacted item based on their historical interaction sequence. Though recent representative sequential recommendation models such as GRU4REC [10], SASRec [14], and SURGE [2] have achieved decent performance, they suffer from data sparsity [39], which limits their performance.
To address the data sparsity issue, cross-domain recommendation [6,31,42] is a widely adopted approach, which leverages data from multiple domains to boost the performance of the data-scarce domain via parameter sharing [6] or multi-task learning [26]. In particular, a few early attempts [3,18,28] were made at cross-domain sequential recommendation, which leverages cross-domain techniques to address the data sparsity of sequential modeling. However, as illustrated in Figure 1(a), these methods rely heavily on overlapped users and require pairwise inputs from two domains for the same bridge users, which is hardly satisfied in practical scenarios. For example, in our experimental benchmark datasets (Micro Video and Amazon), only a small fraction (at most 8.37%) of users overlap, as shown in Table 1, which violates the assumption of existing approaches. In fact, in many real-world applications, users do not overlap across domains [20,23]. Thus, it is challenging for existing cross-domain sequential recommendation methods to work in real-world scenarios. Actually, there are three key challenges for cross-domain sequential recommendation:
• Different item characteristics across domains. Even when items are shared across domains, they reflect different characteristics. For example, on a higher-end e-commerce website, the price aspect takes less effect when users purchase items, while it plays an important role on a lower-end website. Such differences bring difficulty in learning accurate item representations across domains.
• Various sequential patterns across domains. Similar to items, sequential behaviors vary across domains. For example, users may be more decisive on a higher-end e-commerce website, leading to very short sequences with very brief sequential patterns. Sequential patterns therefore vary widely, making the modeling challenging.
• User preference transfer without overlapped users. We focus on the general cross-domain recommendation task, where users may not fully overlap. Therefore, it is challenging to capture the common preference shared by users across domains, especially when there are no overlapped users at all.
To address these challenges, we develop a novel group-based method with group transfer to avoid dependence on user overlap, and a global space to capture the item characteristics and sequential patterns across different domains, as shown in Figure 1(b). Note that the group-prototype attention here can capture group information in an unsupervised manner, without requiring additional input information compared with Figure 1(a). Specifically, we propose a novel solution named MAN (short for Mixed Attention Network for Cross-domain Sequential Recommendation), consisting of local and global modules that mix three types of designed attention networks at the item level, sequence level, and group level. First, we generate separate representations for each item, including a local representation capturing domain-specific characteristics and a global representation shared by different domains. We then design an item similarity attention module to capture the similarity between the local/global item representations and the target item representation. Second, we propose a sequence-fusion attention module to fuse the local and global item sequential representations. Most importantly, although user information cannot be directly shared, group information can be shared across domains. Therefore, we propose a group-prototype attention module, which utilizes multiple group prototypes to transfer information at the group level. Finally, the obtained local and global embeddings are fed into the corresponding prediction layers to evolve the domain-specific and cross-domain interests.
The contributions of this paper can be summarized as follows.
• We approach the problem of cross-domain sequential recommendation from a more practical perspective with no prior assumption of overlapped users across domains, which is far more challenging.

PROBLEM FORMULATION
In our problem of cross-domain sequential recommendation, we use A and B to denote the two domains. Let I^A and I^B denote the sets of items in domain A and B, respectively. More specifically, supposing i_t^A ∈ I^A or i_t^B ∈ I^B is the t-th item that a given user has interacted with in domain A or B, the T-length sequence of historical items can be represented as (i_1^A, i_2^A, ..., i_T^A) or (i_1^B, i_2^B, ..., i_T^B).

Local/Global Encoding Layer
We first build local and global item embeddings. Then we look up the item embeddings and encode them with local and global encoders at the sequence level. Here we take domain A to illustrate each proposed component in detail.

Each item is first embedded as a dense vector, following existing works [2,14]. To further capture the position of items in the sequence, we also integrate learnable positional embeddings into the item embeddings (taking domain A as an example):

Ê^A = E^A + P; Ê^{Ag} = E^{Ag} + P; S^A = Encoder(Ê^A); S^{Ag} = Encoder^g(Ê^{Ag}),

where Encoder and Encoder^g are the sequential backbone models (i.e., SASRec [14] or SURGE [2]) with independent and shared parameters, respectively, across domains.
Based on this, we obtain S^A (S^B) and S^{Ag} (S^{Bg}), which capture the local sequential patterns and global sequential patterns, respectively, in domain A (B).
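As an illustrative sketch of the local/global encoding step (not the authors' implementation), the two views can be expressed in numpy; the one-layer tanh encoder stands in for SASRec/SURGE, and all names, shapes, and initializations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d, T = 50, 8, 6                   # vocabulary size, embedding dim, sequence length

E_local_A = rng.normal(size=(n_items, d))  # domain-specific item table for domain A
E_global = rng.normal(size=(n_items, d))   # global item table shared by domains A and B
P = rng.normal(size=(T, d))                # learnable positional embeddings

def encode(seq, table, W):
    """Look up item embeddings, add positions, apply a stand-in one-layer encoder."""
    X = table[seq] + P[: len(seq)]         # (T, d)
    return np.tanh(X @ W)                  # placeholder for SASRec/SURGE

seq_A = rng.integers(0, n_items, size=T)   # a user's item sequence in domain A
W_local_A = rng.normal(size=(d, d))        # encoder parameters independent per domain
W_shared = rng.normal(size=(d, d))         # encoder parameters shared across domains

S_A = encode(seq_A, E_local_A, W_local_A)  # local sequential representation
S_Ag = encode(seq_A, E_global, W_shared)   # global sequential representation
assert S_A.shape == S_Ag.shape == (T, d)
```

The key design choice mirrored here is that only the encoder applied to the global embedding table shares parameters across domains.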

Mixed Attention Layer
In this section, we first propose item similarity attention to extract similar items from the local and global spaces. Then we propose sequence-fusion attention to further fuse the local and global item sequence representations, combining the domain-specific and cross-domain sequential patterns. Finally, we propose group-prototype attention to extract the group patterns across domains.

Item Similarity Attention.
To capture the similarity between the local/global item embeddings and the target item embedding, we first fuse the item embeddings from the local space (i.e., E^A and E^B) and the global space (i.e., E^{Ag} and E^{Bg}) together. Specifically, given a user in domain A (B), we can calculate the item similarity scores F^A (F^B) between his/her historical items and the target item as:

F^A = Softmax((E^A ∥ E^{Ag})(e_t^A ∥ e_t^{Ag})^⊤), (5)

where e_t^A (e_t^B) denotes the embedding of the target item for domain A (B) and ∥ denotes the concatenation operation. Based on the item similarity scores, we can then weigh similar historical items' embeddings to refine the item embeddings as:

Ē^A = F^A ⊙ (E^A ∥ E^{Ag}); Ē^B = F^B ⊙ (E^B ∥ E^{Bg}), (6)

where Ē^A and Ē^B ∈ R^{T×2d} are the representations of the target items' similar historical items, weighted by the similarity scores F^A and F^B in domain A and domain B, respectively. Here F^A and F^B denote the item similarity of domain A and B, respectively.
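A minimal numpy sketch of this item-level reweighting, under the assumption that local and global embeddings are fused by concatenation and the scores are softmax-normalized (dimensions and scaling are illustrative, not the authors' exact formulation):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 8
E_loc, E_glo = rng.normal(size=(T, d)), rng.normal(size=(T, d))  # historical items
t_loc, t_glo = rng.normal(size=d), rng.normal(size=d)            # target item

# Fuse the local and global spaces by concatenation (the || operator in the text).
H = np.concatenate([E_loc, E_glo], axis=1)          # (T, 2d)
t = np.concatenate([t_loc, t_glo])                  # (2d,)

# Similarity of each historical item to the target, normalized with a softmax.
scores = H @ t / np.sqrt(2 * d)
F = np.exp(scores - scores.max()); F = F / F.sum()

E_refined = F[:, None] * H                          # reweighted historical items
assert E_refined.shape == (T, 2 * d)
```

Items similar to the target thus dominate the refined representation, while dissimilar history is suppressed.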

3.2.2 Sequence-fusion Attention. After obtaining S^A, S^B, S^{Ag}, and S^{Bg}, we then fuse them to combine the domain-specific and cross-domain sequential patterns:

S̄^A = MLP(CA(S^A, S^{Ag}) + S^A); S̄^B = MLP(CA(S^B, S^{Bg}) + S^B), (7)

where the cross-attention (CA) layer [36] is defined as follows (taking S^A as an example):

CA(S^A, S^{Ag}) = Atten(S^A W^Q, S^{Ag} W^K, S^{Ag} W^V), (8)

where W^Q, W^K, W^V ∈ R^{d×d} are parameters to be learned and the Atten function is defined as:

Atten(Q, K, V) = Softmax(QK^⊤/√d) V, (9)

where Q, K, and V are the query matrix, key matrix, and value matrix, respectively.
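The scaled dot-product attention and the cross-attention fusion described above can be sketched in numpy (a stand-in illustration with random parameters, not the trained model; the tanh MLP is an assumption):

```python
import numpy as np

def atten(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def cross_attention(S_q, S_kv, Wq, Wk, Wv):
    """Cross-attention: queries from one sequence, keys/values from the other."""
    return atten(S_q @ Wq, S_kv @ Wk, S_kv @ Wv)

rng = np.random.default_rng(2)
T, d = 6, 8
S_A, S_Ag = rng.normal(size=(T, d)), rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W_mlp = rng.normal(size=(d, d))

# Fuse the local view with the global one: cross-attention, residual, then an MLP.
fused_A = np.tanh((cross_attention(S_A, S_Ag, Wq, Wk, Wv) + S_A) @ W_mlp)
assert fused_A.shape == (T, d)
```

The residual connection keeps the domain-specific signal intact while the cross-attention injects the cross-domain view.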

3.2.3 Group-prototype Attention. Although we cannot leverage overlapped user IDs across domains, there often exist user groups with similar preferences. Specifically, we first pool each sequence to obtain its relevance to each group. Then we leverage multiple group prototypes to aggregate the item groups and weigh them based on their relevance.
Group Interest Pooling. The item sequence of a user does not actually belong to only one group prototype. Instead, it can be a hybrid combination of several prototypes with different weights. For example, a user can be both an adolescent and a basketball lover at the same time. Thus, we propose a learnable soft cluster assignment matrix [32,40] to calculate the importance of the n_g groups. Specifically, the item sequence of each user is first pooled by a pooling matrix W_p ∈ R^{1×T}, based on which the relevance of the user to each group can be calculated as:

C^A = Softmax(G (W_p S^A)^⊤), (10)

where C^A and C^B ∈ R^{n_g×1} are relevance scores for each group and G denotes the group-prototype embeddings introduced below.
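A toy numpy sketch of soft group assignment, under the assumption that the sequence is pooled by a learnable 1×T matrix and relevance is a softmax over prototype affinities (the shapes and the pooling form are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
T, d, n_g = 6, 8, 4                  # sequence length, dim, number of prototypes
S_A = rng.normal(size=(T, d))        # encoded sequence of one user in domain A
G = rng.normal(size=(n_g, d))        # shared group-prototype embeddings
W_p = rng.normal(size=(1, T))        # learnable pooling matrix (assumed shape)

pooled = (W_p @ S_A).ravel()         # pool the sequence into a single vector
C_A = softmax(G @ pooled)            # soft relevance of this user to each group
assert C_A.shape == (n_g,) and np.isclose(C_A.sum(), 1.0)
```

Because the assignment is soft, a user can belong to several groups at once, matching the "adolescent and basketball lover" example above.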
Group Interest Aggregation. We then create n_g group-prototype embeddings G ∈ R^{n_g×d} to represent the interest groups. These embeddings can be transformed to each domain, aggregating the typically related items as:

G^A = CA(G, S^A); G^B = CA(G, S^B), (11)

where G^A and G^B ∈ R^{n_g×d} are the obtained group-prototype representations for the sequences of domain A and domain B, respectively. Here the CA layer is similar to Eqn. (8). We can then weigh all group-prototype representations based on the relevance scores as:

G^{Au} = (C^A)^⊤ G^A; G^{Bu} = (C^B)^⊤ G^B, (12)

where G^{Au} and G^{Bu} are the weighted group-prototype representations for each user.
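The two aggregation steps can be sketched as follows (a hedged illustration: the attention here omits the learned projections of the full CA layer, and the uniform relevance scores are placeholders):

```python
import numpy as np

def atten(Q, K, V):
    """Scaled dot-product attention without learned projections (simplified)."""
    w = np.exp(Q @ K.T / np.sqrt(Q.shape[-1]))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(4)
T, d, n_g = 6, 8, 4
S_A = rng.normal(size=(T, d))        # encoded sequence in domain A
G = rng.normal(size=(n_g, d))        # shared group prototypes
C_A = np.full(n_g, 1.0 / n_g)        # relevance scores from group interest pooling

# Prototypes query the sequence to aggregate their typically related items,
G_A = atten(G, S_A, S_A)             # (n_g, d) per-domain prototype representations
# then the prototypes are weighted by the user's relevance to each group.
G_Au = C_A @ G_A                     # (d,) per-user weighted group representation
assert G_A.shape == (n_g, d) and G_Au.shape == (d,)
```

Since G is shared across domains while S^A and S^B are domain-specific, the prototypes act as the bridge that replaces overlapped user IDs.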
Group-prototype Disentanglement. Each group prototype should obviously be distinct, according to its definition. Therefore, inspired by advances in disentangled representation learning [27], we propose a prototype disentanglement regularization L_d, which penalizes the similarity between every pair of group prototypes, where λ_d is the penalty hyper-parameter. This loss function will be jointly learned with the main loss function later.
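One concrete instantiation of such a regularizer is sketched below; the exact functional form (squared cosine similarity between prototype pairs) is an assumption, since the text only states that prototypes should stay mutually distinct:

```python
import numpy as np

def prototype_disentangle_loss(G, penalty=0.1):
    """Penalize pairwise similarity between group prototypes so each stays distinct.
    The squared-cosine form is an assumption, not necessarily the paper's choice."""
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)   # unit-normalize each prototype
    sim = Gn @ Gn.T                                     # cosine similarity matrix
    off_diag = sim - np.eye(len(G))                     # ignore self-similarity
    return penalty * np.square(off_diag).sum()

rng = np.random.default_rng(5)
G = rng.normal(size=(4, 8))
loss = prototype_disentangle_loss(G)
assert loss >= 0.0
# Mutually orthogonal prototypes incur zero penalty.
assert prototype_disentangle_loss(np.eye(4)) == 0.0
```

Minimizing this term pushes the prototypes toward orthogonality, preventing several prototypes from collapsing into the same interest group.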

Local/Global Prediction Layer
In this section, we first evolve the local and global interests via the corresponding prediction layers. Then we optimize them with the objective function for each domain.

Local and Global Prediction
where MLP^g is the MLP layer with shared parameters across domains, and the concatenated embeddings are obtained via average pooling before being fed into the MLPs.
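A minimal sketch of this prediction step, assuming the outputs of the three attention modules are average-pooled, concatenated, and passed through a two-layer MLP with a sigmoid output (all names, layer sizes, and activations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
T, d = 6, 8
# Stand-ins for the outputs of the three attention modules for one user.
parts = [rng.normal(size=(T, d)) for _ in range(3)]

# Average-pool each component over the sequence, then concatenate.
z = np.concatenate([p.mean(axis=0) for p in parts])   # (3d,)

def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP with sigmoid output; the global branch would share W1..b2."""
    h = np.maximum(x @ W1 + b1, 0.0)                  # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

W1, b1 = rng.normal(size=(3 * d, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
y_hat = mlp(z, W1, b1, W2, b2)[0]                     # predicted click probability
assert 0.0 < y_hat < 1.0
```

Sharing the global MLP's parameters across domains is what lets the cross-domain interest branch transfer without overlapped users.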

3.3.2 Objective Function with Independent Updating. We then exploit the negative log-likelihood function [2,44,45] for optimization, which can be formulated as:

L_A = − Σ_{t∈R^A} [y_{A,t} log ŷ_{A,t} + (1 − y_{A,t}) log(1 − ŷ_{A,t})], (16)

and L_B analogously as Eqn. (17), where R^A and R^B are the training sets of domain A and domain B, respectively. Here y_{A,t} = 1 (y_{B,t} = 1) and y_{A,t} = 0 (y_{B,t} = 0) indicate a positive sample and a negative sample, respectively, and ŷ_{A,t} and ŷ_{B,t} stand for the predicted click probability of the next item.
To optimize jointly across the two domains, the final objective function is a linear combination of L_d, L_A and L_B, calculated in Eqn. (13), Eqn. (16) and Eqn. (17), respectively:

L = L_A + λ_A∥Θ_A∥₂² + L_B + λ_B∥Θ_B∥₂² + L_d, (18)

where Θ_A and Θ_B are the sets of learnable parameters, with λ_A and λ_B as the regularization penalty hyper-parameters of domain A and domain B, respectively.
Discussion. Different from existing works on cross-domain sequential recommendation such as PiNet [28] and DASL [18] that are based on bridge users, our proposed MAN model does not rigidly require item sequences from both domains as input at the same time, since the model's output for each domain does not require the input of the other domain. That is to say, each domain in our model can update its parameters independently. If there is no input from some domain, we can simply remove the optimization goal of that domain; e.g., the loss function of Eqn. (18) simplifies to L = L_B + λ_B∥Θ_B∥₂² + L_d if there is no input from domain A. In real-world online recommendation, our MAN is more practical since the newly collected data from two domains are rarely synchronous (our MAN can be optimized iteratively for each domain).
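The independent-updating property can be sketched as follows: either domain's loss term is simply dropped when that domain has no new data (function and parameter names are illustrative assumptions):

```python
import numpy as np

def joint_loss(L_A, L_B, L_d, theta_A, theta_B, lam_A=1e-5, lam_B=1e-5):
    """Linear combination of the domain losses, the disentanglement loss,
    and per-domain L2 regularization; pass None for a domain with no input."""
    reg_A = lam_A * sum(np.square(p).sum() for p in theta_A)
    reg_B = lam_B * sum(np.square(p).sum() for p in theta_B)
    total = L_d + reg_A + reg_B
    if L_A is not None:
        total += L_A
    if L_B is not None:
        total += L_B
    return total

theta = [np.ones((2, 2))]
full = joint_loss(0.5, 0.7, 0.1, theta, theta)
no_A = joint_loss(None, 0.7, 0.1, theta, theta)  # no input from domain A this step
assert np.isclose(full - no_A, 0.5)              # exactly L_A is removed
```

This is what allows asynchronous per-domain updates in an online setting.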

EXPERIMENTS
In this section, we conduct extensive experiments on two real-world datasets, investigating the following research questions (RQs).
• RQ1: How does the proposed method perform compared with state-of-the-art single-domain and cross-domain recommenders?
• RQ2: What is the effect of different components in the method?
• RQ3: Is the proposed method model-agnostic? What about the performance on different backbones? Is the method still effective with solely the local or global module?
• RQ4: How do the group prototypes represent different groups?
We also study RQ5, "What is the optimal number of group prototypes?", in Appendix A.2. Dataset statistics are shown in Table 1, and Appendix A.4 illustrates the details of these two datasets.

Baselines and Evaluation Metrics.
To demonstrate the effectiveness of our model, we compare it with two categories of competitive baselines: single-domain models and cross-domain models. Specifically, the single-domain models are DIN [45], Caser [35], GRU4REC [10], DIEN [44], SASRec [14], SLi-Rec [41] and SURGE [2]. These single-domain models are trained on each domain independently, following existing work [18,28]. Besides, the cross-domain models are NATR [5], PiNet [28] and DASL [18]. PiNet and DASL are adapted to our setting without fully overlapped users (with the item sequence of the other domain left empty). Other cross-domain models like MiNet [29] and CoNet [12] are not included in the experiments because they are non-sequential models and perform much worse than sequential models [28].
All models are evaluated on two popular accuracy metrics, AUC and GAUC [8], as well as two ranking metrics, MRR and NDCG [2].

4.1.3 Hyper-parameter Settings. The initial learning rate for Adam [15] is 0.001, with Xavier initialization [7] used to initialize the parameters. Regularization coefficients are searched in {1e-7, 1e-5, 1e-3}. The batch size is set as 200 and 20 for the Micro Video dataset and Amazon dataset, respectively. The embedding sizes of all models are fixed at 40 and 20 for the Micro Video dataset and Amazon dataset, respectively. MLPs with layer sizes [100, 64] and [20, 10] are exploited for the prediction layer on the Micro Video dataset and Amazon dataset, respectively. The item sequence length is set as 250 for the Micro Video dataset and 20 for the Amazon dataset.

Overall Performance (RQ1)
The performance comparisons over all models are shown in Table 2, where SASRec and SURGE, with better performance, are leveraged as the backbones on these two datasets, respectively. It can be observed that:
• Our approach performs best. Our model MAN significantly outperforms all baselines under all metrics. Specifically, our model improves AUC over all baselines by 4.10% and 3.85% on Micro Video A and Micro Video B, respectively, and by 8.25% and 2.07% on Amazon Video Games and Amazon Toys, respectively. In general, the improvement is more consistent across evaluation metrics on the Micro Video dataset, which has more overlapped users. The Amazon dataset, with extremely sparse data, sees the highest improvement (8.25%), which verifies that our approach can address the sparse-data problem, promoting the sequential learning of both domains simultaneously and that of less-interacted domains even more sharply.
• Existing cross-domain sequential recommenders rely heavily on overlapped users or items. PiNet and DASL are designed for fully-overlapped-user datasets [18,28], but they are merely comparable with GRU4REC on datasets without fully overlapped users, sometimes outperforming and sometimes even underperforming it. In contrast, our proposed approach outperforms all baselines and improves the backbones significantly, which illustrates the effectiveness of our cross-domain modeling without user overlapping.
Though NATR achieves decent performance on the Micro Video dataset with a lot of overlapped items, it fails to achieve effective cross-domain modeling on the Amazon dataset with limited overlapped items.
• Sequential recommenders are effective but hit a data sparsity bottleneck. On the Micro Video dataset, comparing the sequential models (i.e., Caser, GRU4REC, DIEN, SASRec, SLi-Rec, and SURGE) with the non-sequential model (i.e., DIN) shows that it is necessary to model the chronological relationship between items. Besides, SASRec and SURGE are comparable and outperform all other single-domain sequential models, which illustrates the capacity of self-attention to handle long-term information and verifies the effectiveness of compressing information with metric learning. This observation is consistent with the experimental results of the SURGE [2] paper. On the Amazon dataset, DIN even outperforms some sequential models, e.g., SASRec, which also drops a lot in such a short-sequence scene. Though sequential models have the potential to capture the chronological relationship between items, they are blocked by data sparsity. Besides, we have also attempted to train them on two domains simultaneously ("Shared" models in the backbone study), but the results show that one domain's optimization has a negative impact on the other domain, leading to optimization conflict. Thus it is necessary to design cross-domain modeling to avoid optimization conflict and negative transfer.

Impact of Each Component (RQ2)
To study the impact of our proposed components, we compare our model against variants detaching the Item Similarity Attention (ISA) module, Sequence-fusion Attention (SFA) module, and Group-prototype Attention (GPA) module on two datasets under four evaluation metrics, as shown in Table 3. Firstly, it can be observed that the shared group prototypes of GPA are most effective on both the Micro Video and Amazon datasets, illustrating that there are similar interest groups across different domains. Besides, the performance also drops a bit when removing the sequence-fusion component. In short, Group-prototype Attention is the most important among the three proposed attentions.

Backbone Study (RQ3)
Here, we study whether GRU4REC, SASRec, and SURGE can be boosted by our proposed method, that is, whether our proposed method is model-agnostic. We choose these three models because they perform better on the experimented datasets. Figure 3 shows the results of our method with different backbones on two datasets under AUC evaluation, where we can observe that:
• Our proposed method is model-agnostic. The selected backbones are all boosted by our proposed MAN, which means our proposed method is model-agnostic. The backbones selected here are RNN-based, attention-based, and even graph-based models.
Thus our method can be applied to various state-of-the-art sequential recommendation models to boost their performance.
• Our proposed method performs better on larger datasets. The improvement on the Micro Video dataset is generally more obvious than that on the Amazon dataset. This is because a larger dataset can provide richer cross-domain information.

User Group Visualization (RQ4)
In this section, the embeddings of all users' pooled group representations are visualized to show the patterns our group-prototype attention module has captured. The pooled group representation (calculated in Eqn. (12)) for each user is visualized with K-Means and t-SNE, as shown in Figure 4. More specifically, we first apply K-Means on the pooled group representations to cluster the data into n_g groups. Then t-SNE is exploited to reduce the group representations into two-dimensional space, and the clusters found by K-Means are used to label each user. It can be observed that: (1) for each dataset, the group patterns vary across domains: the users under Micro Video A are distributed evenly, while the users under Micro Video B mostly belong to groups 0, 1, and 6; on the Amazon dataset, the users mostly belong to group 2 and group 0 under Video Games and Toys, respectively; (2) across the two datasets, the users on the Amazon dataset are distributed in a more unbalanced and dispersed way than those on the Micro Video dataset, which may be because the Amazon dataset is sparser.
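The clustering step of this analysis can be sketched with a minimal K-Means on synthetic data (a deterministic farthest-point-initialized stand-in for the scikit-learn implementation; t-SNE is omitted, and the two Gaussian "user" clusters are illustrative):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal K-Means; farthest-point initialization keeps this toy run deterministic."""
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])      # pick the point farthest from all centers
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(7)
# Two well-separated synthetic clusters stand in for pooled group representations.
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels = kmeans(X, k=2)
assert labels[0] != labels[20]               # the two clusters get different labels
```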

RELATED WORK
There are two fields of work related to our proposed model: sequential recommendation and cross-domain recommendation.
Sequential Recommendation. Sequential recommendation [37] forms the foundation of our work; it models the user's historical behaviors as a sequence of time-aware items, aiming to predict the probability of the next item. Initially, the Markov chain was exploited to model the sequential pattern of the item sequence, as in FPMC [33]. To further extract the high-order interactions between historical items, researchers have also applied deep learning models such as recurrent neural networks [4,11], convolutional neural networks [17], and attention networks [36] in recommender systems [10,14,35,44,45]. However, RNN-based and CNN-based methods often attend only to the recent items before the next item, failing to model long-term interest. Recently, researchers have also combined sequential recommendation models with traditional recommendation models such as matrix factorization [16] to model long- and short-term interest [41,43], while SURGE [2] exploits metric learning to compress the item sequence. Some recent works like DFAR [22] and DCN [21] focus on capturing more complex relations behind sequential recommendation. In this paper, we perform cross-domain learning on top of sequential recommendation models to achieve knowledge transfer between different domains.

Cross-Domain Recommendation. Cross-domain recommender systems [1] are an effective solution to the highly sparse data and cold-start problems that sequential recommendation meets. Early cross-domain recommendation models are based on single-domain recommendation, assuming that auxiliary user behaviors from different domains will benefit the target domain's user modeling [13,25,34]. Indeed, the most popular approaches are often based on transfer learning [30], which transfers user or item embeddings from the source domain to improve the target domain's modeling, including MiNet [29], CoNet [12], and itemCST [31], etc.
However, industrial platforms tend to improve all domains of their products simultaneously instead of improving the target domain without consideration of the source domain. Thus, dual learning [9,24], which can achieve simultaneous improvements in both the source and target domains, has grabbed researchers' attention and has already been applied in cross-domain recommender systems [19,46]. Moreover, to enhance recommendation performance across all domains simultaneously, researchers have proposed dual-target approaches focusing on sequential modeling [3,18,28], which promisingly address the sparse-data and cold-start problems while considering the performance of both the source and target domains. Specifically, PiNet [28] tackles the shared-account problem and transfers account information from one domain to another domain where the account also has historical behaviors; DASL [18] proposes dual embedding to interact embeddings and dual attention to mix the sequential patterns of the same users across two domains. Besides PiNet and DASL, DAT-MDI [3] applies dual attention like DASL to session-based recommendation without relying on user overlapping. However, requiring item sequence pairs from two domains as input is unreasonable, because the item sequences of two domains are often independent of each other despite belonging to the same user. Hence such a dual-attention manner of mixing the sequence embeddings of two domains will, theoretically speaking, not yield promising performance in a non-overlapped-user scene. Though NATR [5] avoids user overlapping, it is a non-sequential and single-target model.
In this paper, we perform cross-domain learning in a dual-target manner to achieve simultaneous improvements across different domains without any prior assumption of overlapped users or items.

CONCLUSIONS AND FUTURE WORK
In this work, we studied sequential recommendation in a cross-domain manner from a more practical perspective, without any prior assumption of overlapped users. Such exploration brought three key challenges at the item, sequence, and group levels. To address them, we proposed a novel solution named MAN with local and global modules, mixing three attention networks and transferring at the group level. The first component was the local/global encoding layer, which captures sequential patterns from the domain-specific and cross-domain perspectives. Secondly, we proposed the item similarity attention, which captures the similarity between the local/global item embeddings and the target item embedding; the sequence-fusion attention, which fuses sequential patterns across the global and local encoders; and the group-prototype attention, which uses several group prototypes to share sequential user behaviors implicitly without leveraging user IDs. Finally, we proposed a local/global prediction layer to evolve the domain-specific and cross-domain interests.
As for future work, we plan to conduct online A/B tests to further evaluate our proposed solution's recommendation performance in real-world products. We also consider applying MAN to more advanced sequential backbones, even from other fields, to explore the generalization of our proposed modules.

A APPENDIX FOR REPRODUCIBILITY A.1 Notation
We present all used symbols in Table 4 for clearer understanding.

A.2 Number of Group Prototypes (RQ5)
We vary the number of groups over {1, 5, 10, 20}, as shown in Figure 5, where AUC is tested to explore the best number of groups. From Figure 5, we can observe that: (1) for the Micro Video dataset, AUC peaks when the number of groups is 10 in domain A and 20 in domain B, while the mean AUC over the two domains is best at 10; (2) for the Amazon dataset, AUC is best with 5 groups for both domains.

A.3 Implementation
All the models are implemented in Python with a TensorFlow-based framework from Microsoft. Besides, we also use Python to perform K-Means and t-SNE on the group representation of each user. The code for our model and visualization is available on GitHub together with the processed Amazon dataset. The K-Means and t-SNE visualization code and the embedding files to be visualized are under the directory "MAN/Code-visualization". Note that we will release the Micro Video dataset to benefit the community in the future.
Each item embedding is concatenated with a domain embedding according to the specific domain of the input items. To avoid distorting the local and global sequential learning, we also stop the back-propagation of S^A, S^B and S^{Ag}, S^{Bg} in the Sequence-fusion Attention module of Section 3.2.2, which our early attempts verified to be more effective. Besides, the back-propagation of S^A and S^B is also stopped in the Group-prototype Attention module of Section 3.2.3.
The MLP for the SASRec backbone is an MLP layer sandwiched between two normalization layers. For the SURGE [2] backbone, we use the same inputs as its paper for the local prediction layer and global prediction layer, respectively. Besides, we concatenate the outputs of our item similarity attention, sequence-fusion attention, and group-prototype attention modules to the input of the local prediction layer.

Single-domain Models
• DIN [45]: It represents the user by the aggregation of the historical items based on the attention weights calculated via querying the target item with the historical items.

Cross-domain Models
• NATR [5]: It relies on the overlapped items and performs linear transformation to transfer the item representation from the source domain to improve the performance in the target domain.
• PiNet [28]: It represents the user by a shared account filter unit, transfers user information via a cross-domain transfer unit, and encodes the sequence with a GRU.
• DASL [18]: It is the state-of-the-art cross-domain sequential model, proposing dual embedding to represent the cross-domain user and dual attention to model the cross-domain sequential pattern.

A.4 Datasets and Evaluation Metrics
The public Amazon dataset is available online, and we have also uploaded the filtered dataset after the 10-core setting to the GitHub repository of the code and the supplementary material. The statistics of our adopted Micro Video dataset and Amazon dataset before filtering are shown in Table 1.
• Micro Video. This dataset contains two domains, Micro Video A and B, collected from one of the largest micro-video platforms in China, where users can share their videos. User behaviors such as click, like, follow (subscribe), and forward are recorded in the dataset. We downsample the logs from September 11 to September 22, 2021, and filter out inactive users and videos via the 10-core setting [2]. We split the behaviors before 12 pm on the last day and after 12 pm on the last day, respectively, as the validation set and test set. Other behaviors are used for training.
• Amazon. This highly sparse dataset with two domains is adopted by the existing cross-domain sequential recommendation work DASL [18], with few overlapped items but some overlapped users. We treat all rating records as implicit feedback, also with the 10-core setting. The dataset includes records from May 1996 to July 2014. We split the behaviors before June of the last year and after June of the last year, respectively, as the validation set and test set. Other behaviors are used for training.
The descriptions of our adopted metrics are as follows:
• AUC calculates the probability that a predicted positive target item's score ranks higher than a predicted negative item's score, evaluating the model's classification accuracy.
• GAUC is a weighted average of each user's AUC, where the weight is his/her click number. It evaluates the model performance in a more bias-aware and fine-grained manner.
• MRR is the mean reciprocal rank, which averages the inverse ranking of the first hit item.
• NDCG@K favors items at higher positions among the recommended K items, so test items that rank higher yield better evaluation scores. In our experiments, K is set to 10, a popular setting in related work [14].
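Three of these metrics can be sketched directly in numpy (single relevant item per list assumed for MRR and NDCG, which matches the next-item setting):

```python
import numpy as np

def auc(y_true, y_score):
    """Probability that a random positive ranks above a random negative (ties count half)."""
    pos = y_score[y_true == 1][:, None]
    neg = y_score[y_true == 0][None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def mrr(ranks):
    """Mean reciprocal rank of the first hit item (1-indexed ranks)."""
    return float(np.mean(1.0 / np.asarray(ranks, dtype=float)))

def ndcg_at_k(ranks, k=10):
    """NDCG@K with one relevant item per list, so the ideal DCG is 1."""
    r = np.asarray(ranks, dtype=float)
    return float(np.where(r <= k, 1.0 / np.log2(r + 1), 0.0).mean())

y = np.array([1, 1, 0, 0])
s = np.array([0.9, 0.4, 0.6, 0.1])
assert np.isclose(auc(y, s), 0.75)      # 3 of the 4 positive/negative pairs ordered correctly
assert np.isclose(mrr([1, 2]), 0.75)    # (1/1 + 1/2) / 2
assert np.isclose(ndcg_at_k([1]), 1.0)  # a hit at rank 1 is ideal
```

GAUC would additionally group the (y, s) pairs per user and average the per-user AUCs weighted by click counts.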

A.5 Parameter Settings
All models are trained with early stopping with a patience of 2 steps. MLPs with layer sizes [80, 40] and [32, 16], activated by ReLU (Rectified Linear Unit), are exploited for the Item Similarity Attention module on the Micro Video dataset and the Amazon dataset, respectively.
For the Micro Video dataset and the Amazon dataset, the dimensions of the domain embeddings are set to 8 and 4, while those of the item embeddings are set to 32 and 16, respectively. For the SURGE backbone, we set the parameters following its paper. For the comparison methods PiNet 7 and DASL 8 , we implement them under our framework based on the source code provided by the authors, which can be found via the footnotes. For the DASL baseline, we do not pre-train the model as in its paper, for a fair comparison; note that when we directly execute the authors' provided code, we obtain poorer performance on the Amazon dataset than the results reported in their paper.
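As a concrete illustration of these settings, the ReLU-activated MLP with the reported layer sizes can be sketched in plain NumPy (a minimal stand-in with hypothetical initialization, not the authors' implementation):

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)

class MLP:
    """ReLU-activated MLP, e.g. layer sizes [80, 40] on Micro Video
    and [32, 16] on Amazon, as reported in the parameter settings."""
    def __init__(self, in_dim, layer_sizes, seed=0):
        rng = np.random.default_rng(seed)
        dims = [in_dim] + list(layer_sizes)
        self.weights = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(dims, dims[1:])]
        self.biases = [np.zeros(b) for b in dims[1:]]

    def __call__(self, x):
        for w, b in zip(self.weights, self.biases):
            x = relu(x @ w + b)
        return x
```

For example, with item embeddings of dimension 32 on Micro Video, `MLP(32, [80, 40])` maps a 32-dimensional input to a 40-dimensional ReLU-activated output.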

Figure 1: Illustration of (a) user transfer learning relies on overlapped users and (b) group transfer learning without previous assumptions on user overlap.

Figure 2 illustrates our proposed MAN model, which encodes the item sequences with the local/global encoding layer, mixes three attention modules, and evolves the interests with the local/global prediction layer.

Figure 2: Illustration of our proposed MAN model. (1) The item sequences are first input into the Local/Global Encoding Layer, which builds local and global embeddings for each item and encodes them to extract the local and global sequential patterns. (2) In the Mixed Attention Layer, Item Similarity Attention is fed with the local and global item embeddings to capture the item-level relation; Sequence-fusion Attention fuses the encoded local and global sequential representations to capture the sequence-level relation; Group-prototype Attention leverages the shared group prototypes to capture the group-level relation. Here we take domain A to illustrate each proposed attention component in detail. (3) The aggregated embeddings are fed into the local prediction layer and the global prediction layer, respectively, for the final prediction.

3.1.1 Local and Global Item Embeddings. To capture the domain-specific patterns for different domains, we create two item embedding matrices M^A ∈ R^{|I^A|×d} and M^B ∈ R^{|I^B|×d}, where d denotes the embedding size.
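The local and global embedding lookups can be sketched as follows (an illustrative NumPy sketch; the offset-based mapping of domain-local item ids into a shared global table is an assumption for this toy example, not necessarily how the authors index the union of the item sets):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items_A, n_items_B, d, d_g = 100, 80, 32, 32

# Local embedding tables, one per domain (M^A and M^B in the paper's notation).
M_A = rng.normal(0.0, 0.1, (n_items_A, d))
M_B = rng.normal(0.0, 0.1, (n_items_B, d))
# Global table shared by both domains, indexed over the union of the item sets.
M_g = rng.normal(0.0, 0.1, (n_items_A + n_items_B, d_g))

def embed_sequence(item_ids, local_table, global_offset, global_table):
    """Look up local and global embeddings for one item sequence.
    `global_offset` shifts domain-local ids into the shared index space
    (0 for domain A, n_items_A for domain B, under this toy indexing)."""
    ids = np.asarray(item_ids)
    return local_table[ids], global_table[ids + global_offset]
```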

E^A and E^B (E^{gA} and E^{gB}) denote the local (global) embeddings for domain A and domain B, respectively, where the superscript g means global. Here P^A, P^B ∈ R^{n×d} and P^g ∈ R^{n×d′} are the learnable positional embeddings.

3.1.2 Local Encoder and Global Encoder of Sequences. After obtaining E^A, E^B, E^{gA}, and E^{gB} from the embedding layers, we then apply sequential encoders [2, 44, 45] to learn the sequential patterns, i.e., the proposed local encoder and global encoder.

With the proposed mixed attention network (item similarity attention, sequence-fusion attention, and group-prototype attention), we concatenate the outputs together and feed them into the proposed local prediction layer and global prediction layer based on MLPs [2, 44, 45], which can be formulated for domain A as

ŷ^{A,l} = MLP(e^A ∥ s^A ∥ g^A ∥ s^{gA} ∥ M^A_v),  ŷ^{A,g} = MLP(e^{gA} ∥ s^{gA} ∥ g^A ∥ s^A ∥ M^g_v),

where ∥ denotes concatenation.
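The concatenate-then-score prediction step can be sketched as follows (the argument names and concatenation order are assumptions reconstructed from the prediction-layer description, and the toy linear head stands in for the actual MLP):

```python
import numpy as np

def predict_score(e_t, s_local, g, s_global, m_target, head):
    """Concatenate the encoded embedding, local sequence representation,
    group representation, global sequence representation, and target-item
    embedding, then score the result with a prediction head (an MLP in MAN)."""
    x = np.concatenate([e_t, s_local, g, s_global, m_target])
    return head(x)

d = 8
w = np.ones(5 * d) / (5 * d)  # toy linear head standing in for the MLP
score = predict_score(*[np.ones(d)] * 5, head=lambda x: float(x @ w))
```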

Table 1: Data statistics for the two datasets. Here Avg. Length is the average number of items in a user's interaction history.
4.1.1 Datasets. We evaluate the recommendation performance on an industrial Micro Video dataset and a public e-commerce dataset. The statistics of the datasets for our experiments are shown in Table 1.

Table 2: Performance comparisons for MAN on the Micro Video dataset and the Amazon dataset.

Table 3: AUC performance of MAN with different backbones on the Micro Video dataset and the Amazon dataset. Here "Single" means the backbone model trained with single-domain data, which refers to the local module; "Shared" means a shared backbone model trained with cross-domain data, which refers to the global module; "Cross" is the backbone equipped with our method.

Ablation study of the proposed components on the Micro Video dataset and the Amazon dataset.
Figure 4: K-Means and t-SNE visualization of pooled group representations on the Micro Video dataset and the Amazon dataset, with different colors representing different groups. Group patterns across the domains of the two datasets are captured by the different distributions of group representations. (Best viewed in color.)

The performance drops after detaching the sequence-fusion attention module (most effective in Micro Video B), i.e., SFA for fusing the local and global sequential patterns, which means there are truly common sequential patterns across different domains. There are also similar items across different domains, since the performance decreases after detaching the item similarity attention module (most effective in Amazon Toys).

Table 4: Notation table of important symbols.

v^A_t, v^B_t ∈ I^A, I^B: the t-th item clicked by the given user in domain A and domain B
R^A, R^B: training sets of domain A and domain B
M^A ∈ R^{|I^A|×d}: item embedding matrix for domain A
M^B ∈ R^{|I^B|×d}: item embedding matrix for domain B
M^g ∈ R^{|I^A ∪ I^B|×d′}: item embedding matrix for both domains
P^A, P^B ∈ R^{n×d}: positional embedding matrices for domain A and domain B
P^g ∈ R^{n×d′}: positional embedding matrix for both domains
G ∈ R^{m×d}: group prototype embedding matrix, where m is the number of group prototypes

• Caser [35]: It performs convolution filters on the historical item embeddings to capture the sequential pattern.
• GRU4REC [10]: It models the session sequence and represents the user preference by the final state of a GRU [4].
• DIEN [44]: It proposes an interest-extraction GRU layer and an interest-evolution GRU layer to capture the sequential pattern.

Table 5: Data statistics for the Micro Video dataset and the Amazon dataset before being filtered by the 10-core setting. The detailed descriptions of them are as below.