Compressed Interaction Graph based Framework for Multi-behavior Recommendation

Multiple types of user behavior data (e.g., clicking, adding to cart, and purchasing) are recorded in most real-world recommendation scenarios, and they can help to learn users' multi-faceted preferences. However, it is challenging to exploit multi-behavior data due to the unbalanced data distribution and sparse target behavior, which lead to inadequate modeling of high-order relations when treating multi-behavior data "as features" and to gradient conflict in multi-task learning when treating multi-behavior data "as labels". In this paper, we propose CIGF, a Compressed Interaction Graph based Framework, to overcome the above limitations. Specifically, we design a novel Compressed Interaction Graph Convolution Network (CIGCN) to model instance-level high-order relations explicitly. To alleviate the potential gradient conflict when treating multi-behavior data "as labels", we propose a Multi-Expert with Separate Input (MESI) network on top of CIGCN for multi-task learning. Comprehensive experiments on three large-scale real-world datasets demonstrate the superiority of CIGF. Ablation studies and in-depth analysis further validate the effectiveness of the proposed model in capturing high-order relations and alleviating gradient conflict. The source code and datasets are available at https://github.com/MC-CV/CIGF.


INTRODUCTION
Recommender systems (RS) serve as an important tool to meet personalized information needs. To predict users' preferences for items, various methods have been devoted to Collaborative Filtering (CF) [29] techniques, which learn user and item representations from their historical interactions and then make predictions based on these representations. Most CF methods [14,15,26,34,38] are designed for a single type of behavior and rarely consider users' multi-faceted preferences, which widely exist in real-world web applications. Take the example of an e-commerce website, as shown in Figure 1(a). Users interact with items through different behaviors, such as viewing, adding to cart, tagging as favorites, and purchasing. Since different types of behaviors exhibit different interactive patterns arising from users' diverse interests, it is of great importance to explicitly leverage multi-behavior data for recommendation. NMTR [10], DIPN [12], and MATN [36] regard multiple behaviors as different types and employ a neural collaborative filtering unit, an attention operator, and a transformer architecture to model their dependencies, which perform much better than treating all behaviors as the same type. Multi-behavior data can be regarded as a multiplex bipartite graph (MBG), as shown in Figure 1(b). Recently, thanks to their capacity to represent relational information and model high-order relations that carry collaborative signals among users and items, graph neural network (GNN) based models [7,14,25,34] have become popular for recommendation. For example, MBGCN [19], GHCF [6], and MB-GMN [37] further empower GNNs with multi-graph, non-sampling, and meta network techniques to capture high-order collaborative signals on multiplex bipartite graphs.
Multi-behavior data can be treated "as features" for multi-behavior relation learning or "as labels" for multi-task supervised learning. Despite years of research, two challenges remain:

• Unbalanced Data Distribution. As we can see from Figure 2, observed interactions are highly unbalanced across different users and different behaviors, where a small percentage of users and behaviors cover most of the interactions.
• Sparse Target Behavior (the behavior to be predicted, e.g., purchase in e-commerce). We can also find that most users have fewer than 10 purchase records, which is extremely sparse compared with the whole item space of thousands to millions of items.
We dig into these challenges and observe the following limitations:

• Inadequate modeling of high-order relations when treating multi-behavior data "as features". User-item relations are meaningful for revealing the underlying reasons that motivate users' preferences on items. For example, as shown in Figure 1(a), there are several third-order relations between $u_1$ and $i_4$ (e.g., paths through user $u_3$). With the help of the collaborative effect, we predict that $u_1$ is likely to purchase $i_4$, as $u_3$, a user similar to $u_1$, has purchased $i_4$ before. Existing methods like MBGCN, GHCF, and MB-GMN have attempted to employ GNNs to incorporate high-order relations. However, they use a two-stage paradigm, which first learns a representation for each behavior by considering all historical records belonging to this behavior, and then leverages the learned representations to model high-order relations across different behaviors. We argue that this relation-modeling manner is behavior-level, as depicted in Figure 1(c). Due to the unbalanced data distribution, the learned relations are easily biased toward high-degree users and behaviors, making the learned representations unable to effectively capture high-order relations.
• Potential gradient conflict when treating multi-behavior data "as labels". Early works like MBGCN [19] and MATN [36] only use the target behavior as labels to train the model, which is vulnerable to the sparsity problem due to the sparse target behavior. To alleviate this problem, it is promising to use auxiliary behaviors as labels with multi-task learning (MTL) techniques. However, it is not easy to train with multiple objectives due to the negative transfer [32] phenomenon, i.e., the performance deterioration that occurs when knowledge is transferred across different tasks. Therefore, it is risky to treat multi-behavior data "as labels". Several recent works like NMTR [10], GHCF [6], and MB-GMN [37] have investigated MTL in multi-behavior recommendation. As they use the same input for all tasks, these methods might suffer from gradient conflict due to the coupled gradient issue. Detailed explanations are presented in Section 3.3.
To tackle the above limitations, we propose a novel Compressed Interaction Graph based Framework (CIGF) for better representation learning of users and items. To handle the inadequate modeling of high-order relations when treating multi-behavior data "as features", we design a Compressed Interaction Graph Convolution Network (CIGCN) to model high-order relations explicitly. CIGCN first leverages matrix multiplication as the interaction operator to generate high-order interaction graphs, which encode instance-level high-order relations (including user-user, user-item, and item-item) explicitly, and then leverages a node-wise attention mechanism to select the most useful high-order interaction graphs and compress the graph space. Finally, state-of-the-art GCN models are combined with residual connections [13] on these graphs to explore high-order graph information and alleviate the over-smoothing issue for representation learning.
To alleviate the potential gradient conflict when treating multi-behavior data "as labels", we propose a Multi-Expert with Separate Input (MESI) network on top of CIGCN for MTL. The MESI network is a hierarchical neural architecture similar to MMOE [24] and PLE [31]. However, separate inputs are introduced to replace the same input used in the original MMOE and PLE models for MTL. Specifically, we use relations starting from different types of behaviors to learn the separate inputs. By explicitly using separate inputs to learn task-aware information, the potential gradient conflict caused by the same input can be alleviated when knowledge is transferred across different tasks, which makes the learning process more stable and effective. Explanations for the decoupled gradient of MESI can be found in Section 3.3.
To summarize, our work makes the following contributions:

• We look at the multi-behavior recommendation problem from a new perspective, which treats multi-behavior data "as features" and "as labels", with data analysis and theoretical support.

RELATED WORK
GNNs for Recommendation. GNN-based methods can be used for multi-behavior data by treating it "as features". Most existing GNNs are proposed for homogeneous graphs, such as NGCF [34], LR-GCCF [7], and LightGCN [14], which ignore the multiple types of edges. Recently, some researchers have focused on heterogeneous graphs and proposed methods like HGNN [39], R-GCN [27], and HGAT [23]. However, these methods merely consider behavior-level relations by utilizing behavior-level representations for relation modeling. Hyper-graph based methods [3,9] leverage hyper-graphs to model complex high-order relations. However, as an edge in a hyper-graph connects two or more nodes, hyper-graphs are not suitable for the multi-behavior case, where a node pair is connected by multiple edges. Existing meta-path based methods, like Metapath2vec [8], MCRec [16], and HAN [35], model high-order relations with manually selected meta-paths, which is limited by the need for expert knowledge and the difficulty of searching all useful meta-paths with arbitrary lengths and edge types.

MTL for Recommendation. MTL methods can be used for multi-behavior data by treating it "as labels". A widely used model is the shared-bottom structure in Figure 3(d). Though useful for knowledge sharing among multiple tasks, it still suffers from the risk of conflicts due to task differences. To handle task differences, some studies apply attention networks for information fusion. MMOE [24] in Figure 3(e) extends MOE [18] by utilizing different gating networks to obtain different fusion weights in MTL. PLE [31] in Figure 3(f) further proposes to leverage shared or task-specific experts at the bottom and then employs gating networks to combine these experts adaptively, thus handling task conflicts and alleviating the negative transfer issue. However, these methods still utilize the same input for MTL. We argue that this manner might suffer from gradient conflict due to the coupled gradient issue. Detailed explanations are presented in Section 3.3.
Multi-behavior Recommendation. Existing multi-behavior recommendation methods can be classified into two categories: graph-based and MTL-based [17]. The former category treats multi-behavior data "as features". Some early works like DIPN [12] and MATN [36] fail to capture high-order relations and thus perform poorly. Most recent works (e.g., GHCF [6] and MBGCN [19]) use a behavior-level modeling manner that cannot capture fine-grained instance-level multi-behavior relations. Some other methods like MBGCN [19] and MGNN [40] learn high-order relations from the MBG directly, which makes it difficult to mine useful relations extensively due to the unbalanced data distribution. Different from the above methods, our proposed CIGCN models high-order relations by explicit graph interaction and graph compression, and thus can learn relations at the instance level. The latter category treats multi-behavior data "as labels". NMTR [10] in Figure 3(a) assumes that users' multiple types of behaviors take place in a fixed order, which may be too strong an assumption to be appropriate for all users. GHCF in Figure 3(b) uses an architecture similar to the shared bottom for MTL; the only difference is that GHCF uses a bilinear operation (see Section 3.3) as the prediction head, while the shared bottom uses a neural network. MB-GMN [37] in Figure 3(c) further proposes a meta prediction network to capture the complex cross-type behavior dependency for MTL. These existing methods optimize multiple tasks with the same static weights for all samples. The most obvious drawback is that they can easily suffer from the risk of conflicts caused by sample differences, as different samples may pose different preferences for different tasks. In contrast, our proposed MESI network learns adaptive weights according to the nature of different samples. Besides, we utilize separate inputs to learn task-aware information to alleviate the potential gradient conflict.

PRELIMINARY

Problem Definition
In this section, we give the formal definition of the multi-behavior recommendation task. We denote the user set and item set as $\mathcal{U} = \{u_1, u_2, \dots, u_M\}$ and $\mathcal{I} = \{i_1, i_2, \dots, i_N\}$, respectively, and the user-item interaction matrices of the behaviors as $\mathcal{Y} = \{\mathbf{Y}^1, \mathbf{Y}^2, \dots, \mathbf{Y}^K\}$, where $M$, $N$, and $K$ are the numbers of users, items, and behavior types, respectively. $y^k_{u,i} = 1$ denotes that user $u$ has interacted with item $i$ under behavior $k$; otherwise $y^k_{u,i} = 0$. Generally, there is a target behavior to be optimized (e.g., purchase), which we denote as $\mathbf{Y}^K$, and the other behaviors $\mathbf{Y}^1, \mathbf{Y}^2, \dots, \mathbf{Y}^{K-1}$ (e.g., view and tag as favorite) are treated as auxiliary behaviors for assisting the prediction of the target behavior. The goal is to predict the probability that user $u$ will interact with item $i$ under the target behavior $K$.

Graph and Relation Definition
As shown in Figure 1(b), we denote the Multiplex Bipartite Graph (MBG) as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A})$, where $\mathcal{V} = \mathcal{U} \cup \mathcal{I}$ is the node set containing all users and items, and $\mathcal{E} = \cup_{k \in \mathcal{R}} \mathcal{E}_k$ is the edge set including all behavior records between users and items. Here $k$ denotes a specific type of behavior and $\mathcal{R}$ is the set of all possible behavior types. $\mathcal{A} = \cup_{k \in \mathcal{R}} \mathbf{A}_k$ is the adjacency matrix set, with $\mathbf{A}_k$ denoting the adjacency matrix of the graph of behavior $k$. A relation $P$ is defined as a path of length $l$ in the MBG $\mathcal{G}$, with $\mathcal{R}_P$ denoting the set of all edge types in this path. If $l \ge 2$ and $|\mathcal{R}_P| = 1$, we define this path as a high-order single-behavior relation. If $l \ge 2$ and $|\mathcal{R}_P| \ge 2$, we define this path as a high-order multi-behavior relation. Node $v_j$ is node $v_i$'s $l$-th order reachable neighbor if there exists a path of length $l$ connecting node $v_i$ and node $v_j$. In a newly generated graph $\mathcal{G}_l$, if any two connected nodes $v_i$ and $v_j$ are $l$-th order reachable in the original MBG $\mathcal{G}$, then $\mathcal{G}_l$ is defined as an $l$-th order graph. We will illustrate the explicit modeling of high-order multi-behavior relations through high-order graph interaction and convolution in Section 4.2.

A Coupled Gradient Issue in MTL
Most of the existing methods use the same input for MTL, as summarized in Section 2. This may cause a coupled gradient issue in MTL, which restricts their learning ability for each task. Here we use the bilinear module from GHCF [6] as an example. The bilinear module can be formulated as:

$$\hat{o}_{u,i,k} = \mathbf{r}_k \big(\mathbf{x}^*_u \odot \mathbf{y}^*_i\big),$$

where $\odot$ is the Hadamard product operation, $\hat{o}_{u,i,k}$ denotes the predictive value of the $k$-th behavior, and $\mathbf{x}^*_u$ and $\mathbf{y}^*_i$ represent the learned representations for user $u$ and item $i$. $\mathbf{r}_k \in \mathbb{R}^{1\times d}$ is a behavior-aware transformation vector, which projects the user and item representations to a separate prediction head for MTL, and $d$ denotes the embedding size. Here we use the square loss as an example for optimization:

$$\mathcal{L} = \sum_{k=1}^{K} \big(y_{u,i,k} - \hat{o}_{u,i,k}\big)^2,$$

where $y_{u,i,k}$ is the true label. Then we have:

$$\frac{\partial \mathcal{L}}{\partial (\mathbf{x}^*_u \odot \mathbf{y}^*_i)} = \sum_{k=1}^{K} \delta_{u,i,k}\, \mathbf{r}_k^\top = \sum_{k=1}^{K} \mathbf{r}'_k,$$

where $\delta_{u,i,k} = -2\big(y_{u,i,k} - \hat{o}_{u,i,k}\big)$ is a scalar and $\mathbf{r}'_k$ is the synthetic gradient from the $k$-th behavior, which determines the updating magnitude and direction of the shared input vector $\mathbf{x}^*_u \odot \mathbf{y}^*_i$. We can see that the gradients from all behaviors are coupled. Figure 4 shows an example with $K = 3$. Taking $\mathbf{r}'_1$ as a reference vector, we apply orthogonal decomposition to all the other vectors and find that their components are not in the same direction as the reference vector. This demonstrates the potential gradient conflict brought by the coupled gradient. Proofs for other methods and loss functions, and the decoupled gradient of MESI, can be found in Appendix A.2 and A.3.
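The coupled gradient can be sketched numerically. The following toy example (all values are illustrative, not learned parameters from the paper) builds K = 3 bilinear heads over one shared vector and inspects the pairwise directions of the synthetic gradients r'_k:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # embedding size
z = rng.normal(size=d)               # shared input z = x_u* ⊙ y_i*
r = rng.normal(size=(3, d))          # behavior-aware vectors r_k, K = 3
y_true = np.array([1.0, 0.0, 1.0])   # toy labels for the K behaviors

# Square loss per behavior: (y_k - r_k · z)^2
o_hat = r @ z
delta = -2.0 * (y_true - o_hat)      # scalar factor per behavior
grads = delta[:, None] * r           # synthetic gradients r'_k

# All r'_k update the SAME vector z: their sum is the applied gradient.
total_grad = grads.sum(axis=0)

# Conflict check: a negative cosine between two r'_k means they pull z
# in opposing directions, so one task's progress degrades another's.
norms = np.linalg.norm(grads, axis=1)
cos = grads @ grads.T / (norms[:, None] * norms[None, :])
print(np.round(cos, 2))
```

The off-diagonal cosines quantify how strongly the behaviors' gradients agree or conflict on the shared input.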

OUR PROPOSED METHOD
We now present the proposed CIGF framework, which treats multi-behavior data both "as features" and "as labels" in an end-to-end fashion. The architecture is shown in Figure 5 and consists of three main components: i) the input layer, which parameterizes users and items as embedding vectors; ii) the compressed interaction graph convolution network (CIGCN), which extracts instance-level high-order relations from the multi-behavior data explicitly by treating it "as features"; and iii) the multi-expert with separate input (MESI) network, which mines multi-task supervision signals from the multi-behavior data with separate inputs by treating it "as labels".

Input
We first apply a shared embedding layer to transform the one-hot IDs of users and items into low-dimensional dense embeddings. Formally, given a user-item pair $(u, i)$, the embedding lookup operation for user $u$ and item $i$ can be formulated as follows:

$$\mathbf{x}_u = \mathbf{P}^\top \mathbf{p}_u, \qquad \mathbf{y}_i = \mathbf{Q}^\top \mathbf{q}_i, \quad (4)$$

where $\mathbf{p}_u \in \mathbb{R}^{M\times 1}$ and $\mathbf{q}_i \in \mathbb{R}^{N\times 1}$ denote the one-hot IDs of user $u$ and item $i$, $\mathbf{P} \in \mathbb{R}^{M\times d}$ and $\mathbf{Q} \in \mathbb{R}^{N\times d}$ are the user and item embedding matrices, and $d$ is the embedding size.
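As a sanity check, the lookup above amounts to row selection in the embedding matrix. A minimal sketch with illustrative sizes:

```python
import numpy as np

# Embedding lookup sketch: multiplying the embedding matrix by a
# one-hot ID vector is equivalent to selecting a row. Sizes are toy.
M, N, d = 5, 7, 4                      # users, items, embedding size
rng = np.random.default_rng(1)
P = rng.normal(size=(M, d))            # user embedding matrix
Q = rng.normal(size=(N, d))            # item embedding matrix

u, i = 2, 3
p_u = np.eye(M)[u]                     # one-hot ID of user u
q_i = np.eye(N)[i]                     # one-hot ID of item i
x_u = P.T @ p_u                        # lookup via multiplication
y_i = Q.T @ q_i
assert np.allclose(x_u, P[u])          # identical to direct indexing
assert np.allclose(y_i, Q[i])
```

In practice the multiplication is never materialized; frameworks index the row directly.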
Graph Interaction.

We first construct the adjacency matrix of each behavior graph:

$$\mathbf{A}_k = \begin{pmatrix} \mathbf{0} & \mathbf{Y}^k \\ (\mathbf{Y}^k)^\top & \mathbf{0} \end{pmatrix},$$

where $\mathbf{Y}^k$ is the user-item interaction matrix of behavior $k$. We then use these adjacency matrices for explicit high-order graph interaction, which encodes the instance-level relations between every two nodes. Denote the set of all possible $l$-th ($1 \le l \le L$) order interaction graphs starting from behavior $k$ as $\mathcal{B}^l_k$. The reason we only use interaction graphs starting from behavior $k$ here is to generate graph sets with different high-order relations, which will be used as separate inputs for MESI to alleviate the potential gradient conflict. The generation of $\mathcal{B}^l_k$ can be formulated as:

$$\mathcal{B}^l_k = \mathcal{B}^{l-1}_k \otimes \mathcal{A},$$

where $\mathcal{B}^1_k = \{\mathbf{A}_k\}$ and $\otimes$ denotes the matrix multiplication operation between any pair of matrices from the two sets. Notice that there are $K$ sets of high-order graphs $\mathcal{B}^l_k$ ($1 \le k \le K$), each of which starts from a behavior-specific adjacency matrix $\mathbf{A}_k$. By selecting a different behavior-specific graph $\mathbf{A}_k$ at each step, we can construct a high-order graph set that contains multiple $l$-th order interaction graphs with different semantics. Specifically, the number of all possible $l$-th order graphs can be calculated as:

$$N\big(\mathcal{B}^l_k\big) = f(K, l-1) = K^{l-1},$$

where $N(\cdot)$ measures the number of elements in a set, $f(K, l-1)$ is the function that calculates the $(l-1)$-th power of a given number $K$, and $K$ is the number of behavior types. However, as the number of all possible $l$-th order graphs is an exponential function of $l-1$, it is impractical to use such an extensive space for $l$-th order interaction graph generation.
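The enumeration of l-th order interaction graphs can be sketched as follows. For simplicity of illustration, the sketch uses small square adjacency matrices over the full node set, and `interaction_graphs` is a hypothetical helper, not a function from the paper's code:

```python
import numpy as np
from itertools import product

# Sketch of explicit high-order graph interaction: composing behavior
# adjacency matrices by matrix multiplication. K behavior types, l-th
# order graphs starting from behavior k; the count grows as K^(l-1).
K, n = 3, 6                                  # behavior types, node count
rng = np.random.default_rng(2)
A = [rng.integers(0, 2, size=(n, n)).astype(float) for _ in range(K)]

def interaction_graphs(k, order):
    """All order-`order` interaction graphs whose first hop is behavior k."""
    graphs = []
    for tail in product(range(K), repeat=order - 1):
        G = A[k]
        for b in tail:
            G = G @ A[b]                     # one more hop of behavior b
        graphs.append(G)
    return graphs

third = interaction_graphs(0, 3)
print(len(third))                            # K^(l-1) = 3^2 = 9
```

The exponential count of the enumeration is exactly why the paper compresses the graph space in the next step.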

Graph Compression.
In order to find an applicable solution with limited time and space complexity, we employ a graph compression layer to construct the high-order graph sets iteratively with a node-wise multi-head attention mechanism. The graph compression layer for a target node $t$ (node $t$ could be a user node $u$ or an item node $i$) can be formulated as:

$$\mathcal{B}^l_{k,t} = \mathcal{B}^{l-1}_{k,t} \otimes \Big\{ \textstyle\sum_{k'=1}^{K} a^{l,h}_{t,k'} \mathbf{A}_{k'} \;\Big|\; 1 \le h \le H \Big\},$$

where $\mathcal{B}^1_{k,t} = \{\mathbf{A}_k\}$, $H$ is the number of heads, and $\mathbf{a}^{l,h}_t \in \mathbb{R}^{1\times K}$ is the learned attention vector for node $t$ in the $l$-th order and the $h$-th head. By using the node-wise multi-head attention mechanism, the number of generated $l$-th order graphs is reduced from $f(K, l-1)$ to $f(H, l-1)$. Since $H$ is usually much smaller than $K$ and $l-1$ is usually a very small value, the scale of $f(H, l-1)$ is acceptable. The attention mechanism not only serves as a tool to reduce complexity, but is also used to find the most useful behaviors for high-order graph generation. To adaptively select the most relevant behaviors of users and items for representation learning, we use the node-wise attention mechanism to obtain soft weights for the different behaviors, which can be defined as:

$$a^{l,h}_{t,k} = \frac{\exp\big(\mathbf{e}_t^\top \sigma(\mathbf{W}^{l,h}_k \mathbf{e}_t + \mathbf{b}^{l,h}_k)\big)}{\sum_{k'=1}^{K} \exp\big(\mathbf{e}_t^\top \sigma(\mathbf{W}^{l,h}_{k'} \mathbf{e}_t + \mathbf{b}^{l,h}_{k'})\big)},$$

where $\sigma(\cdot)$ is the activation function, set as LeakyReLU here for better performance, $\mathbf{e}_t$ is the embedding of node $t$, and $\mathbf{W}^{l,h}_k \in \mathbb{R}^{d\times d}$ and $\mathbf{b}^{l,h}_k \in \mathbb{R}^{d\times 1}$ are the feature transformation matrix and bias, respectively. Notice that we also use a behavior- and layer-wise attention mechanism here (i.e., we use different transformation matrices for different layers and behaviors); we empirically verify its effectiveness in Section 5.3.1. In this way, we can generate personalized high-order graph sets for both users and items, which are used for later information propagation and integration.
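A rough sketch of node-wise attention weights over the K behaviors (a single head). The scoring function, weight shapes, and inputs below are illustrative placeholders, not the paper's exact parameterization:

```python
import numpy as np

# Node-wise behavior attention sketch: each node gets its own softmax
# weights over the K behaviors. All parameters here are toy values.
K, n, d = 3, 6, 4
rng = np.random.default_rng(4)
X = rng.normal(size=(n, d))                 # node embeddings e_t
W = rng.normal(size=(K, d, d))              # behavior-wise transforms
b = rng.normal(size=(K, d))                 # behavior-wise biases

def leaky_relu(z, alpha=0.2):
    return np.where(z > 0, z, alpha * z)

# Unnormalized score per node and behavior, then softmax over behaviors.
scores = np.stack([(X * leaky_relu(X @ W[k].T + b[k])).sum(axis=1)
                   for k in range(K)], axis=1)        # shape (n, K)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)
assert np.allclose(weights.sum(axis=1), 1.0)
```

Each row of `weights` is one node's personalized soft selection of behaviors, which is what lets the compression keep only H weighted combinations instead of all K branches.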

Graph Convolution.
After generating the graph sets by the graph interaction and graph compression layers, we enrich the representations of users and items with graph convolution. The neighbor information propagation in each graph can be formulated as:

$$\mathbf{x}^{l,s}_{u,k} = g\big(\mathbf{x}_{N_u}, \mathbf{B}^{l,s}_{k,u}\big), \qquad \mathbf{y}^{l,s}_{i,k} = g\big(\mathbf{y}_{N_i}, \mathbf{B}^{l,s}_{k,i}\big),$$

where $\mathbf{B}^{l,s}_{k,u}$ and $\mathbf{B}^{l,s}_{k,i}$ denote the adjacency matrices of the $s$-th graph in the graph sets $\mathcal{B}^l_{k,u}$ and $\mathcal{B}^l_{k,i}$, $N_u$ and $N_i$ denote the neighbors of $u$ and $i$, and $\mathbf{x}^{l,s}_{u,k}$ and $\mathbf{y}^{l,s}_{i,k}$ denote the outputs obtained by aggregating neighbor information from the $s$-th graph. $g(\cdot)$ is an arbitrary graph convolution operator that can be used for information aggregation. We implement $g(\cdot)$ with the following four state-of-the-art GCN models: GCN Aggregator [21], NGCF Aggregator [34], LR-GCCF Aggregator [7], and LightGCN Aggregator [14].

Notice that the matrix multiplications lead to a very dense high-order graph, which is computationally unacceptable. Therefore, we use the associative property of matrix multiplication to accelerate the aggregation process. For example, $(\mathbf{A}_1 \otimes \mathbf{A}_2 \otimes \mathbf{A}_3) \times \mathbf{x}_u$ can be accelerated as $\mathbf{A}_1 \times (\mathbf{A}_2 \times (\mathbf{A}_3 \times \mathbf{x}_u))$, where $\times$ is the multiplication between a sparse matrix and a vector. As $\times$ combines a sparse matrix and a vector into a single vector, the computational complexity of the subsequent multiplications is effectively reduced.

After the neighbor information propagation process, we have $f(H, l-1)$ neighbor representations for each layer and for each node $u$ and $i$. For simplicity, we apply the sum operation over these representations to get the layer-wise user and item representations:

$$\mathbf{x}^{l}_{u,k} = \sum_{s} \mathbf{x}^{l,s}_{u,k}, \qquad \mathbf{y}^{l}_{i,k} = \sum_{s} \mathbf{y}^{l,s}_{i,k}.$$

To better explore high-order neighbor information and alleviate the over-smoothing issue, we introduce a residual operation to our graph convolution layer for final node information updating, which is defined as:

$$\mathbf{x}^{l}_{u,k} \leftarrow \mathbf{x}^{l}_{u,k} + \mathbf{x}^{l-1}_{u,k}, \qquad \mathbf{y}^{l}_{i,k} \leftarrow \mathbf{y}^{l}_{i,k} + \mathbf{y}^{l-1}_{i,k}.$$

As the outputs of different layers reflect relations of different orders, we finally aggregate these outputs into a single vector with the sum operation:

$$\mathbf{x}^{*}_{u,k} = \sum_{l=0}^{L} \mathbf{x}^{l}_{u,k}, \qquad \mathbf{y}^{*}_{i,k} = \sum_{l=0}^{L} \mathbf{y}^{l}_{i,k},$$

where $\mathbf{x}^{0}_{u,k} = \mathbf{x}_u$ and $\mathbf{y}^{0}_{i,k} = \mathbf{y}_i$ are the initial embeddings for user $u$ and item $i$. Notice that the central nodes aggregate neighbor information of different layers directly, which has been verified to be useful for addressing the heterogeneity of the user-item interaction graph [30], compared with recursively updating the node embedding at the $l$-th layer with the output from the $(l-1)$-th layer.
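The associativity trick can be verified directly. Dense random matrices stand in for the sparse behavior graphs in this sketch; sizes are illustrative:

```python
import numpy as np

# Associativity sketch: (A1 @ A2 @ A3) @ x equals A1 @ (A2 @ (A3 @ x)),
# but the right-hand form keeps every intermediate a vector and never
# materializes the dense high-order graph.
n = 100
rng = np.random.default_rng(3)
A1, A2, A3 = (rng.normal(size=(n, n)) for _ in range(3))
x = rng.normal(size=n)

dense = (A1 @ A2 @ A3) @ x           # O(n^3) matrix-matrix products
fast = A1 @ (A2 @ (A3 @ x))          # three O(n^2) matrix-vector products
assert np.allclose(dense, fast)
```

With sparse adjacency matrices the gap is even larger, since each matrix-vector product costs only the number of nonzeros.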

Multi-Expert with Separate Input
With the design of CIGCN, we have obtained $K$ representations $\mathbf{x}^*_{u,k}$ and $\mathbf{y}^*_{i,k}$ ($1 \le k \le K$) for each user $u$ and each item $i$, as shown in Figure 5. Each representation describes the personalized preferences of user $u$ or item $i$ with respect to relations starting from behavior $k$. To alleviate the potential gradient conflict when treating multi-behavior data "as labels", we propose a Multi-Expert with Separate Input (MESI) network with a novel separate-input design in this section.
Existing multi-behavior methods like NMTR [10], GHCF [6], and MB-GMN [37] optimize multiple tasks with the same static weights for all samples, which is limited by the sample differences, as analyzed in Section 2. To address this problem, we use a hierarchical neural architecture similar to MMOE [24] and PLE [31] for MTL. Specifically, we use experts to replace the shared bottom layer used in NMTR, GHCF, and MB-GMN to learn behavior-aware information. In this paper, each expert is defined as the combination of $\mathbf{x}^*_{u,k}$ and $\mathbf{y}^*_{i,k}$, which can be formulated as:

$$E_k = \mathbf{x}^*_{u,k} \odot \mathbf{y}^*_{i,k},$$

where $\odot$ is the Hadamard product operation, and $\mathbf{x}^*_{u,k}$ and $\mathbf{y}^*_{i,k}$ are the inputs related to behavior $k$. As the separate inputs are utilized here for the generation of experts, we obtain $K$ experts in total.
As different experts may contain different preferences of users or properties of items, it is necessary to combine these experts for the final prediction of each task. We then use the separate input to produce a task-aware gate for each task, which automatically selects a subset of experts useful for the prediction of this task. The gate for task $k$ can be defined as:

$$\mathbf{g}_{u,i,k} = \text{softmax}\big(\mathbf{W}_k (\mathbf{x}^*_{u,k} \,\|\, \mathbf{y}^*_{i,k}) + \mathbf{b}_k\big),$$

where $\|$ is the vector concatenation operation, $\mathbf{W}_k \in \mathbb{R}^{K\times 2d}$ and $\mathbf{b}_k \in \mathbb{R}^{K\times 1}$ are the feature transformation matrix and bias, and $\mathbf{g}_{u,i,k} \in \mathbb{R}^{K\times 1}$ is the attention vector used as a selector to calculate the weighted sum of all experts. The final prediction score for task $k$ is calculated as:

$$\hat{o}_{u,i,k} = h_k\Big(\sum_{j=1}^{K} \mathbf{g}_{u,i,k}(j)\, E_j\Big),$$

where $\mathbf{g}_{u,i,k}(j)$ denotes the $j$-th element of the vector $\mathbf{g}_{u,i,k}$ and $h_k(\cdot)$ is the tower function. Following [18], we use the average operation as the tower function here for simplicity.
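The expert-and-gate computation can be sketched as follows. Gate parameters and inputs are toy values; the average tower follows the text, and the exact gate parameterization is an assumption for illustration:

```python
import numpy as np

# MESI sketch: K experts from separate inputs, and one task-aware gate
# per task that softmax-weights the experts.
K, d = 3, 4
rng = np.random.default_rng(5)
x = rng.normal(size=(K, d))            # separate user inputs x*_{u,k}
y = rng.normal(size=(K, d))            # separate item inputs y*_{i,k}
experts = x * y                        # expert_k = x*_{u,k} ⊙ y*_{i,k}

Wg = rng.normal(size=(K, K, 2 * d))    # one gate weight matrix per task
bg = rng.normal(size=(K, K))           # one gate bias per task

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = []
for k in range(K):                     # task-aware gate for task k
    g = softmax(Wg[k] @ np.concatenate([x[k], y[k]]) + bg[k])
    fused = (g[:, None] * experts).sum(axis=0)   # weighted sum of experts
    scores.append(fused.mean())        # average tower -> scalar score
```

Because each gate is fed the separate input of its own task, different tasks can weight the shared pool of experts differently.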

Joint Optimization for MTL
Since we have obtained the prediction value $\hat{o}_{u,i,k}$ for each type of behavior $k$, we use the Bayesian Personalized Ranking (BPR) [26] loss for multi-task learning, which can be formulated as:

$$\mathcal{L} = \sum_{k=1}^{K} \lambda_k \sum_{(u,s,t)\in O_k} -\ln \sigma\big(\hat{o}_{u,s,k} - \hat{o}_{u,t,k}\big),$$

where $O_k = \{(u,s,t) \mid y^k_{u,s} = 1,\, y^k_{u,t} = 0\}$ denotes the training triples of behavior $k$ with an observed item $s$ and a sampled unobserved item $t$, $\sigma(\cdot)$ is the sigmoid function, and $\lambda_k$ is the coefficient controlling the importance of the $k$-th task.
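A minimal sketch of one behavior's BPR term with toy scores (the per-behavior weighting is noted in a comment):

```python
import numpy as np

# BPR sketch: for each observed pair (u, s) and a sampled negative t,
# maximize sigma(o_pos - o_neg), i.e., minimize -ln sigma(o_pos - o_neg).
rng = np.random.default_rng(6)
pos = rng.normal(size=10)              # scores of observed interactions
neg = rng.normal(size=10)              # scores of sampled negatives

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

bpr = -np.log(sigmoid(pos - neg)).mean()
# The multi-task loss sums one such term per behavior k, scaled by
# its task coefficient lambda_k.
```

Since sigmoid outputs lie strictly in (0, 1), each term is positive, and the loss shrinks as positive scores exceed negative ones.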

Datasets.
To reduce biases, we adopt the same public datasets (i.e., Beibei, Taobao, and IJCAI) and pre-processing steps as in MB-GMN [37], and the statistics are shown in Table 1.

Evaluation Metrics.
The Hit Ratio (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K) are used to evaluate performance. By default, we set K = 10 in all experiments. Similar results for other settings (i.e., K = 1, 5, 20) on the three datasets can also be obtained; they are not presented here due to space limitations. The details of the implementation are shown in Appendix A.1.

Overall Performance Comparison
From Table 2, we have the following observations in terms of model effectiveness (analysis of complexity is shown in Appendix A.6).

From the results displayed in Table 4, global-wise attention performs the worst among all variants in most cases, which suggests the importance of learning customized information for each node. Besides, all the enhanced node-wise variants perform better than the pure node-wise attention mechanism, and our proposed attention mechanism achieves the best performance on all three datasets. The results indicate the effectiveness and rationality of our proposed behavior-wise and layer-wise node-wise attention mechanism for high-order multi-behavior relation selection.

On the impact of MTL modules.
To further demonstrate the superiority of our proposed MESI for MTL, we replace it with four state-of-the-art MTL modules, namely Shared Bottom [5], Bilinear [6], MMOE [24], and PLE [31], and apply them on top of CIGCN for multi-behavior recommendation. Notice that there are K representations used as separate inputs generated from CIGCN. To make them applicable to these four modules, which use the same input, we average the K representations to get one unified input. The resulting variants are named CIGCN-SB, CIGCN-Bilinear, CIGCN-MMOE, and CIGCN-PLE, respectively, with CIGF denoting our full model. The results are summarized in Table 5. As we can see, CIGCN-SB performs the worst among all MTL models on all datasets. CIGCN-Bilinear replaces the neural-network prediction head in CIGCN-SB with a lightweight matrix transformation and performs better; a possible reason is that the lightweight operation reduces the risk of overfitting. Besides, both CIGCN-MMOE and CIGCN-PLE employ gate networks with adaptive attention weights for information fusion, and thus outperform the static and same-weighted CIGCN-SB. Finally, our MESI consistently performs the best on all datasets. This verifies the effectiveness of separate inputs for MTL.

5.3.4
On the impact of GCN aggregators. To explore the impact of different GCN aggregators, we compare variants of our proposed model with different GCN aggregators, including the GCN Aggregator [21], NGCF Aggregator [34], LR-GCCF Aggregator [7], and LightGCN Aggregator [14]. The experimental results are illustrated in Figure 6. We can see that the NGCF Aggregator performs better than the GCN Aggregator on all datasets. A possible reason is that the additional feature interactions introduced by the NGCF Aggregator provide more information. We also find that the LR-GCCF Aggregator performs slightly better than the NGCF Aggregator on almost all datasets. The reason is that removing the transformation matrix can alleviate overfitting. Moreover, the LightGCN Aggregator obtains the best performance on all datasets by simultaneously removing the transformation matrix and the activation function.

In-depth Analysis of Model Design
In this part, we conduct experiments to make an in-depth analysis of instance-level high-order relation modeling when treating multi-behavior data "as features" and of the potential gradient conflict when treating multi-behavior data "as labels".

5.4.1
Instance-level high-order relation modeling. We vary the depth of CIGF to investigate whether our model can benefit from instance-level high-order relations, and we compare the results with GHCF and MB-GMN, which model behavior-level high-order relations. Due to lack of space, we only show the results on the Beibei and Taobao datasets in Figure 7; the results on the other dataset are consistent. We can see that CIGF consistently outperforms the other methods as the layer number increases. Besides, CIGF remains stable on Beibei and improves continuously on Taobao, while MB-GMN degrades rapidly on both datasets and GHCF degrades rapidly on Beibei when increasing the layer number. This observation verifies the effectiveness of our proposed method for instance-level high-order relation modeling.
Instance-level high-order relations bring benefits for the final recommendation. Besides, they can also reveal the underlying reasons that motivate users' preferences on items. Towards this end, we select and show the top-3 and bottom-3 relations among all possible relations for each order, according to the average attention weights of all users, in Table 6. Notice that there are only second-order relations on Beibei, as our model achieves its best results with two layers on this dataset.

Gradient conflict analysis.
To verify that our model can alleviate potential gradient conflict, we perform experiments on user groups with different behavior relevance levels. In particular, we divide the test set into six user groups according to the average Pearson correlation [4] among all behaviors; the calculation of the average Pearson correlation can be found in Appendix A.4. For a fair comparison, we select a subset from each user group to keep the interaction number for each user fixed, thus preventing the potential impact of node degree on the results [34]. Figure 8 presents the results. We omit the results on the IJCAI dataset due to space limitations; they show consistent trends. For more rigorous results, we run each experiment 5 times and draw the mean and fluctuation range in the figure. We find that MESI consistently outperforms all baselines among all user groups, which further demonstrates the superiority of MESI for MTL. Besides, with the increase of behavior correlations, MESI achieves better performance, while the performance of the other baselines fluctuates or even decreases. A possible reason is the negative transfer caused by potential gradient conflict when knowledge is transferred across different tasks.
To understand why our proposed MESI can alleviate potential gradient conflict, we conduct experiments to compare the expert utilization of MESI with other gate-based models (MMOE and PLE). Following [31], we visualize the average weight distribution of the experts used by the target behavior prediction in Figure 9. Notice that we omit the gates used for other behaviors, as our goal is to predict the interaction probability of the target behavior. Besides, for the sake of comparison, we fix the number of experts at 3 on the Beibei dataset and 4 on the Taobao and IJCAI datasets for both MMOE and PLE. It is shown that our MESI achieves better differentiation between different experts, while MMOE and PLE have a nearly uniform distribution over all experts. Thus, our MESI can selectively leverage the information of different behaviors to update the gradient and avoid potential conflict.

CONCLUSIONS
In this paper, we propose the CIGF framework for multi-behavior recommendation. To explicitly model instance-level high-order relations, we introduce the CIGCN module, which leverages matrix multiplication as the interaction operator to generate high-order interaction graphs, and performs graph convolution on these graphs to explore relation integration. To alleviate potential gradient conflict, we propose the MESI network, which explicitly uses behavior-specific separate inputs; by doing so, the risk of negative transfer is reduced. We conduct comprehensive experiments on three real-world datasets and show that the proposed CIGF outperforms all the state-of-the-art methods on all three datasets. Further analysis shows that CIGF can fully capture high-order relations and effectively alleviate negative transfer.

A APPENDIX

A.1 Parameter Settings
Our proposed CIGF is implemented in TensorFlow [2]. For a fair comparison, we set the embedding size of both users and items to 16 for all models and initialize the model parameters with the Xavier method [11]. We adopt Adam [20] to optimize the models, with a learning rate of 0.001 and a batch size of 256. The number of GCN layers for graph models is searched from {1, 2, 3, 4, 5}. We use only one head in the graph compression layer for simplicity, as it already achieves sufficient performance improvements. Other parameter settings are kept consistent with MB-GMN [37]. All experiments are run 5 times and the average results are reported.

A.2 The Coupled Gradient Issue in MTL
For the sake of simplicity, we assume that the learned user/item representations in existing MTL models can be expressed as:

$$\mathbf{x}^*_u = f_U(\mathbf{x}_u, \mathbf{A}), \quad \mathbf{y}^*_v = f_V(\mathbf{y}_v, \mathbf{A}),$$

where $f_U(\cdot)$ and $f_V(\cdot)$ denote the representation learning functions, $\mathbf{x}_u$ and $\mathbf{y}_v$ are the initial embeddings for user $u$ and item $v$, and $\mathbf{A}$ is the corresponding adjacency matrix of the MBG $\mathcal{G}$. Note that $\mathbf{A}$ is optional for $f_U(\cdot)$ and $f_V(\cdot)$, which generalizes them to non-graph functions.
Taking $(\mathbf{x}^*_u, \mathbf{y}^*_v)$ as the same input for MTL, the loss function can be formulated as:

$$\mathcal{L} = \sum_{k=1}^{K} \ell\left(\hat{y}_{uv,k},\, y_{uv,k}\right), \quad \hat{y}_{uv,k} = h_k\!\left(\mathbf{x}^*_u \odot \mathbf{y}^*_v\right),$$

where $\hat{y}_{uv,k}$ denotes the predictive probability that user $u$ will interact with item $v$ under the $k$-th behavior, $y_{uv,k}$ is the true label, $\ell(\cdot)$ is the loss function, and $h_k(\cdot)$ is the predictive function in MTL models. Then we have:

$$\frac{\partial \mathcal{L}}{\partial \left(\mathbf{x}^*_u \odot \mathbf{y}^*_v\right)} = \sum_{k=1}^{K} \mathbf{r}_k, \quad \mathbf{r}_k = \frac{\partial\, \ell\left(\hat{y}_{uv,k},\, y_{uv,k}\right)}{\partial \left(\mathbf{x}^*_u \odot \mathbf{y}^*_v\right)},$$

where $\odot$ is the Hadamard product operation. As $\mathbf{r}_k$ denotes the derivative of a scalar with respect to a vector, it is itself a vector. For every $k \in \{1, 2, \ldots, K\}$, $\mathbf{r}_k$ contributes to the updating magnitude and direction of the vector $\mathbf{x}^*_u \odot \mathbf{y}^*_v$.
We can see that the gradients from all behaviors are coupled on the shared input. Similar to Section 3.3, gradient conflicts arise from this coupled gradient issue when the same input is used for MTL.
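The coupling can be made concrete with a small numeric sketch (illustrative only; the squared losses, linear heads, and two-behavior setup are our assumptions, not the paper's models): when both tasks read the same shared input, their gradients sum on that input and can cancel each other out.

```python
import numpy as np

# Shared input z = x*_u ⊙ y*_v read by both behavior predictions.
z = np.array([1.0, -0.5])

# Two behaviors with a shared linear head and squared losses:
# L_k = (w · z - y_k)^2, so dL_k/dz = 2 (w · z - y_k) w.
w = np.array([1.0, 0.0])
targets = [0.0, 2.0]  # behavior 0 wants the score lower, behavior 1 higher

grads = [2 * (w @ z - yk) * w for yk in targets]

# The update applied to z is the *sum* of the per-behavior gradients ...
total = grads[0] + grads[1]

# ... and here the two gradients are exactly opposite and cancel:
# neither behavior's learning signal survives on the shared input.
assert np.allclose(grads[0], -grads[1])
assert np.allclose(total, 0.0)
```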

A.3 Decoupled Gradient of MESI for MTL
In contrast, our proposed MESI takes separate inputs $\mathbf{x}^*_{u,k}$ and $\mathbf{y}^*_{v,k}$ ($k \in \{1, 2, \ldots, K\}$) for MTL. The loss function for MESI can be formulated as:

$$\mathcal{L} = \sum_{k=1}^{K} \ell\left(\hat{y}_{uv,k},\, y_{uv,k}\right), \quad \hat{y}_{uv,k} = h_k\!\left(\mathbf{x}^*_{u,k} \odot \mathbf{y}^*_{v,k}\right),$$

so that the gradient with respect to each separate input,

$$\frac{\partial \mathcal{L}}{\partial \left(\mathbf{x}^*_{u,k} \odot \mathbf{y}^*_{v,k}\right)} = \frac{\partial\, \ell\left(\hat{y}_{uv,k},\, y_{uv,k}\right)}{\partial \left(\mathbf{x}^*_{u,k} \odot \mathbf{y}^*_{v,k}\right)},$$

involves only the $k$-th behavior, where the chain-rule factor $\partial \ell / \partial \hat{y}_{uv,k}$ is a scalar.
From the above derivation, it can be clearly seen that our proposed MESI decouples the gradients of different behaviors and selectively uses the information of each behavior to update them, which alleviates the gradient conflict issue.
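The decoupling can be checked with the same kind of numeric sketch (again our own illustration with assumed squared losses and a shared linear head, not the paper's models): with separate inputs, gradients from behaviors with opposing targets land on different vectors and both behaviors make progress.

```python
import numpy as np

# MESI-style separate inputs: behavior k optimizes its own vector
# z_k = x*_{u,k} ⊙ y*_{v,k} (names are illustrative).
z = [np.array([1.0, -0.5]), np.array([1.0, -0.5])]
w = np.array([1.0, 0.0])  # shared linear head for simplicity
targets = [0.0, 2.0]      # two behaviors pulling in opposite directions

# Per-behavior gradients of squared losses L_k = (w · z_k - y_k)^2
# land on separate vectors, so they can never sum and cancel.
grads = [2 * (w @ zk - yk) * w for zk, yk in zip(z, targets)]

# One gradient-descent step per behavior on its own input.
lr = 0.1
z_new = [zk - lr * gk for zk, gk in zip(z, grads)]

# Both behaviors move toward their (conflicting) targets: the initial
# score w · z_k is 1.0 for both, and each step shrinks |score - target|.
for zk, yk in zip(z_new, targets):
    assert abs(w @ zk - yk) < abs(1.0 - yk)
```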

A.4 Calculation of Pearson Correlation
We choose the Pearson correlation to divide users into different test groups. The Pearson correlation between behaviors $i$ and $j$ for user $u$ is computed over the user's per-item label vectors $\mathbf{y}^u_i$ and $\mathbf{y}^u_j$:

$$\rho^u_{ij} = \frac{\operatorname{cov}\left(\mathbf{y}^u_i, \mathbf{y}^u_j\right)}{\sigma\!\left(\mathbf{y}^u_i\right)\,\sigma\!\left(\mathbf{y}^u_j\right)},$$

and the average Pearson correlation is obtained by averaging $\rho^u_{ij}$ over all $\frac{K(K-1)}{2}$ behavior pairs. The multi-behavior data can be treated "as labels" for multi-task supervised learning. Figure 10 shows the label correlations with a Venn diagram when treating multi-behavior data as labels, where different overlaps represent different label correlations.
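A minimal sketch of this computation, assuming each behavior is represented as a binary label vector over a user's candidate items (the variable names and toy data are illustrative):

```python
import numpy as np
from itertools import combinations

def avg_pearson(labels):
    """labels: (K, N) array, one binary row per behavior over N items.
    Returns the average Pearson correlation over all K(K-1)/2 pairs."""
    K = labels.shape[0]
    corrs = [np.corrcoef(labels[i], labels[j])[0, 1]
             for i, j in combinations(range(K), 2)]
    return 2.0 / (K * (K - 1)) * sum(corrs)

# Toy labels for one user: view / cart / buy over 5 items.
labels = np.array([[1, 0, 1, 1, 0],
                   [1, 0, 1, 0, 0],
                   [1, 0, 0, 1, 0]], dtype=float)
r = avg_pearson(labels)
assert -1.0 <= r <= 1.0
```

Users are then bucketed into the six test groups by thresholding `r`.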

Figure 2: Histogram of user numbers w.r.t. interaction numbers for different behaviors.

Figure 1: An example of multiple types of behaviors on an e-commerce website, the corresponding MBG, and a comparison of behavior-level and instance-level relations.

Figure 3: Network structure of existing models and our proposed MESI model. Blue rectangles represent shared layers, pink and green rectangles represent task-specific layers, and pink and green circles denote task-specific gates.

Figure 4: An example of the gradient conflict.

4.2 Compressed Interaction Graph Convolution

4.2.1 Graph Interaction. Inspired by the success of DCN [33] and xDeepFM [22], which model high-order feature interactions with explicit feature crossing, we use adjacency matrix multiplication as the graph interaction operator for explicit instance-level high-order relation modeling. As shown in Figure 5, we first partition the MBG into several behavior-specific graphs $\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_K$ with corresponding adjacency matrices $\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_K$, where the $(u, v)$ entry of $\mathbf{A}_k$ is 1 if $(u, v) \in O^+_k$ and 0 otherwise. Here $\{(u, v, v') \mid (u, v) \in O^+_k, (u, v') \in O^-_k\}$ denotes the training dataset, where $O^+_k$ indicates observed positive user-item interactions under behavior $k$ and $O^-_k$ indicates unobserved user-item interactions under behavior $k$. $\Theta$ represents the set of all model parameters, $\sigma$ is the sigmoid function, and $\lambda$ is the $L_2$ regularization coefficient for $\Theta$.
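The graph interaction operator can be sketched with toy adjacency matrices (our own illustration; the behaviors and matrix sizes are assumptions, not the paper's datasets). Multiplying the user-item matrix of one behavior with the transpose of another yields a graph whose entries count shared interaction instances, i.e., explicit instance-level high-order relations:

```python
import numpy as np

# Toy behavior-specific user-item adjacency matrices (3 users, 4 items).
A_view = np.array([[1, 1, 0, 0],
                   [0, 1, 1, 0],
                   [0, 0, 1, 1]])
A_buy  = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 1]])

# Graph interaction via matrix multiplication: entry (u, u') of the
# resulting user-user graph counts items that u viewed and u' bought,
# an explicit cross-behavior, instance-level relation.
A_view_buy = A_view @ A_buy.T

assert A_view_buy.shape == (3, 3)
assert A_view_buy[0, 1] == 1  # user 0 viewed item 1, which user 1 bought
```

In practice such matrices are stored and multiplied in sparse form, which is why the computation stays tractable on large graphs.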

5.3.2 On the impact of the attention module. To demonstrate the effectiveness of our attention module, we consider four variants: (1) global-wise: the attention weight is global for all users/items; (2) node-wise: the attention weight is shared by all layers and behaviors but differs for each user/item; (3) node-wise+layer: the attention weight is shared by all behaviors but differs for each layer and each user/item; (4) node-wise+beh: the attention weight is shared by all layers but differs for each behavior and each user/item.
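The four variants differ only in the shape of the learned attention table; a sketch of the parameter shapes (with assumed toy sizes `N` nodes, `L` layers, `K` behaviors; names are illustrative):

```python
import numpy as np

N, L, K = 5, 3, 4  # nodes (users/items), GCN layers, behaviors (toy sizes)

variants = {
    "global-wise":     np.zeros(()),      # one scalar weight for everything
    "node-wise":       np.zeros((N,)),    # per node; shared across layers/behaviors
    "node-wise+layer": np.zeros((N, L)),  # per node and per layer
    "node-wise+beh":   np.zeros((N, K)),  # per node and per behavior
}

# Parameter counts grow with the granularity of weight sharing.
counts = {name: w.size for name, w in variants.items()}
assert counts["global-wise"] == 1
assert counts["node-wise+layer"] == N * L
```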

Figure 7: Effect of the layer number. The solid line and the dotted line represent HR and NDCG, respectively.

Table 6: The selected top-3 and bottom-3 relations.

Figure 8: Average performances for user groups with different behavior correlations.

Figure 9: Expert utilization in gate-based models.

Figure 10: Venn diagram of label correlations on the three datasets. 1/0 means having or not having this type of behavior. E.g., 0110 represents users who have only favorite and cart behaviors with items.

Table 2: The overall comparison. ★ indicates a statistically significant level (p-value < 0.05) comparing CIGF with the best baseline (indicated by underlined numbers).

• NGCF and LightGCN perform better than DMF and AutoRec on most datasets, which demonstrates the advantage of GNNs in extracting high-order collaborative signals. By distinguishing different behaviors, NMTR, DIPN, and MATN achieve much better performance than DMF and AutoRec. This verifies the necessity of extracting and modeling the relational information between different types of behaviors.

• NGCF, LightGCN, NMTR, DIPN, and MATN perform worse than the multi-behavior variants of NGCF and LightGCN, MBGCN, and MB-GMN on most datasets, which indicates the incapability of NN models and single-behavior GNN models in modeling high-order multi-behavior relations. This justifies the necessity of simultaneously considering multi-behavior and high-order relations.
5.3.1 On the effectiveness of key components. To evaluate the effectiveness of the sub-modules in our CIGF framework, we consider three model variants: (1) Base Model: we remove the CIGCN part (i.e., the behavior-specific graphs are used for convolution directly) and replace the MESI network with a bilinear module; this variant cannot model instance-level high-order relations and uses the same input for MTL. (2) w/o CIGCN: the CIGCN part is removed. (3) w/o MESI: the MESI part is replaced with a bilinear module. As shown in Table 3, both CIGCN and MESI bring performance improvements over the base model, and the complete CIGF framework achieves the best results. Therefore, we claim that instance-level high-order multi-behavior relations and separate inputs are both effective and complementary to each other, and that it is necessary to treat multi-behavior data both "as features" and "as labels".

Table 4: Performances of different attention variants.

Table 5: Impact of MTL modules.

Here $\hat{y}_{uv,k}$ denotes the predictive probability that user $u$ will interact with item $v$ under the $k$-th behavior, $y_{uv,k}$ is the true label, and $\ell(\cdot)$ is the loss function used for optimization. $g_k$ denotes the gate for task $k$, and $f_k$ denotes the expert generated from the inputs $\mathbf{x}^*_{u,k}$ and $\mathbf{y}^*_{v,k}$, as described in Section 4.3. For an arbitrary input vector $\mathbf{x}^*_{u,k} \odot \mathbf{y}^*_{v,k}$ ($k \in \{1, 2, \ldots, K\}$) to be optimized, we then have:

$$\frac{\partial \mathcal{L}}{\partial \left(\mathbf{x}^*_{u,k} \odot \mathbf{y}^*_{v,k}\right)} = \frac{\partial\, \ell\left(\hat{y}_{uv,k},\, y_{uv,k}\right)}{\partial \left(\mathbf{x}^*_{u,k} \odot \mathbf{y}^*_{v,k}\right)},$$

which involves only the $k$-th behavior.

Table 7: Training time comparison (seconds per epoch) of different methods on all three datasets.

6.6.1 Complexity Analysis. Time Complexity. We analyze the time complexity of CIGF, where the CIGCN module is the main cost: it scales with $|\mathcal{B}| \times d$, where $|\mathcal{B}|$ denotes the number of edges in all graphs of the set $\mathcal{B}$ (which depends on the behavior number $K$ and the layer number $L$) and $d$ is the embedding size. In CIGCN, the dense graphs in the set $\mathcal{B}$ are transformed into sparse graphs for computation. As $K$ is usually very small, the time complexity is comparable with existing GNNs, which is further verified with experiments in Section A.6.2. Space Complexity. The learnable parameters in our proposed CIGF mainly come from the user and item embeddings $\mathbf{x}_u$ and $\mathbf{y}_v$, similar to existing GNNs. Besides, as the dense graphs in the set $\mathcal{B}$ are transformed into the sparse behavior-specific graphs $\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_K$ for computation, no additional memory is needed to store these graphs, which keeps the memory footprint of the intermediate process acceptable.

Table 8: Testing time comparison (seconds per epoch) of different methods on all three datasets.

Efficiency Analysis. Apart from model effectiveness, training efficiency also matters. Table 7 shows the training time (one epoch) comparison between our CIGF and two representative baselines on all three datasets. The best baseline, MB-GMN, requires the longest training time, while our CIGF is faster, with 30.64%, 37.92%, and 22.27% time reductions on the three datasets. Besides, although GHCF is slightly faster than our CIGF on the Beibei dataset, it is inapplicable to the IJCAI dataset due to the unaffordable memory usage brought by non-sampling learning. Moreover, as shown in Table 8, our proposed CIGF is 16.89%, 5.24%, and 13.39% faster than the fastest of the other models on the three datasets. GHCF performs well in training efficiency but worst in testing. A possible reason is that the non-sampling learning loss of GHCF dramatically improves the efficiency of loss calculation, and thus training, while its GNN part, which contributes most of its complexity, is more complicated and therefore takes more time at test time. In summary, we claim that CIGF has the best overall efficiency.