Parallel Knowledge Enhancement based Framework for Multi-behavior Recommendation

Multi-behavior recommendation algorithms aim to leverage the multiplex interactions between users and items to learn users' latent preferences. Recent multi-behavior recommendation frameworks contain two steps: fusion and prediction. In the fusion step, advanced neural networks are used to model the hierarchical correlations between user behaviors. In the prediction step, multiple signals are utilized to jointly optimize the model with a multi-task learning (MTL) paradigm. However, recent approaches have not addressed the issue caused by imbalanced data distribution in the fusion step, resulting in the learned relationships being dominated by high-frequency behaviors. In the prediction step, the existing methods use a gate mechanism to directly aggregate expert information generated by coupling input, leading to negative information transfer. To tackle these issues, we propose a Parallel Knowledge Enhancement Framework (PKEF) for multi-behavior recommendation. Specifically, we enhance the hierarchical information propagation in the fusion step using parallel knowledge (PKF). Meanwhile, in the prediction step, we decouple the representations to generate expert information and introduce a projection mechanism during aggregation to eliminate gradient conflicts and alleviate negative transfer (PME). We conduct comprehensive experiments on three real-world datasets to validate the effectiveness of our model. The results further demonstrate the rationality and effectiveness of the designed PKF and PME modules. The source code and datasets are available at https://github.com/MC-CV/PKEF.


INTRODUCTION
Recommender systems are information filtering techniques designed to provide personalized services based on user preferences. In our daily lives, recommender systems are widely used in various scenarios such as e-commerce, social media, music, and video platforms. Early collaborative filtering (CF) techniques [31] made recommendations based on users' historical interactions with items, but they had limitations in effectively utilizing diverse user behavior information. In the real world, user behavior goes beyond a single type and includes various behaviors such as viewing, adding to cart, and purchasing. Among them, we mainly focus on a specific behavior, namely the target behavior (e.g., buy), and consider the other behaviors as auxiliary behaviors [8,19,37]. These multiple behavior signals carry rich user preferences, which can be leveraged to comprehensively understand user needs and provide better services.
Recent research has focused on effectively leveraging multiple behavior signals for recommendation. Existing frameworks for multi-behavior recommendation contain two steps: multi-behavior fusion and multi-behavior prediction [16]. In the fusion step, advanced neural networks are applied to capture the correlations between users and items across multiple behaviors. In the prediction step, multi-task learning (MTL) is devised to further utilize the heterogeneous interaction information [17].
Multi-behavior Fusion. Early studies applied matrix factorization [21,30,33] to multi-behavior recommendation. With the rise of deep learning, neural network-based approaches [8,12,37] have become popular in multi-behavior fusion. These methods can model the complex relationships between users and items, capturing richer user interests and item features. Among them, graph neural networks [5,13,14,25,26,35] have been widely applied in multi-behavior recommendation due to their ability to efficiently utilize high-order connectivity between users and items [4,38]. For example, MBGCN [19] and GNMR [36] utilize graph neural networks to improve recommendation performance. However, these methods do not consider using dependencies between behaviors to assist model learning. In the real world, user behaviors often follow a hierarchical order, such as view → cart → buy. User preference information from upstream behaviors (e.g., view) can be used to assist downstream tasks (e.g., cart and buy) [6,8,40]. CRGCN [40] and MB-CGCN [6] integrate the cascade dependencies between behaviors into graph convolutional networks (GCNs), facilitating the learning of user and item embeddings. These models, which consider the behavioral hierarchy, have demonstrated better performance than previous approaches.

Multi-behavior Prediction. Multi-task learning (MTL) is a commonly used approach in multi-behavior prediction, as it can effectively utilize complex heterogeneous signals from multiple tasks to jointly optimize the model. Existing multi-task learning methods typically have coupled inputs for different tasks [3,24,32]. They generate multiple experts in different ways and aggregate the expert information for subsequent tasks. For example, MMOE [24] utilizes coupled representations to generate multiple experts and assigns learnable weights to each task to aggregate the expert information. PLE [32] further improves this approach by generating task-specific experts on the basis of the experts shared by all tasks.
Although cascade graph convolutional networks and MTL-based multi-behavior recommendation methods have made significant progress in multi-behavior fusion and prediction respectively, they also have their limitations:
• Ignorance of the imbalanced behavioral distribution. As shown in Figure 1 (plotting the data distribution), the interactions for different behaviors are highly imbalanced. One behavior (e.g., view) may account for the majority of the total interactions. In cascade behavior modeling, this imbalance problem is further exacerbated. As shown in Figure 2, in the cascade stream, upstream behaviors have richer interaction information than downstream behaviors. Thus, in the process of behavior propagation, the learned relationships are dominated by upstream behaviors, biasing the model towards upstream behaviors and interfering with downstream behavior prediction.
• Negative transfer problem. When training multiple tasks, the performance of certain tasks can be negatively affected or interfered with by other tasks, resulting in performance degradation. This is known as the negative transfer phenomenon [34]. In multi-task learning, although coupled inputs can share information from different behaviors, they can also introduce potential gradient conflict issues (explained in Section 3.2.1). Additionally, when aggregating expert information from different behaviors for a specific task, noise from other behaviors is often introduced, leading to negative transfer problems.
To address these two issues, we propose a Parallel Knowledge Enhancement based Framework (PKEF) for Multi-behavior Recommendation. It consists of the Parallel Knowledge Fusion module (PKF) and the Projection Disentangling Multi-Experts network (PME). To address the first issue, PKF combines the cascade and parallel paradigms, leveraging parallel knowledge for adaptive enhancement of different behaviors' representations while learning hierarchical correlation information, so as to correct the information bias caused by imbalanced behavioral interactions.
To address the second issue, PME regards different behaviors as independent tasks, generates the corresponding expert information for each behavior from separate inputs, and aggregates the expert information from different behaviors using learnable weights. Considering that the aggregation of different behaviors may introduce noise during the learning process for a specific behavior task, PME introduces a projection mechanism during aggregation to disentangle the shared and unique parts of the other behavioral experts. The shared part is used for aggregation, avoiding the introduction of harmful information. For the unique part, an auxiliary loss is designed for optimization, which promotes the effectiveness of the complementary shared information. PME alleviates the negative transfer phenomenon while solving the gradient conflict problem (explained in Section 3.2.2).
In summary, our work makes the following contributions:
• We investigate the issues of ignorance of the imbalanced behavioral distribution in the cascade paradigm of multi-behavior recommendation and the negative transfer phenomenon in MTL. We propose an innovative multi-behavior recommendation framework (PKEF) to address these issues. It consists of the Parallel Knowledge Fusion module (PKF) and the Projection Disentangling Multi-Expert network (PME).
• To achieve better recommendation performance, we address the issue of imbalanced data distribution for different behaviors by enhancing the hierarchical information propagation in the cascade process using parallel knowledge (PKF). Additionally, we alleviate the gradient conflict introduced by coupled MTL inputs and propose a projection-based denoising method to remove harmful information between behaviors, effectively solving the negative transfer problem (PME).
• We conduct comprehensive experiments on three real-world datasets to demonstrate the effectiveness of our model.Further experimental results verify the rationality and effectiveness of the designed PKF and PME modules.

RELATED WORK
Multi-behavior Recommendation. Multi-behavior recommendation methods use multiple user-item interactions to solve the data sparsity problem. In recent years, this approach has attracted widespread attention.
Early multi-behavior recommendation methods usually handle multi-behavior data by introducing multiple matrix factorization [21,30,33] or designing new sampling strategies [11,23,27]. The former extends the traditional matrix factorization technique by conducting it on multiple matrices with shared embeddings, such as CMF [42]. The latter uses multiple behaviors as auxiliary data and designs new sampling strategies to enrich the training samples, such as MF-BPR [23] and VALS [7], which introduce and improve negative sampling strategies.
With the development of deep learning techniques [22,41,44], researchers have started to explore multi-behavior recommendation models based on deep neural networks (DNNs) or graph convolutional networks (GCNs). DNN-based models usually learn embeddings from each behavior and integrate them into the prediction of target behaviors. For example, DIPN [12] and MATN [37] use different attention mechanisms to model the relationship between behaviors for embedding learning and aggregation. NMTR [8] differs from the above methods by using a multi-task learning model in which all behaviors of the users serve as prediction targets and the prediction scores of the previous behavior are passed to the next behavior for prediction.
GCN-based models learn user embeddings by constructing a unified user-item graph and performing graph convolution operations. GHCF [4] explicitly models the high-order relationship between users and items through GCN and performs multi-task learning to predict each behavior through a non-sampling approach. MBGCN [19] takes behavior semantics into account, capturing them via an item-item propagation layer, and combines the behavior semantics with the behavior contributions learned from the user-item propagation layer for score prediction. The recently proposed CRGCN [40] and MB-CGCN [6] take into account the hierarchical correlation between behaviors and achieve great performance by building cascaded graph convolutional networks to capture user preferences. However, due to the imbalanced distribution of the interactions among different behaviors, simply employing cascaded networks will lead to the learned relationships being dominated by high-frequency behaviors, which interferes with downstream behavior prediction.
MTL for Recommendation. With the growing diversity of user interests, the limitations of single-task learning in traditional recommender systems have become increasingly obvious, especially in the face of multiple signals. To solve this dilemma, researchers have in recent years attempted to apply multi-task learning to recommender systems. One model widely used in multi-behavior recommendation is the shared bottom [3] structure, where each task shares the same bottom parameters to extract common features, while the parameters at the top layer are independent.
However, approaches based on this structure [4,8,39] lead to the negative transfer phenomenon and trigger a seesaw effect for tasks with weak relevance. To solve these problems, MTL structures based on gated expert algorithms have been proposed. MOE [18] divides the shared bottom structure into multiple experts that learn different features separately. MMOE [24] extends MOE by introducing a task-specific gating mechanism to obtain different fusion weights in multi-task learning. PLE [32] further proposes to employ shared or task-specific experts at the bottom layer and combine them adaptively through gating networks. However, these methods use coupled inputs for multiple tasks, which leads to the gradient conflict problem and the negative transfer phenomenon, thus affecting model performance (illustrated in Section 3.2.1).

PRELIMINARIES

3.1 Problem Definition
We define u and v as a user and an item, respectively. Meanwhile, U and V denote the user and item sets, respectively. The adjacency matrices of multiple behaviors can be represented by a set {M_1, M_2, ..., M_K}, where M_k indicates whether the user u interacted with the item v under behavior k. Furthermore, in order to represent the heterogeneous interaction information of users and items more conveniently, we define the multiplex user-item bipartite graph G = (H, E, M), where H = U ∪ V and E = ∪_{k=1}^{K} E_k is the edge set including all behavior records between users and items. In multi-behavior recommendation, we assume that k ∈ {1, 2, ..., K}, and the number corresponds to the upstream and downstream relationships between behaviors: the larger the number, the more downstream the behavior (i.e., K is the most downstream behavior). Last but not least, there exists a target behavior (denoted as M_K) to be optimized, which is purchasing (buying) in e-commerce scenarios.

3.2 Gradient Issue in MTL
3.2.1 Gradient Conflict with Coupled Input. Most of the existing methods use a coupled input for MTL, as summarized in Section 2. This may cause a gradient conflict issue in MTL which restricts their learning ability for each task. As the classical MTL methods directly couple the representations of different behaviors together with different weights, we have:

e*_u = Σ_{k=1}^{K} w_k e^k_u,   e*_v = Σ_{k=1}^{K} w_k e^k_v

where K is the number of behaviors and w_k is the weight of the k-th behavior. Taking (e*_u, e*_v) as input for MTL, the loss function can be formulated as:

L = Σ_{k=1}^{K} λ_k ℓ(ô^k_uv, o^k_uv),   ô^k_uv = f_k(e*_u ⊙ e*_v)

where ô^k_uv denotes the predictive probability that user u will interact with item v under the k-th behavior, o^k_uv is the true label, ℓ(•) is the loss function, and f_k(•) is the predictive function in MTL models. Then we have:

∂L / ∂(e*_u ⊙ e*_v) = Σ_{k=1}^{K} λ_k r_k,   r_k = ∂ℓ(ô^k_uv, o^k_uv) / ∂(e*_u ⊙ e*_v)

where ⊙ is the Hadamard product operation. As r_k denotes the derivative of a scalar with respect to a vector, it is also a vector. ∀k ∈ {1, 2, ..., K}, λ_k r_k determines the updating magnitude and direction of the vector e*_u ⊙ e*_v. We can see that the gradients from all behaviors are coupled, and they jointly optimize the same vector e*_u ⊙ e*_v, which leads to gradient conflicts. As a result, the harmful information coupled in the input affects the learning of the target behavior information during training, leading to negative transfer.
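This coupling can be made concrete with a small numerical sketch (the gradient values are toy numbers, not the paper's): when two tasks write their gradients into the same coupled vector, opposing components cancel and neither task's signal survives.

```python
import numpy as np

# Two behaviors share the coupled input h = e*_u ⊙ e*_v, so their
# gradients are summed on the same vector (toy values for illustration).
h = np.array([1.0, 1.0])            # coupled representation e*_u ⊙ e*_v
grad_view = np.array([0.8, -0.2])   # hypothetical gradient from the view task
grad_buy = np.array([-0.8, 0.3])    # hypothetical gradient from the buy task

coupled_update = grad_view + grad_buy   # the only update h can receive
# The first component cancels out: neither task's signal survives there.
print(coupled_update)                   # -> [0.  0.1]

# Negative cosine similarity confirms the two task gradients conflict.
cos = grad_view @ grad_buy / (np.linalg.norm(grad_view) * np.linalg.norm(grad_buy))
print(cos < 0)                          # -> True
```

With separated inputs (Section 3.2.2), each behavior would instead update its own vector, so no such cancellation occurs.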

3.2.2 Projection Disentangling Multi-Experts with Separated Input. To handle the above problem, we first utilize the separated input of each behavior to generate the behavior-specific expert information and the behavior-specific gating weight. Thus, we have:

e_k = e^k_u ⊙ e^k_v,   ĝ_k = softmax(f_k(e^k_u ⊙ e^k_v))

where ĝ_k ∈ R^{K×1} is the weight of the gate and f_k(•) represents the behavior-specific fully connected layer. Further, while aggregating the information from different experts, the gating mechanism simultaneously introduces negative information from other experts. Thus, we need to extract the information that is useful for the prediction of behavior k from the other experts (e_{k'}). In detail, we leverage a projection mechanism:

e_{k',sh} = (⟨e_{k'}, e_k⟩ / ⟨e_k, e_k⟩) e_k

where e_{k',sh} represents the shared information extracted from e_{k'} with the guidance of e_k, and ⟨e_{k'}, e_k⟩ / ⟨e_k, e_k⟩ is a scalar that can be flexibly adjusted. In practice, we adjust this scalar by multiplying it by a small value. Finally, we analyse the optimization of the input (e^k_u ⊙ e^k_v), and have:

∂L / ∂(e^k_u ⊙ e^k_v) = f_k(e^k_u, e^k_v)

where f_k(x, y) denotes an expression with respect to variables x and y under behavior k. Without loss of generality, we can find that ∀k ∈ {1, 2, ..., K}, the gradient of each behavior optimizes along the direction of its respective input (e.g., the gradient of the k-th behavior optimizes e^k_u ⊙ e^k_v independently), so the gradient conflict problem is successfully solved.
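The projection step above can be sketched in a few lines (toy vectors; the projection scalar is the one the text says can be flexibly adjusted): the expert of another behavior is split into a part collinear with the guiding representation and an orthogonal remainder.

```python
import numpy as np

e_k = np.array([1.0, 0.0, 1.0])       # behavior-k input e^k_u ⊙ e^k_v (toy values)
e_other = np.array([0.5, 2.0, 0.5])   # another behavior's expert e_{k'}

scale = (e_other @ e_k) / (e_k @ e_k)  # the adjustable projection scalar
shared = scale * e_k                   # e_{k',sh}: collinear with e_k
unique = e_other - shared              # the remainder, orthogonal to e_k

print(shared)          # -> [0.5 0.  0.5]
print(unique @ e_k)    # -> 0.0 (the unique part carries no e_k-direction signal)
```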

METHODOLOGY
We devise a "Parallel Knowledge Enhancement based Framework" (PKEF) for multi-behavior recommendation, which contains two parts: (1) the Parallel Knowledge Fusion (PKF) module; (2) the Projection Disentangling Multi-Experts (PME) network. Figure 3 illustrates the technical details of the proposed framework.

Embedding Layer
In industrial applications, users and items are often denoted as high-dimensional one-hot vectors. To transform these high-dimensional sparse vectors into low-dimensional dense embeddings, we apply the embedding lookup operation for user u and item v to obtain the embedding vectors. Generally, we have:

x_u = E_u(u),   y_v = E_v(v)

where E_u ∈ R^{|U|×d} and E_v ∈ R^{|V|×d} are the embedding tables for users and items, respectively, |U| and |V| are the total numbers of users and items, x_u ∈ R^d and y_v ∈ R^d denote the embedding vectors of user u and item v, and d is the embedding size.
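A minimal lookup sketch (toy table sizes; the names are illustrative) shows that indexing the table by an integer id is equivalent to multiplying it by the one-hot id vector:

```python
import numpy as np

# Toy embedding tables standing in for E_u and E_v.
num_users, num_items, d = 4, 6, 8
rng = np.random.default_rng(0)
E_u = rng.normal(size=(num_users, d))   # user embedding table
E_v = rng.normal(size=(num_items, d))   # item embedding table

u, v = 2, 5                             # integer ids replace one-hot vectors
x_u, y_v = E_u[u], E_v[v]               # dense low-dimensional embeddings

# Row indexing equals multiplying the table by the one-hot id vector.
one_hot = np.zeros(num_users)
one_hot[u] = 1.0
print(np.allclose(one_hot @ E_u, x_u))  # -> True
```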

Parallel Knowledge Fusion
Recent multi-behavior methods ignore the imbalanced distribution of different behaviors in the fusion step. Thus the learning of these models is more inclined towards high-frequency behaviors, resulting in poor prediction of the target behavior. To solve this problem, our model combines both the cascade and parallel paradigms to learn the complex interactions between multiple behaviors. Our Parallel Knowledge Fusion (PKF) module utilizes parallel knowledge to enhance the representations of different behaviors while learning hierarchical correlation information, so as to correct the information bias caused by the imbalanced behavior interaction distribution.

Cascade Correlation Learning. As we have the adjacency matrices M_1, M_2, ..., M_K for different behaviors, for convenience, we further process each matrix as:

A_k = [[0, M_k], [M_k^⊤, 0]]

where M_k is the user-item adjacency interaction matrix of behavior k, A_k ∈ R^{(|U|+|V|)×(|U|+|V|)}, and |U| and |V| denote the numbers of users and items, respectively. As graph neural networks [14,35] have been widely used to model the high-order interactions between users and items, we adopt a GNN-based paradigm to encode the information of each behavior. Specifically, for each behavior k, we apply message passing to capture the high-order interaction information. Here, we simply leverage LightGCN [14] as the GCN aggregator to aggregate information on each layer l:

z_{k,l+1} = Â_k z_{k,l}

where z_{k,l} = z^u_{k,l} || z^v_{k,l} and (||) is the concatenation operation. Â_k = D^{-1}(A_k + I) is the left-normalized adjacency matrix with added self-connections, and D is a diagonal degree matrix defined as D_ii = Σ_j (A_k + I)_ij. I denotes an identity matrix, and the initial z_{1,0} = x_u || y_v.
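The per-behavior propagation can be sketched as follows (a dense toy graph for readability; real implementations use sparse matrices). Self-connections are added and the adjacency is left-normalized, matching Â = D^{-1}(A + I):

```python
import numpy as np

# Toy stacked user||item adjacency for one behavior (3 nodes for brevity).
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
I = np.eye(3)
D = np.diag((A + I).sum(axis=1))        # diagonal degree matrix
A_hat = np.linalg.inv(D) @ (A + I)      # left-normalized adjacency with self-loops

z = np.random.default_rng(1).normal(size=(3, 4))  # z_{k,0}: stacked embeddings
layers = [z]
for _ in range(2):                      # L_k = 2 propagation layers
    z = A_hat @ z                       # parameter-free LightGCN-style passing
    layers.append(z)

# Each row of A_hat sums to 1, so every layer is a neighborhood average.
print(np.allclose(A_hat.sum(axis=1), 1.0))   # -> True
```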
Further, following MB-CGCN [6], we conduct a cascade paradigm to learn the hierarchical correlation information of different behaviors:

z_{k+1,0} = z_{k,0} + z_{k,L_k}

where L_k denotes the total number of GNN layers of the k-th behavior. Here, we apply a residual connection that combines the first and the last layer of the upstream behavior representation as the input of the downstream behavior. For brevity, we illustrate the knowledge fusion between the cascade and parallel streams with a projection scheme.

Parallel Interaction Enhancing. In the previous part, we modeled the hierarchical correlations of different behaviors. However, as illustrated in the introduction, the imbalanced distribution of multiplex interactions impacts the learning of the target behavior. To handle this problem, we further conduct a parallel learning paradigm that independently learns the representation of each behavior and then fuses the knowledge on each layer with the corresponding layer of the cascade stream. Similar to the process of Equation 9, we apply the same propagation to each behavior:

p_{k,l+1} = Â_k p_{k,l}

where p_{k,l} = p^u_{k,l} || p^v_{k,l}, (||) is the concatenation operation, Â_k is the same as in Equation 9, and the initial p_{k,0} = x_u || y_v.
Then we devise two schemes to fuse the knowledge between the parallel and cascade streams, improving Equation 9. Besides, we conduct comparison experiments with other schemes (shown in Section 5.3.2). For simplicity, we denote e^p_{k,l} = Â_k p_{k,l} and e^c_{k,l} = Â_k z_{k,l}.
(1) Projection-enhanced Knowledge Fusion. This scheme is inspired by DUMN [2], which uses a representation projection mechanism to decouple the implicit feedback representation with the guidance of the explicit feedback representation. Here, on each layer, we project the parallel representation onto the cascade representation and use the part that is collinear with it to enhance the cascade representation:

z_{k,l+1} = e^c_{k,l} + (⟨e^p_{k,l}, e^c_{k,l}⟩ / ⟨e^c_{k,l}, e^c_{k,l}⟩) e^c_{k,l}

where ⟨•, •⟩ is the vector inner product operation, and e^p_{k,l} and e^c_{k,l} are the representations of the parallel and cascade streams, respectively. The enhanced representation contains a mixture of behavior-specific and hierarchical correlation information.
(2) Vanilla-enhanced Knowledge Fusion. Meanwhile, inspired by vanilla attention [43], we devise a fusion scheme of a similar form:

α_{k,l} = softmax(W_{k,l} [e^p_{k,l} ; e^c_{k,l} ; e^p_{k,l} − e^c_{k,l} ; e^p_{k,l} ⊙ e^c_{k,l}] + b_{k,l})

where ⊙ is the Hadamard product operation, W_{k,l} ∈ R^{4d×4} and b_{k,l} ∈ R^{4×1} are the feature transformation matrix and bias, and d is the dimension of the embedding. The four feature views are then aggregated by the weights α_{k,l} to obtain z_{k,l+1}.
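A hedged sketch of this scheme, assuming the four feature views are scored by a learned transformation and mixed with softmax weights (W and b here are random stand-ins for the learned W_{k,l} and b_{k,l}):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4
rng = np.random.default_rng(2)
e_p, e_c = rng.normal(size=d), rng.normal(size=d)   # parallel / cascade reps

# Four feature views of the pair, as in the fusion formula.
views = np.stack([e_p, e_c, e_p - e_c, e_p * e_c])  # shape (4, d)

# Illustrative stand-ins for the learned parameters.
W, b = rng.normal(size=(d, 4)), rng.normal(size=4)
scores = np.einsum('vd,dv->v', views, W) + b        # one score per view
alpha = softmax(scores)                             # fusion weights, sum to 1
z_next = alpha @ views                              # fused representation

print(np.isclose(alpha.sum(), 1.0))                 # -> True
```

The softmax keeps the fusion a convex combination of the views, which is what "weights the fusion representations at different scales" suggests.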
For the output of each behavior, we combine the layer representations:

z_k = Σ_{l=0}^{L_k} z_{k,l}

where L_k is the number of GNN layers of the k-th behavior.

Projection Disentangling Multi-Experts Network
As we have obtained the representation of each behavior k in the previous section, we need a proper structure to further leverage the multiplex signals carried by these representations. It has been verified in many methods [3,24,32] that a multi-task learning module can handle this well. A typical MTL structure first couples the representations of all behaviors, then generates several experts from the coupled input, further applies a gating mechanism to aggregate the expert information as the output, and finally utilizes the prediction losses of different behaviors to jointly optimize the model. However, the existing MTL structures utilize a coupled representation as the input and introduce noise from other behaviors when using gating mechanisms to aggregate information from different experts. This leads to gradient conflict during the learning process. Thus, we propose a well-designed MTL module to handle the above problems, detailed as follows.

Generation of Experts.
As a coupled input contains mixed information of different behaviors, making the gradients coupled and conflicting, we do not combine the multi-behavioral representations. We directly leverage each representation to generate the behavior-specific expert:

q_k = z^u_k ⊙ z^v_k

where ⊙ is the Hadamard product operation.

Aggregation of Experts.
In order to alleviate the negative information transfer from other behavior-specific experts, we improve the gating mechanism with a representation projection mechanism. Taking behavior k as an example, we have:

q_{k',sh} = (⟨q_{k'}, q_k⟩ / ⟨q_k, q_k⟩) q_k,   q_{k',uni} = q_{k'} − q_{k',sh}

where ⟨•, •⟩ is the vector inner product operation, and q_{k'} and q_k are the representations of the k'-th and k-th behaviors, respectively. q_{k',sh} contains a mixture of the k'-th and k-th behavioral correlation information. q_{k',uni}, which represents the unique part of the k'-th behavior, and q_k, which denotes the k-th behavior, are distinctive and should be as orthogonal as possible.
As we can see, the projection mechanism disentangles q_{k'} under the guidance of q_k, so the shared and unique parts of other behaviors can be further utilized to alleviate the negative transfer caused by gating aggregation. To be specific, we take the shared representations q_{k',sh} (k' ∈ {1, 2, ..., K}, k' ≠ k) of the other behaviors and q_k as the targets of aggregation by the k-th gate:

g_k = softmax(W_k q_k + b_k),   ô^c_{uv,k} = h_k(g_k[k] q_k + Σ_{k'≠k} g_k[k'] q_{k',sh})

where W_k ∈ R^{K×d} and b_k ∈ R^{K×1} are the feature transformation matrix and bias matrix, and g_k ∈ R^{K×1} is the attention vector used as a selector to calculate the weighted sum of all experts. h_k(•) is the tower function, and ô^c_{uv,k} is the prediction score of whether user u will interact with item v under behavior k in the cascade stream.
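A toy sketch of this aggregation (random experts and stand-in gate weights rather than the learned g_k), illustrating why only q_k-collinear information reaches the tower of behavior k:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
d, K = 4, 3
experts = rng.normal(size=(K, d))   # behavior-specific experts q_1..q_K
k = 2                               # target behavior index
q_k = experts[k]

inputs = []
for kp in range(K):
    if kp == k:
        inputs.append(experts[kp])               # keep the target expert as-is
    else:
        scale = (experts[kp] @ q_k) / (q_k @ q_k)
        inputs.append(scale * q_k)               # shared part q_{k',sh} only

g = softmax(rng.normal(size=K))                  # stand-in gate weights
aggregated = sum(w * x for w, x in zip(g, inputs))

# Every term is collinear with q_k, so nothing orthogonal to q_k
# (i.e., off-behavior noise) reaches behavior k's tower input.
residual = aggregated - ((aggregated @ q_k) / (q_k @ q_k)) * q_k
print(np.allclose(residual, 0.0))   # -> True
```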
Besides, we design a prediction task for the unique representation q_{k',uni}. This task takes full advantage of the mutually exclusive relationship between q_{k',uni} and q_k, facilitating the learning of the k-th behavior. Details are shown in Section 4.4.3.

Joint Optimization
4.4.1 Parallel Loss. As we have obtained multi-behavioral representations from the parallel stream, we design a parallel loss to aid the learning of each representation. In detail, we apply a Bayesian Personalized Ranking (BPR) [28] loss:

L_P = Σ_{k=1}^{K} Σ_{(u,s,t)} −ln σ(ô^p_{us,k} − ô^p_{ut,k})

where ô^p_{uv,k} is the prediction score of whether user u will interact with item v under behavior k in the parallel stream, and (u,s) and (u,t) are sampled from the observed and unobserved interactions of behavior k, respectively.
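A minimal BPR sketch with toy scores (the sampling of positive/negative pairs is omitted): the loss rewards ranking an observed item above a sampled negative.

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """-ln sigmoid(pos - neg), averaged over the sampled triples."""
    diff = pos_scores - neg_scores
    return float(np.mean(-np.log(1.0 / (1.0 + np.exp(-diff)))))

pos = np.array([2.0, 1.5])   # scores for observed (u, s) pairs
neg = np.array([0.5, 1.0])   # scores for sampled (u, t) negatives

loss = bpr_loss(pos, neg)
print(loss > 0)                                  # -> True
# Swapping positives and negatives (a worse ranking) raises the loss.
print(bpr_loss(pos, neg) < bpr_loss(neg, pos))   # -> True
```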

4.4.2 Cascade Loss. Similar to the above, we devise a cascade loss for the cascade stream. As we have obtained the final prediction ô^c_{uv,k}, we have:

L_C = Σ_{k=1}^{K} Σ_{(u,s,t)} −ln σ(ô^c_{us,k} − ô^c_{ut,k})

where the definitions of the parameters are the same as in the parallel loss.

4.4.3 Unique Loss. To make full use of the unique representation q_{k',uni}, we design an auxiliary prediction task. Specifically, we leverage q_{k',uni} to predict the interactive information of "k' without k". In detail, we have:

L_U = Σ_{k'≠k} Σ_{(u,s,t)} −ln σ(ô_{us,k',uni} − ô_{ut,k',uni})

where ô_{uv,k',uni} = ⟨q^u_{k',uni}, q^v_{k',uni}⟩ is the prediction score, O^+_{k'} indicates the observed positive user-item interactions under behavior k' but not under behavior k, and O^−_{k'} indicates the unobserved user-item interactions under behavior k'. In short, we remove from the behavioral adjacency matrix M_{k'} the positive items that M_{k'} shares with M_k. Thus, we fully utilize the "Only k'" interactive information with the help of the unique representations of behavior k'.
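The "k' without k" positives can be built with a plain set difference (toy (user, item) pairs; behaviors named for illustration):

```python
# Positives of behavior k' (cart) that do NOT also appear under behavior k (buy).
cart = {(0, 10), (0, 11), (1, 12)}   # toy (user, item) pairs under behavior k'
buy = {(0, 10)}                      # toy pairs under behavior k

only_cart = cart - buy               # positives for the auxiliary unique-loss task
print(sorted(only_cart))             # -> [(0, 11), (1, 12)]
```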
In all, the final loss can be formulated as:

L = L_C + L_P + L_U + μ ||Θ||²₂

where Θ represents the set of all model parameters and μ is the L2 regularization coefficient for Θ.

Time Complexity.
The computational cost of PKEF is dominated by the graph propagation, which is O(Σ_{k=1}^{K} |E_k| × L_k × d), where |E_k| denotes the number of edges across all graphs in the edge set E_k, K denotes the behavior number, L_k refers to the number of GNN layers of the k-th behavior, and d represents the embedding size. Overall, the time complexity of PKEF is comparable to that of existing GNN-based methods.

Space Complexity.
The learnable parameters in our proposed PKEF primarily come from the user and item embeddings, denoted as x_u and y_v respectively. This is similar to existing methods. Furthermore, the dense graphs in the set G are transformed into sparse behavior-specific matrices M_1, M_2, ..., M_K for computational purposes. This transformation allows us to perform computations without requiring additional memory space to store the dense graphs. Hence, the memory usage during the intermediate process remains within an acceptable range.

EXPERIMENTS

5.1 Experimental Setting
5.1.1 Dataset Description. We follow MB-CGCN [6] and CRGCN [40], and adopt the same three datasets for evaluation, i.e., Beibei, Taobao and Tmall. For these datasets, we adhere to previous studies' methodology of removing duplicates by retaining the earliest entry [9,19]. Table 1 provides a summary of the statistics of the three datasets used in our experiments.

5.1.2 Evaluation Protocols. In all our experiments, we assess the performance of our proposed PKEF model and the baseline models based on the top-k recommended items, using two evaluation metrics: Hit Ratio (HR@k) and Normalized Discounted Cumulative Gain (NDCG@k). Specifically, we set k = 10 for our evaluations.
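A common leave-one-out formulation of these metrics, with a single held-out positive item per user (a standard protocol sketch; the paper's exact evaluation details may differ), can be written as:

```python
import numpy as np

def hr_ndcg_at_k(rank, k=10):
    """rank: 0-based position of the held-out positive in the ranked list."""
    if rank >= k:
        return 0.0, 0.0          # the positive did not make the top-k list
    hr = 1.0                     # hit: the positive appears in the top-k
    ndcg = 1.0 / np.log2(rank + 2)  # discounted gain; 1.0 when ranked first
    return hr, ndcg

print(hr_ndcg_at_k(0))    # positive ranked first  -> (1.0, 1.0)
print(hr_ndcg_at_k(12))   # outside the top-10     -> (0.0, 0.0)
```

Per-user values are then averaged over all test users to obtain HR@10 and NDCG@10.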

Performance Comparison
Table 2 shows the performance of the methods on the three datasets with respect to HR@10 and NDCG@10. We have the following findings:
• Our PKEF model achieves the best performance across all three datasets. Specifically, in terms of HR and NDCG metrics, PKEF outperforms the best baselines on the Beibei, Taobao, and Tmall datasets by 95.16%, 12.33%, 29.78% and 38.58%, 15.95%, 29.21%, respectively. Our PKEF model demonstrates significant enhancements in recommendation accuracy, particularly when compared to the best baseline, MB-CGCN. This substantial progress highlights the effectiveness of our model.
• Multi-behavior models perform better than single-behavior models. For example, MBGCN performs better than LightGCN. This indicates the superiority of utilizing multiple types of interactions.
• LightGCN consistently outperforms MF-BPR and NeuMF, while MBGCN outperforms NMTR. This demonstrates the advantage of the GCN model, which leverages high-order neighbor information on the user-item bipartite graph to learn embeddings for users and items.
• Finally, GNMR and MBGCN outperform RGCN by considering the contribution of each behavior in the multi-behavioral fusion step. Compared to NMTR and MBGCN, which only propose parallel learning paradigms during behavior fusion, CRGCN and MB-CGCN explicitly incorporate the cascade relationships of multiple behaviors during the fusion step, achieving performance that is second only to our model. This indicates the necessity of considering the hierarchical correlation between behaviors.
The performance of PKEF and its variants is summarized in Table 3, and we come to these conclusions:

Ablation Study
• Comparing the performance of PKEF and its two variants, we find that each variant brings about performance degradation when any key component is removed or replaced with other modules. This demonstrates the rationality and effectiveness of the two key designs.
• It is worth noticing that the Base Model achieves the worst performance on all three datasets compared to the other variants with multi-behavior learning. In particular, this variant has a performance decline of up to 35.04%, 38.70%, and 35.08% in terms of HR (38.14%, 42.42%, and 35.37% in terms of NDCG) on the Beibei, Taobao, and Tmall datasets. This further demonstrates the effectiveness of the combination of PKF and PME for solving the multi-behavior recommendation problem.

Impact of the Knowledge Fusion Schemes.
To further explore the forms of knowledge fusion between the parallel and cascade streams, we devise two alternative schemes for general usage (illustrated in Section 4.2.2). Besides, we make a comparison between the two proposed schemes and simple Summation (simply adding the representations of the different streams) and Linear Trans. (applying a linear transformation to transfer the parallel knowledge). As the comparison results show, Summation performs the worst among the four schemes. A probable reason is that the distributions of the two streams' representations are completely different, so simple summation may harm the distribution of the representations. Besides, Linear Trans. transfers the knowledge in an implicit way, which may lead to negative information transfer when transferring the parallel knowledge. Vanilla Fusion weights the fusion representations at different scales, alleviating the impact of the representation distribution. The Projection Fusion utilizes a projection mechanism to explicitly extract the useful information from the parallel knowledge, and thus obtains the best performance on all three datasets.

Impact of the MTL module.
To further demonstrate the superiority of our proposed PME in Multi-Task Learning (MTL), we compare it with four state-of-the-art MTL models: Shared Bottom [3], Bilinear [4], MMOE [24], and PLE [32]. To ensure a fair comparison, these models are applied on top of PKF for multi-behavior recommendation. Table 5 summarizes the results. PKF+SB performs the worst among all MTL models across all datasets. PKF+Bilinear, which replaces the neural network's prediction head with a lightweight matrix transformation, shows better performance, likely due to a reduced risk of overfitting. Both PKF+MMOE and PKF+PLE employ gate networks with adaptive attention weights for information fusion, outperforming the static and equally weighted PKF+SB. Notably, our PME consistently outperforms all other models on all datasets, reaffirming its effectiveness for MTL tasks.

Parameter Analysis
5.4.1 Impact of the number of layers. We investigate the impact of higher-order information on model performance by varying the number of GNN layers. Specifically, we search the layer numbers in the range of {1, 2, 3, 4} and use different numbers of layers for different behaviors. The experimental results are shown in Figure 4, where the number on each block indicates the layer number for the buy behavior that achieves the best performance while keeping the layer numbers fixed for the view and cart behaviors. Due to lack of space, we only show the results on Beibei and Taobao; the results on the other dataset are similar.
Based on the results, it is evident that for both datasets, PKEF achieves the highest performance when the GNN layers are configured as (4, 1, 1). Furthermore, the influence of stacking different numbers of layers varies across behaviors. There is a tendency to use deeper propagation layers for the view graph and shallower layers for downstream behaviors such as buy. One possible reason is that the view behavior contains richer interaction information and requires more layers to capture higher-order information for learning better user preferences, whereas in downstream behaviors with sparse interactions, excessive layers may introduce noise and lead to overfitting.

5.4.2 Impact of the coefficients of different behaviors. We investigate the impact of the behavioral coefficient parameter λ on the performance of PKEF. There are three behavior types in Beibei and Taobao (view, cart, and buy), corresponding to three loss coefficients λ1, λ2, and λ3. The value of λ3 is determined once λ1 and λ2 are given. We use grid search in the range {0, 1/6, 2/6, 3/6, 4/6, 5/6, 1} and plot the NDCG@10 results (shown in Figure 5). For both datasets, PKEF achieves the best performance with the coefficients set to (0, 4/6, 2/6), and the performance remains relatively consistent across different settings. This indicates that the model can effectively adapt to different data distributions and has good generalization ability. The results on the Tmall dataset, which we omit due to space constraints, lead to similar conclusions.
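The coefficient search above can be sketched as follows, with λ3 determined once λ1 and λ2 are given; `coefficient_grid` is an illustrative helper, not the paper's code:

```python
from itertools import product

def coefficient_grid(step=6):
    """Enumerate (lam1, lam2, lam3) on the grid {0, 1/6, ..., 1}
    with lam1 + lam2 + lam3 = 1; lam3 is fully determined by the
    first two coefficients, so only pairs are searched."""
    grid = [i / step for i in range(step + 1)]
    return [(l1, l2, 1.0 - l1 - l2)
            for l1, l2 in product(grid, grid)
            if l1 + l2 <= 1.0 + 1e-9]
```

For a step of 6 this yields 28 valid configurations, including the best-performing (0, 4/6, 2/6) reported above.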

Case Study under Different Behavioral Correlations.
We experimentally verify whether our model can alleviate potential gradient conflicts. Specifically, we divide the test users into five groups according to the average Pearson correlation among all behaviors and select subsets from each user group. To prevent the node degree from influencing the results [35], we keep a similar average number of user interactions across subsets while maximizing the number of users in each subset. For more rigorous results, we run each experiment 5 times on each dataset and plot the mean and fluctuation range. The experimental results on the Beibei and Taobao datasets are shown in Figure 6. We find that PME consistently outperforms all other MTL methods across all user groups, further demonstrating its superiority for MTL. Additionally, as the Pearson correlation increases, the performance of PME grows more rapidly than that of the other MTL methods, while the other methods even fluctuate and decline. A possible reason is the negative transfer caused by potential gradient conflicts when knowledge is transferred across different tasks. We omit the results on the Tmall dataset due to space limitations; they lead to consistent conclusions.
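The grouping statistic can be sketched as follows; `avg_behavior_correlation` is an illustrative helper operating on one user's binary interaction vectors, not the paper's code:

```python
import numpy as np

def avg_behavior_correlation(interactions):
    """interactions: (n_behaviors, n_items) 0/1 matrix for one user.
    Returns the mean pairwise Pearson correlation over all behavior
    pairs, the statistic used to bucket users into five groups."""
    n = interactions.shape[0]
    corrs = []
    for a in range(n):
        for b in range(a + 1, n):
            x, y = interactions[a], interactions[b]
            if x.std() == 0 or y.std() == 0:
                continue  # constant vector: correlation undefined
            corrs.append(np.corrcoef(x, y)[0, 1])
    return float(np.mean(corrs)) if corrs else 0.0
```

Users whose behaviors perfectly coincide score 1.0, perfectly complementary behaviors score -1.0, and degenerate (constant) behaviors are skipped.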

Visualization of Gating Aggregation.
We conduct experiments to compare the expert utilization between our PME model and other gate-based models (MMOE and PLE). By visualizing the average weight distribution of the experts used for predicting the target behavior (shown in Figure 7), we observe that PME achieves better differentiation among experts than MMOE and PLE. We exclude the gates used for other behaviors in our analysis to focus solely on predicting the interaction probability of the target behavior. Besides, to ensure a fair comparison, for MMOE and PLE we fix the number of experts to 4 in Tmall and 3 in Beibei and Taobao.
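The visualized quantity can be sketched as follows; `expert_utilization_stats` is an illustrative helper (not the paper's code) that also reports entropy as a scalar measure of how differentiated the experts are:

```python
import numpy as np

def expert_utilization_stats(gate_weights):
    """gate_weights: (n_users, n_experts) softmax weights from the
    gate used for the target behavior. Returns the per-expert mean
    weight and its entropy; lower entropy means sharper
    differentiation among experts."""
    w = np.asarray(gate_weights, dtype=float)
    mean_w = w.mean(axis=0)
    mean_w = mean_w / mean_w.sum()  # renormalize for display
    entropy = -np.sum(mean_w * np.log(mean_w + 1e-12))
    return mean_w, entropy
```

A uniform utilization (every expert weighted equally) attains the maximum entropy log(n_experts); a sharply specialized gate yields a lower value.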

CONCLUSION
In this paper, we propose the Parallel Knowledge Enhancement based Framework (PKEF) for multi-behavior recommendation. To handle the problems of existing multi-behavior approaches, we devise a Parallel Knowledge Fusion (PKF) module and a Projection Disentangling Multi-Experts network (PME). PKF combines the cascade and parallel paradigms to enhance behavior representations, addressing the information bias caused by imbalanced behavioral interactions. PME treats each behavior as an independent task, generating behavior-specific expert information from separate inputs. Besides, for each behavior, it leverages a projection mechanism to disentangle the shared and specific parts from other behaviors, aggregates the shared part, and designs an auxiliary loss to further utilize the unique part. Thus, negative transfer is significantly alleviated. Further, we perform extensive experiments on three real-world datasets to validate the effectiveness of PKEF. The results provide further evidence of the rationality and effectiveness of the designed PKF and PME modules.

Figure 1: Histogram of user numbers w.r.t. interaction numbers for different behaviors.

Figure 3: Illustration of the proposed PKEF framework. (⊕) denotes the element-wise addition operation. Lines of different colors correspond to representations of the same colors in the Projection module. For brevity, we illustrate the knowledge fusion between the cascade and parallel streams with the projection scheme.
$\mathcal{O}_k = \{(u, i, j) \mid (u, i) \in \mathcal{O}_k^{+}, (u, j) \in \mathcal{O}_k^{-}\}$ denotes the training dataset, where $\mathcal{O}_k^{+}$ indicates observed positive user-item interactions under behavior $k$ and $\mathcal{O}_k^{-}$ indicates unobserved user-item interactions under behavior $k$. $\lambda_k$ is the coefficient of behavior $k$, and $\sigma$ is the sigmoid function.
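Given this notation, one standard instantiation consistent with these symbols is a BPR-style pairwise objective (the exact form used in the paper may differ; this is a reconstruction for readability):

```latex
\mathcal{L} = -\sum_{k=1}^{K} \lambda_k
  \sum_{(u, i, j) \in \mathcal{O}_k}
  \ln \sigma\!\left(\hat{y}^{k}_{ui} - \hat{y}^{k}_{uj}\right)
```

Each behavior $k$ contributes a pairwise ranking term weighted by its coefficient $\lambda_k$, pushing predicted scores of observed interactions above unobserved ones.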

4.5.1 Time Complexity. The time complexity of PKEF primarily lies in the GNN parts, which consist of the cascade and parallel streams. Both the cascade and parallel parts have a computational complexity
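As general context (this is not the paper's stated bound, which is truncated above), LightGCN-style message passing over a user-item graph with edge set $\mathcal{E}$, embedding dimension $d$, and $L$ propagation layers typically costs:

```latex
O\big(L \cdot |\mathcal{E}| \cdot d\big)
```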

5.3.1 Impact of the Key Components. To evaluate the effectiveness of the sub-modules in our PKEF framework, we consider three model variants: (1) Base Model: we remove both the PKF and PME parts, so that the model only has the cascade stream and utilizes a bilinear paradigm (separate inputs with a light-weight matrix transformation); (2) PKEF w/o PKF: the PKF part is removed; (3) PKEF w/o PME: the PME part is replaced with the bilinear module.

Figure 4: Impact of GNN Layers for different behaviors.

Figure 5: Impact of the Behavioral Coefficients.

Figure 6: Average performances for user groups with different behavior correlations.

Figure 7: Expert utilization in gate-based models.

Table 2: The overall performance comparison. Boldface denotes the highest score and underline indicates the results of the best baselines. ★ represents a significance level of p-value < 0.05 when comparing PKEF with the best baseline.

Table 4: Performances of different knowledge fusion schemes.

Table 5: Performances of different MTL modules.