Multi-behavior Self-supervised Learning for Recommendation

Modern recommender systems often deal with a variety of user interactions, e.g., click, forward, purchase, etc., which requires the underlying recommender engines to fully understand and leverage multi-behavior data from users. Despite recent efforts towards making use of heterogeneous data, multi-behavior recommendation still faces great challenges. Firstly, sparse target signals and noisy auxiliary interactions remain an issue. Secondly, existing methods that utilize self-supervised learning (SSL) to tackle the data sparsity neglect the serious optimization imbalance between the SSL task and the target task. Hence, we propose a Multi-Behavior Self-Supervised Learning (MBSSL) framework together with an adaptive optimization method. Specifically, we devise a behavior-aware graph neural network incorporating the self-attention mechanism to capture behavior multiplicity and dependencies. To increase the robustness to data sparsity under the target behavior and noisy interactions from auxiliary behaviors, we propose a novel self-supervised learning paradigm to conduct node self-discrimination at both inter-behavior and intra-behavior levels. In addition, we develop a customized optimization strategy through hybrid manipulation on gradients to adaptively balance the self-supervised learning task and the main supervised recommendation task. Extensive experiments on five real-world datasets demonstrate the consistent improvements obtained by MBSSL over ten state-of-the-art (SOTA) baselines. We release our model implementation at: https://github.com/Scofield666/MBSSL.git.


INTRODUCTION
Recommender systems have emerged as indispensable means of promoting personalized suggestions for users in a variety of applications, ranging from e-commerce platforms [11] and online video websites [3] to location-based services [1]. Collaborative Filtering (CF) is the most extensively adopted paradigm for recommendation, which has developed from traditional Matrix Factorization algorithms [15] to novel Neural Network (NN) architectures like Autoencoders [24] or Graph Neural Networks (GNNs) [13, 29].
Nevertheless, most CF solutions are designed based on a single type of user-item interaction behavior, whereas real-world recommendation is more like a multi-behavior setting, i.e., the interaction behaviors between users and items are multi-typed, with one target behavior to be optimized [34, 35]. For example, in e-commerce services, users can interact with items in multiple manners, including page view, favorite and purchase, among which the purchase behavior is the optimization target as it is directly related to Gross Merchandise Value (GMV).
Up to now, great efforts on multi-behavior recommendation have been made towards modeling the complex semantics of different interaction behaviors. MATN [34] and MBGMN [36] encode interaction pattern multiplicity and behavior dependencies via a memory-augmented transformer and a graph meta network, respectively. To characterize the discriminative semantics of different behaviors, some research works [4, 6] learn one specific representation for each behavior. In addition, one recent study, CML [31], first incorporates contrastive learning to tackle the label scarcity problem for the target behavior. Despite their success, multi-behavior recommendation still encounters the following challenges.

Figure 1: The overall architecture of MBSSL. i) The behavior-aware graph neural network, which performs behavior-specific embedding propagation and cross-behavior dependency modeling. ii) Inter-behavior SSL, contrasting target and auxiliary subgraphs to alleviate the skewed data distribution and data sparsity under the target behavior. iii) Intra-behavior SSL, contrasting the target and Edge Dropout (ED) subgraphs to counteract interaction noises from auxiliary behaviors.
The robustness against data sparsity and interaction noises.
In spite of the fact that interaction data of auxiliary behaviors could offer complementary information for recommendation on the target behavior, the data sparsity under the target behavior remains an issue. One possible solution, proposed in CML [31], makes full use of supervision signals from auxiliary behaviors by conducting contrastive learning between each auxiliary behavior and the target behavior. However, auxiliary behaviors may at the same time contain noisy interactions that are detrimental to the target task. Hence, simply adopting the contrastive learning paradigm in CML is likely to exacerbate the negative transfer towards noise distributions in auxiliary behaviors, greatly undermining the true semantics of the target behavior. In this regard, an approach that makes comprehensive and adaptive use of interaction data plays a vital role in performance enhancement.
The optimization imbalance between auxiliary tasks and the target task. Existing multi-behavior recommendation solutions basically adopt the Multi-task Learning (MTL) paradigm to optimize the auxiliary and target tasks jointly [4, 6, 9, 18, 34]. However, ignoring the contribution of each task to the optimization target leads to a serious optimization imbalance, where auxiliary tasks may dominate the network weights, resulting in worse performance on the target task. What's more, existing multi-task learning methods like GradNorm [8] or PCGrad [40] do not apply to the scenario where a self-supervised learning (SSL) task is treated as the auxiliary task, in that the SSL task has a confounding effect on the target task, depending on the particular design of the SSL. Therefore, another key problem in multi-behavior recommendation is the elaborate design of the optimization method to mitigate the optimization imbalance between the auxiliary and target tasks.
Towards this end, we propose a Multi-Behavior Self-Supervised Learning solution, abbreviated as MBSSL, to improve the performance of multi-behavior recommendation.
Specifically, we first devise a behavior-aware graph neural network augmented with behavior representation learning and the self-attention mechanism to jointly model the behavior inner-context and inter-behavior dependencies. To deal with the sparse supervision signals under the target behavior, a comprehensive self-supervised learning paradigm is introduced to contrast nodes at the inter-behavior and intra-behavior levels respectively. The inter-behavior SSL transfers informative semantics from auxiliary behaviors to the target behavior via selectively constructing negative node pairs. To further increase the robustness to noisy interactions, the intra-behavior SSL consolidates the self-supervised information in the target behavior to counteract the potential negative transfer brought by the inter-behavior SSL. In addition, based on the observation that the SSL task exhibits an arbitrary optimization trend with respect to the target task, we design a multi-behavior optimization method that hybridly rectifies the direction and magnitude of gradients to balance the SSL tasks and the target task during optimization.
In a nutshell, our main contributions are as follows:
• We develop a new self-supervised learning framework named MBSSL for multi-behavior recommendation, which embodies a behavior-aware graph neural network (Section 3.1) to uncover latent cross-behavior dependencies and a comprehensive SSL paradigm (Section 3.2) at the inter-behavior and intra-behavior levels to alleviate the problems of data sparsity and interaction noises.
• To the best of our knowledge, we are the first to study the optimization imbalance problem between the SSL task and the target recommendation task. Accordingly, we propose a hybrid gradient manipulation method on both the magnitude and direction to adjust the optimization trend (Section 3.3).
• Extensive experiments are conducted on five real-world datasets against ten SOTA baselines (Section 4). The consistent performance uplift on two representative metrics demonstrates the effectiveness of MBSSL.

PRELIMINARY
In typical recommender systems, we define the sets of users and items as U (u ∈ U) and I (i ∈ I) respectively, where |U| and |I| represent the numbers of users and items. G = (V, E) is a bipartite graph based on the user-item interactions, where V = U ∪ I denotes the whole node set involving users and items, and the edge set E represents the observed interactions. In the multi-behavior scenario with K types of behaviors, the whole bipartite graph G can be segmented into K behavior subgraphs {G_1, G_2, ..., G_K} based on the interaction behavior type. To reflect the multi-behavior interaction data, we define a tensor X ∈ R^{|U|×|I|×K}, where the individual element x_{u,i,k} ∈ X equals 1 if user u interacted with item i under behavior k, and x_{u,i,k} = 0 otherwise. Therefore, the task of multi-behavior recommendation is formulated as: Input: The multi-behavior interaction tensor X ∈ R^{|U|×|I|×K} under the K types of behaviors.
Output: A recommendation model that estimates the probability that a user u interacts with item i under the K-th behavior, i.e., the target behavior.
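To make the formulation concrete, the segmentation of the interaction data into K behavior subgraphs can be sketched as follows. This is a hedged illustration; the function and variable names (`build_subgraphs`, `triples`) are ours, not from the paper's released implementation.

```python
from collections import defaultdict

def build_subgraphs(triples, num_behaviors):
    """Split (user, item, behavior) triples into K behavior edge sets G_1..G_K.

    Each triple (u, i, k) corresponds to x_{u,i,k} = 1 in the interaction
    tensor; subgraph k collects all edges observed under behavior k."""
    subgraphs = defaultdict(set)
    for u, i, k in triples:
        subgraphs[k].add((u, i))
    return [subgraphs[k] for k in range(num_behaviors)]

# Illustrative toy data: behavior 0 = page view, 1 = cart, 2 = purchase (target)
triples = [(0, 1, 0), (0, 1, 2), (1, 2, 1)]
graphs = build_subgraphs(triples, 3)
```

The target-behavior subgraph (here `graphs[2]`) is typically the sparsest one, which motivates the auxiliary-behavior knowledge transfer described later.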

METHODOLOGY
In this section, our MBSSL framework is presented in detail. The architecture of MBSSL is depicted in Figure 1, which encapsulates three key components: i) a behavior-aware graph neural network, which collectively captures behavioral context and dependencies (Section 3.1); ii) inter-behavior self-supervised learning to facilitate knowledge transfer; and iii) intra-behavior self-supervised learning to counteract noisy interactions.

Behavior-aware Graph Neural Network
Motivated by the message-passing architecture in GNNs, we develop a behavior-aware graph neural network to capture complex CF signals between nodes and among behaviors. More concretely, we first obtain contextualized node embeddings under each behavior subgraph via behavior-specific embedding propagation. Thereafter, we enhance the embeddings with personalized behavioral correlations via cross-behavior dependency modeling.

3.1.1 Behavior-specific Embedding Propagation. Under such a multi-behavior setting, we first construct each bipartite behavior subgraph G_k according to the interaction behavior type k, and then we perform embedding propagation on each subgraph to obtain the representation of each node under each behavior. In order to explicitly manifest the discriminative semantics of each behavior and capture the contextualized user preference, we also embed behaviors on top of nodes [4, 6, 23] and incorporate the representations of each behavior into the message-passing paradigm, as formalized in Equation (2).

3.1.2 Cross-behavior Dependency Modeling. Given that different behaviors interweave with each other in an implicit manner and the correlations among behaviors vary by user, we leverage the self-attention [20] mechanism to model the cross-behavior dependency. Specifically, we concatenate the embeddings of node u under all the behaviors as C_u ∈ R^{K×d}, and then the coefficients a_{u,k} ∈ R^K, reflecting the dependency between behavior k and the other behaviors for user u, are computed via a softmax over query-key scores (Equation (3)), where the query and key projections are two behavior-specific parameter matrices and d′ is the output dimension size. Hence, the enhanced embedding of node u under behavior k can be readily calculated as the attention-weighted combination in Equation (4). Since the embeddings of different layers express different connections, we utilize mean-pooling to integrate the embeddings of all layers (Equation (5)).
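As a rough illustration of the cross-behavior dependency modeling described above, the following NumPy sketch computes softmax attention coefficients over the K behavior-specific embeddings of a single node. The projection matrices `W_q`, `W_k` and the dimension names are assumptions for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

def cross_behavior_attention(C, W_q, W_k):
    """C: (K, d) stacked behavior-specific embeddings of one node.
    Returns (K, K) softmax coefficients: row k holds the dependencies
    of behavior k on every behavior."""
    Q, Kt = C @ W_q, C @ W_k                      # (K, d') projections
    scores = Q @ Kt.T / np.sqrt(W_q.shape[1])     # scaled dot-product scores
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    ex = np.exp(scores)
    return ex / ex.sum(axis=1, keepdims=True)     # row-wise softmax

rng = np.random.default_rng(0)
K, d, d_out = 3, 8, 4
C = rng.normal(size=(K, d))
A = cross_behavior_attention(C, rng.normal(size=(d, d_out)),
                             rng.normal(size=(d, d_out)))
enhanced = A @ C   # attention-weighted, behavior-enhanced embeddings
```

The enhanced embedding of a node under behavior k is then the k-th row of `enhanced`, i.e., a dependency-weighted mixture of all behavior-specific embeddings.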

Multi-behavior Self-supervised Learning
As mentioned before, sparse supervision signals of the target behavior may lead to severe bias in the learned representations compared with those of auxiliary behaviors. Besides, overlooking the noisy interactions brought from auxiliary behaviors would exaggerate the immoderate reliance on certain interactions. Accordingly, we introduce a novel self-supervised learning paradigm to conduct self-discrimination contrastive learning at both the inter-behavior and intra-behavior levels.

3.2.1 Inter-behavior Self-supervised Learning. In view of the fact that supervision signals in auxiliary behaviors are much richer than those in the target behavior, we perform selective contrastive learning between auxiliary behaviors and the target behavior to enable knowledge transfer, and thus alleviate the data sparsity in the first step.
In particular, each auxiliary behavior k ∈ {1, 2, ..., K−1} is contrasted with the target behavior K to provide distinct semantics. The common practice treats the views of the same node as positive pairs (i.e., {(e_{u,k}, e_{u,K}) | u ∈ U}) and the views of any two different nodes as negative pairs (i.e., {(e_{u,k}, e_{v,K}) | u, v ∈ U, u ≠ v}). However, in the recommendation setting, which encapsulates various interactions, two distinct subjects may still have commonalities (e.g., users sharing similar preferences or items having similar attributes). In this case, the negative pairs constructed by the common practice are likely to include many false negatives (i.e., highly similar nodes), which discards true semantic information [17]. Therefore, we propose to find potential false negatives based on similarity scores calculated with the swing algorithm [37] and to eliminate them when contrasting node pairs.
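A minimal sketch of this contrast between two behaviors, excluding precomputed false negatives from the denominator, is shown below. We assume inner-product similarity and a per-user false-negative set; the names (`inter_behavior_infonce`, `false_neg`) are illustrative, not the paper's code.

```python
import numpy as np

def inter_behavior_infonce(E_k, E_K, false_neg, tau=0.2):
    """Contrast auxiliary-behavior embeddings E_k with target-behavior
    embeddings E_K (both (n, d)). For user u, the positive is
    (E_k[u], E_K[u]); negatives are all other users except those listed
    in false_neg[u] (indices judged too similar to u)."""
    n = E_k.shape[0]
    sim = E_k @ E_K.T / tau                         # temperature-scaled scores
    total = 0.0
    for u in range(n):
        keep = np.ones(n, dtype=bool)
        keep[list(false_neg.get(u, []))] = False    # eliminate false negatives
        logits = sim[u][keep]                       # denominator terms (incl. u)
        total += np.log(np.exp(logits).sum()) - sim[u, u]
    return total / n

rng = np.random.default_rng(0)
E_k, E_K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
loss = inter_behavior_infonce(E_k, E_K, false_neg={0: [2]}, tau=0.5)
```

Removing an index from `keep` simply drops that user's term from the softmax denominator, which is the effect of eliminating a false negative.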
To be more specific, the similarity score of two users u, v in subgraph G_k (k ∈ {1, 2, ..., K}) is calculated with the swing algorithm, where α is a smoothing coefficient. The final similarity score s(u, v) is then the average of the per-subgraph scores s_k(u, v).
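The per-subgraph similarity might be sketched as below. This is our adaptation of the swing algorithm to user pairs under the smoothing coefficient alpha, offered as an assumption about the computation rather than the paper's exact formula.

```python
from collections import defaultdict

def swing_user_similarity(edges, alpha=0.5):
    """User-user swing-style similarity on one behavior subgraph.
    edges: iterable of (user, item) pairs; alpha: smoothing coefficient.
    Two users score higher when they co-interact with item pairs that
    few other users share."""
    user_items, item_users = defaultdict(set), defaultdict(set)
    for u, i in edges:
        user_items[u].add(i)
        item_users[i].add(u)

    def sim(u, v):
        common = user_items[u] & user_items[v]      # co-interacted items
        return sum(
            1.0 / (alpha + len(item_users[i] & item_users[j]))
            for i in common for j in common
        )
    return sim

edges = {(0, "a"), (0, "b"), (1, "a"), (1, "b"), (2, "a")}
sim = swing_user_similarity(edges, alpha=0.5)
```

In this toy graph, users 0 and 1 share two items while users 0 and 2 share only one, so `sim(0, 1)` exceeds `sim(0, 2)`.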
Then we define the false negatives for user u as the users with the top-N similarity scores, i.e., FN(u) = {v | s(u, v) ∈ top-N(s(u, ·))}. Accordingly, the inter-behavior contrastive loss between the target behavior K and an auxiliary behavior k, which eliminates these false negatives, follows the InfoNCE form, where τ is the temperature hyperparameter in the softmax and s(·) denotes the inner product of two vectors. When analogously combined with the contrastive loss on the item side, we obtain the inter-behavior contrastive loss between the target behavior K and auxiliary behavior k. Therefore, the ultimate inter-behavior contrastive loss is the sum over each pair of auxiliary behavior and target behavior.

3.2.2 Intra-behavior Self-supervised Learning. To alleviate the skewed data distribution across different behaviors, the inter-behavior self-supervised learning encourages the similarity of node representations under the target behavior and the auxiliary behaviors. However, given the higher proportion of noisy interactions under auxiliary behaviors, more noise would be implicitly transferred into the target behavior as well, making the learned representations dominated by auxiliary signals while losing the intrinsic semantics of the target behavior. Hence, we devise an intra-behavior self-supervised learning scheme to generate and contrast structurally-augmented views of the target behavior subgraph, in which way we consolidate and amplify the impact of supervision signals within the target behavior itself to counteract the negative transfer towards noise distributions in auxiliary behaviors. Specifically, we first generate two augmented views G_K^1, G_K^2 from the target behavior subgraph by performing the edge dropout introduced in [32]. We denote the target behavior subgraph as G_K = (V_K, E_K); the two augmented views are obtained by applying two random masking vectors M^1, M^2 ∈ {0, 1}^{|E_K|}, which control the kept edge sets with the same dropout ratio ρ.
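The edge-dropout view generation just described admits a direct sketch: each edge is kept independently with probability 1 − ρ, i.e., a Bernoulli mask over the edge set. Seeds and names here are illustrative.

```python
import random

def edge_dropout(edges, rho, seed=None):
    """Generate one augmented view of a behavior subgraph by keeping each
    edge independently with probability 1 - rho (Bernoulli edge mask)."""
    rng = random.Random(seed)
    return {e for e in edges if rng.random() >= rho}

# Toy target-behavior edge set and two independently masked views
E_K = {(0, 1), (0, 2), (1, 2), (1, 3)}
view1 = edge_dropout(E_K, rho=0.5, seed=1)
view2 = edge_dropout(E_K, rho=0.5, seed=2)
```

Running the mask twice with different seeds yields the two structurally different views that are subsequently contrasted.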
After encoding the two augmented views together with the auxiliary behavior subgraphs respectively, we obtain the node representations of the augmented views, denoted as e_u^1, e_u^2. Similar to Section 3.2.1, we then contrast views at the node scale with positive pairs {(e_u^1, e_u^2) | u ∈ U} and negative pairs {(e_u^1, e_v^2) | u, v ∈ U, u ≠ v}. The optimization objective on the user side is likewise calculated via the InfoNCE loss. Analogously combining the loss on the item side, we obtain the final objective function of the intra-behavior self-supervised learning.
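The intra-behavior InfoNCE objective on the two augmented views can be sketched as follows (user side only; the temperature value and inputs are illustrative assumptions):

```python
import numpy as np

def intra_behavior_infonce(E1, E2, tau=0.2):
    """InfoNCE over two augmented views E1, E2 of shape (n, d):
    matching rows are positives, all other rows are negatives."""
    sim = E1 @ E2.T / tau                      # pairwise inner products
    pos = np.diag(sim)                         # positive-pair scores
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float((logsumexp - pos).mean())     # mean InfoNCE loss

rng = np.random.default_rng(0)
E1 = rng.normal(size=(5, 8))
loss = intra_behavior_infonce(E1, E1.copy())   # identical views as a sanity check
```

Because the positive score appears inside each row's denominator, the loss is non-negative and shrinks as the two views of each node agree more strongly relative to other nodes.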

Adaptive Multi-behavior Optimization
In this subsection, we present our optimization objective and a novel adaptive optimization method for the multi-behavior setting.

3.3.1 Optimization Objective. For learning model parameters in an effective and stable way, we leverage a recently proposed non-sampling objective for recommendation [6], which has been proven superior to the traditional Bayesian Personalized Ranking (BPR) loss. For a specific batch of users B and the whole item set I, the non-sampling recommendation loss under behavior k is defined over the estimated probability x̂_{u,i,k} of user u interacting with item i under behavior k, where I_u^{k+} represents the interacted items of user u under behavior k and c^+, c^− are two weighting hyperparameters. The loss of the main supervised recommendation task is then the weighted sum of the recommendation losses under each behavior, with λ_k as the coefficient. In our framework, we aim to boost the recommendation performance through customized self-supervised learning tasks, so the ultimate learning objective is the combination of the main supervised task loss and the SSL task losses (Equation (12)), where μ_k (k = 1, 2, ..., K−1) and μ are hyperparameters controlling the strength of the corresponding SSL tasks.
For one batch of the training data, the computational cost of the recommendation loss is O((|B| + |I|)d² + |E|d), where |E| is the total number of positive interactions and d is the embedding size. Besides, all of the SSL task losses share the same time complexity, which equals O(|B|d(2 + |V|)). Therefore, the overall complexity of L is O((|B||V| + |E|)d), which is comparable to SOTA multi-behavior recommendation models (e.g., CML, GHCF).

3.3.2 Hybrid Manipulation on Gradients. Similar to Equation (12), where SSL tasks are viewed as auxiliary tasks to improve recommendation, existing SSL recommendation models [21, 31, 32, 38] jointly optimize the main supervised recommendation task along with the SSL task. However, they suffer from two limitations. On the one hand, they neglect the potential for a significant optimization imbalance across tasks, which can deteriorate the performance of the target task. This problem becomes particularly prominent when SSL tasks serve as auxiliary tasks, because manually designed SSL pretext tasks rarely coincide perfectly with the target task. Taking our MBSSL model as an example, Figures 2(a) and 2(b) highlight two examples from Beibei, which respectively demonstrate the imbalance of gradient direction and gradient magnitude between the SSL tasks and the target task.
On the other hand, tuning the weights of multiple auxiliary task losses (i.e., μ_k and μ in Equation (12)) is time-consuming, and fixed weights do not fit all batches throughout the dynamic training process. Towards this end, we develop an adaptive optimization method for SSL recommendation models with hybrid manipulation on both the magnitude and direction of gradients. In MBSSL, each SSL task loss is viewed as an auxiliary loss alongside the target recommendation loss. When the magnitude of an auxiliary gradient exceeds that of the target gradient, i.e., ||G_k|| > ||G_tar||, the optimizer prefers to approach the k-th auxiliary task rather than the target task, which leads to performance degradation. As a solution, we alter the direction and magnitude of any auxiliary gradient G_k that embraces a larger magnitude than the target gradient G_tar, so as to guide the optimization towards the target task. Auxiliary gradients with smaller magnitudes, even those with conflicting directions, are kept unaltered to prevent overfitting. To be more specific, we first modify the gradient direction by projecting the auxiliary gradient onto the normal plane of the target gradient if they conflict with each other, i.e., their cosine similarity is negative. Following [40], the projection strategy is formulated as:

G_k = G_k − (G_k · G_tar / ||G_tar||²) G_tar

Although the amount of destructive gradient interference is reduced by the direction modification, large gradient magnitudes of auxiliary tasks still hinder the optimization towards the target task. Therefore, we further balance the gradient magnitude based on [16, 22] with a relax factor γ that controls the magnitude proximity between G_k and G_tar. Through such a hybrid manipulation altering both the gradient direction and magnitude of auxiliary tasks, the learning process can readily optimize towards the target task, thus boosting the recommendation performance on the target behavior. The complete update procedure of the hybrid manipulation is shown in Algorithm 1.
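A compact sketch of this hybrid manipulation on one auxiliary gradient is given below. The projection step follows the PCGrad-style rule above; the magnitude step is a simplified cap at γ·||G_tar|| standing in for the paper's balancing rule, so treat the exact rescaling as an assumption.

```python
import numpy as np

def hybrid_manipulate(g_tar, g_aux, gamma=1.0):
    """Alter an auxiliary gradient only if it is larger than the target one:
    project it off a conflicting target direction (PCGrad-style), then cap
    its magnitude at gamma * ||g_tar||. Smaller gradients pass unchanged."""
    n_t = np.linalg.norm(g_tar)
    if np.linalg.norm(g_aux) <= n_t:
        return np.array(g_aux, dtype=float)       # keep small gradients intact
    g = np.array(g_aux, dtype=float)
    if g @ g_tar < 0:                             # conflicting direction
        g = g - (g @ g_tar) / (g_tar @ g_tar) * g_tar
    n_a = np.linalg.norm(g)
    if n_a > gamma * n_t and n_a > 0:             # shrink overly large gradient
        g = g * (gamma * n_t / n_a)
    return g

g_tar = np.array([1.0, 0.0])
g_new = hybrid_manipulate(g_tar, np.array([-3.0, 4.0]))   # conflicting and large
```

After manipulation the auxiliary gradient no longer opposes the target direction and its magnitude no longer dominates, so the shared parameters move towards the target task.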

Model Analysis
In this section, we compare the proposed SSL paradigm with those of existing representative works on multi-behavior recommendation.
Algorithm 1: The Hybrid Manipulation on Gradients.
Input: Initial shared parameters θ_0; relax factor γ; learning rate η.

Comparison with CML. Likewise, the cross-behavior SSL in CML is performed between each auxiliary and target behavior pair to capture cross-type behavior dependency. Specifically, its SSL paradigm follows the conventional rule, i.e., the views of any two different users are regarded as negative pairs. However, given the rich semantics and huge data volume of behaviors, users may well share similar preferences, which means the common practice can produce many false negative pairs. Therefore, in our inter-behavior SSL, we selectively construct negative pairs based on the calculated structural node similarities to facilitate the knowledge transfer between auxiliary and target behaviors.

Comparison with S-MBRec. The star-style SSL in S-MBRec constructs additional positive samples by finding similar users based on the data under the target behavior. However, that data is so sparse that the calculated node similarities are unreliable. What's worse, the negative transfer is further amplified under its SSL paradigm, which encourages alignment between unreliable positive samples. Accordingly, we make full use of the data under all the behaviors to select potentially similar users with high confidence, and refrain from augmenting positive samples for the sake of robustness against interaction noises [7, 17].
All of the existing works solely rely on inter-behavior SSL to handle the data sparsity issue, which is not sufficient: inter-behavior SSL is likely to introduce noise from auxiliary behaviors. As a solution, we conduct intra-behavior SSL within the target behavior itself, aiming to counteract the auxiliary noise by amplifying the impact of the target behavior.

EXPERIMENTS
In this section, we conduct extensive experiments on five real-world datasets to evaluate our proposed model. In particular, we aim to answer the following research questions:
• RQ1: How effective is MBSSL in multi-behavior recommendation scenarios compared to existing methods?
• RQ2: How do the sub-modules of MBSSL affect the recommendation performance?
• RQ3: How robust is MBSSL to data sparsity under the target behavior and to noisy interactions from auxiliary behaviors?
• RQ4: How does the proposed hybrid gradient manipulation compare with alternative strategies?
4.1.1 Datasets. Table 1 summarizes the statistics of all the datasets.
• Beibei. This dataset is collected from Beibei, the largest infant product e-commerce platform in China. It involves three types of interaction behavior, i.e., page view, cart, and purchase, among which purchase is the target behavior.
• Taobao. This is an open dataset obtained from the largest e-commerce site Taobao, which contains the same interaction types as Beibei. We directly use the processed dataset in GHCF [4].
• Tmall. This is another processed dataset on Taobao provided by CML [31], which contains one additional behavior, favorite.
• IJCAI-Contest. This dataset is provided in the IJCAI15 Challenge. It is collected from a business-to-customer retail system and shares the same behavior types as the Tmall data.
• Videos. This dataset is collected from an online short-video platform. There are four types of interaction behavior between users and videos, i.e., click, like, comment, and download. In this dataset, we regard download as the target behavior.
4.1.2 Baselines. We compare our method with the following state-of-the-art methods of three types: Single-Behavior, Self-supervised Learning, and Multi-Behavior recommendation models. For Single-Behavior and Self-supervised Learning methods, we normally utilize the target behavior data in the same way as CML [31] or GHCF [4] to build the model. However, in order to eliminate the performance gap resulting from different volumes of training data, we additionally conduct experiments on these models by treating all the behaviors as the same type, which leads to stronger baselines.
Single-Behavior Recommendation Models.
• NGCF [29]: It is a neural collaborative filtering method utilizing GNNs, with an aim to capture high-order connections.
• LightGCN [13]: It simplifies the GCN structure to improve training efficiency and generalization ability for recommendation.
Self-supervised Learning Recommendation Models.
• SGL [32]: This method explores SSL on graph structure and accordingly devises three unified augmentation operators, including node dropout, edge dropout, and random walk.
• SimGCL [39]: It is a simple yet effective graph-augmentation-free contrastive learning method that can regulate the uniformity in a smooth way.
Multi-Behavior Recommendation Models.
• MATN [34]: It proposes an attention-based transformer encoder to help preserve cross-type behavior collaborative signals and type-specific behavior contextual information.

• MBGMN [36]: It enhances multi-behavior modeling with a Graph Meta Network, which incorporates the meta-learning paradigm.
• CML [31]: It designs a contrastive learning framework for multi-behavior recommendation and further utilizes meta learning to learn customized weights for each user.
• EHCF [6]: It conducts knowledge transfer among behaviors and proposes a novel non-sampling objective for multi-behavior recommendation.
• S-MBRec [10]: It is another SSL-based model, which considers the discrepancies and commonalities of multiple behaviors.
• GHCF [4]: It is an improvement over EHCF which relies on GNNs to model the complex high-hop user-item correlations.
4.1.3 Evaluation Methodology. For a fair comparison with various models on recommendation, we adopt the widely-used leave-one-out evaluation and two ranking metrics, Recall@K and NDCG@K. Note that we utilize the same evaluator as EHCF [6] and GHCF [4], i.e., we rank all the items except the positive ones for each user, which is more persuasive than randomly sampling a subset of non-interacted items for each user (e.g., in MBGMN [36] or CML [31], pairing each positive item instance with 99 randomly sampled items).
4.1.4 Parameter Settings. We implement our MBSSL model with PyTorch, and the model is optimized using the Adam optimizer with learning rate 0.001 during the training phase. We set α to 0.5 for the swing algorithm. The batch size is chosen from {256, 512, 1024}. By default, the size of the latent dimension is set to 64 and the number of propagation layers is 4. In addition, the negative weight of the non-sampling loss is chosen from {0.1, 0.01} for all the datasets. To prevent overfitting, we set the embedding dropout ratio to 0.3 and the edge dropout ratio ρ to 0.5.
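The leave-one-out evaluation above can be sketched per user as follows; `ranked_items` and `held_out` are illustrative names for the model's full ranking and the single held-out test item.

```python
import math

def recall_ndcg_at_k(ranked_items, held_out, k):
    """Leave-one-out metrics for one user: ranked_items is the model's
    ranking over all candidate items (all items except training positives)
    plus the single held-out item."""
    topk = ranked_items[:k]
    if held_out not in topk:
        return 0.0, 0.0                 # missed: both metrics are zero
    rank = topk.index(held_out)         # 0-based position within the top-k
    return 1.0, 1.0 / math.log2(rank + 2)

recall, ndcg = recall_ndcg_at_k([5, 3, 9, 1], held_out=9, k=3)
```

With a single held-out item per user, Recall@K is a hit indicator and NDCG@K reduces to a position-discounted hit, averaged over all test users.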

Performance Comparison (RQ1)
In the evaluation, we first perform experiments on the five datasets, where we set the recommendation list length to 10 and 50. From Table 2, we summarize the following observations:
• Our model MBSSL generally outperforms all baselines on all datasets, with improvements over the best baseline ranging from 18.64% to 19.71% in terms of Recall@10 and NDCG@10. The significant improvements can be attributed to two main reasons: i) through inter-behavior and intra-behavior SSL, our model effectively addresses the data sparsity under the target behavior and the noisy interactions from auxiliary behaviors, which are two key problems of multi-behavior recommendation; ii) the proposed hybrid gradient manipulation method empowers the capability of balancing the optimization between the SSL tasks and the main supervised recommendation task, which further preserves the recommendation performance.
• Interestingly, we find that some multi-behavior models (e.g., MATN, MBGMN) may perform worse than some single-behavior models which treat all the behaviors as the same type (e.g., LightGCN). The poorer performance of MBGMN and MATN may suggest their disadvantages in differentiating behavior types. In contrast, methods like EHCF, GHCF, and MBSSL are generally superior to single-behavior models, which highlights the significance of behavior modeling.
Table 2: The performance comparison on five datasets. Note that baselines with the "all" suffix use data from all the behaviors to build the single-behavior model. The best results are illustrated in bold and the runner-up is underlined. A number with a star (*) indicates that the result is statistically significant with p < 0.05 under a t-test compared to other baselines.

Ablation Study (RQ2)
In this subsection, we evaluate the rationality of each designed module in our model with four variants:
• w/o ATT: The cross-behavior dependency modeling (self-attention) is removed from the behavior-aware graph neural network.
• w/o SSL-inter: We do not perform inter-behavior SSL in the multi-behavior self-supervised learning part.
• w/o SSL-intra: We remove intra-behavior SSL in the multi-behavior self-supervised learning part.
• w/o HMG: We disable the hybrid manipulation on gradients, i.e., we do not tackle the optimization imbalance; instead, we assign equal weights to each auxiliary loss.
The ablation study results are shown in Table 3. Note that due to space limitations, we only show the results of Recall@10 and NDCG@10 on two datasets, Beibei and Taobao; the observations on the other datasets are similar. From Table 3, we can find:
• The cross-behavior dependency modeling plays a vital role in performance, which indicates that the self-attention mechanism has a strong capability of capturing the implicit pair-wise dependencies across behaviors.
• Removing either part of the multi-behavior self-supervised learning undermines the performance. Specifically, the performance gap between MBSSL and w/o SSL-inter demonstrates the effectiveness of the inter-behavior SSL in narrowing the gap of skewed representations and alleviating the data sparsity of the target behavior. In addition, the intra-behavior SSL contributes further to the performance, indicating the necessity of addressing the noisy interactions from auxiliary behaviors.
• When we replace the HMG by assigning equal weights to all the auxiliary losses, the performance declines sharply, which verifies the existence of the optimization imbalance. This also suggests that during the multi-behavior learning process, HMG adaptively rectifies the gradients to balance the auxiliary tasks, which improves the target task's performance significantly.

Robustness Analysis (RQ3)
4.4.1 Robustness to data sparsity. In recommender systems, recommendation for inactive users with few available interactions is quite challenging, so we aim to illustrate the effectiveness of our model in alleviating data sparsity. Specifically, we split the test users into five groups based on sparsity degree, i.e., the number of interactions under the target behavior (i.e., purchase); then we compute the average NDCG@50 on each group of users. Due to space limitations, we present the comparison results with representative baselines on the two public datasets Beibei and Taobao; the results on the other datasets are omitted here. In Figure 3, the x-axis denotes the data sparsity degrees, the left y-axis displays the number of test users in the corresponding group (bars), and the right y-axis shows the averaged metric value (lines).
Based on the results, we have the following observations: i) Our model generally obtains better performance than the other SOTA methods on these two datasets. On the Beibei dataset, despite a slight inferiority to GHCF and CML on part of the active users, our model manifests a good capability of recommendation for inactive users, who occupy a considerable portion of the user population. ii) The performance encounters a slight descent when the number of interactions increases, which may be caused by different amounts of auxiliary data. For example, on the Taobao dataset, the number of auxiliary behavioral records for users with more than 12 purchase records is much smaller than for users with fewer than 9 purchase records.
4.4.2 Robustness to noisy interactions. As mentioned above, the inter-behavior SSL would inevitably introduce auxiliary noises into the representations under the target behavior, which is why we propose the intra-behavior SSL. To study the capability of the intra-behavior SSL to relieve the impact of noises, we artificially add a certain proportion of noisy interactions (i.e., 10%, 20%, 30%) into the auxiliary data and then compare the performance decline of MBSSL with that of several representative SOTA baselines (i.e., CML, S-MBRec, GHCF). Figure 4 shows the results on Beibei and Taobao.
We can observe that: i) Adding noise to the auxiliary data reduces the performance of all methods. Generally, CML is more sensitive to interaction noise, while MBSSL presents the least performance degradation. Moreover, the degradation gap between MBSSL and each baseline becomes more apparent as the noise ratio increases. This suggests that MBSSL is able to identify informative graph patterns and to relieve the over-reliance on certain interactions. ii) The performance decrease of GHCF is lower than that of CML and S-MBRec in most cases, which indicates that the SSL paradigm in CML and S-MBRec amplifies the negative impact of noise. In contrast, the intra-behavior SSL in MBSSL counteracts the noise by consolidating the target self-supervised signals, which consequently makes MBSSL obtain a more stable performance across different noise ratios.
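The noise-injection protocol described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions (random fake user-item pairs added to the auxiliary behavior until the target ratio is reached); `inject_noise` is a hypothetical helper, not the authors' implementation.

```python
import random

def inject_noise(aux_interactions, all_users, all_items, ratio, seed=0):
    """Append ratio * |aux| random fake (user, item) pairs to the
    auxiliary-behavior data, skipping pairs that already exist."""
    rng = random.Random(seed)
    existing = set(aux_interactions)
    n_noise = int(len(aux_interactions) * ratio)
    noisy = []
    while len(noisy) < n_noise:
        pair = (rng.choice(all_users), rng.choice(all_items))
        if pair not in existing:  # only genuinely spurious interactions
            existing.add(pair)
            noisy.append(pair)
    return aux_interactions + noisy
```

Running the same training pipeline on the corrupted auxiliary data at ratios 0.1, 0.2 and 0.3, then comparing the relative drop in NDCG, yields the decline percentages plotted in Figure 4.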

Impact of Hybrid Strategies (RQ4)
Given the heuristic observation that there exists an obvious discrepancy between the auxiliary tasks and the target task in both gradient direction and magnitude, we propose to manipulate the gradients of the auxiliary tasks along these two dimensions to balance the optimization. Besides, to equip our model with good generalization ability, we only alter auxiliary gradients whose magnitudes are larger than that of the target gradient, keeping auxiliary gradients with small magnitudes unaltered. In this subsection, to verify the rationality of the gradient manipulation method proposed in Section 3.3.2, we compare three different strategies:
• Strategy A: In each iteration, we only alter the direction of auxiliary gradients following [40]. We manipulate the directions of all eligible gradients regardless of their magnitudes.
• Strategy B: In each iteration, we only alter the magnitude of auxiliary gradients following [16], i.e., we reduce the magnitude of an auxiliary gradient if it is larger than that of the target gradient, and vice versa.
• Strategy C: In each iteration, we first alter the direction using Strategy A and then alter the magnitude based on Strategy B. Similarly, we conduct the manipulation on all gradients regardless of their magnitudes.
We show the comparison results on Beibei and Taobao in Table 4; the results on the other datasets are similar. From the table, we find that: i) Solely utilizing direction-based or magnitude-based methods obtains suboptimal performance, which corresponds to our observation that the SSL task yields a great disparity with the target task in both direction and magnitude. ii) The performance gap between Strategy C and our model suggests that deliberately keeping the conflict between the target gradient and small-magnitude auxiliary gradients improves generalization ability and eases the overfitting issue.
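A minimal sketch of the hybrid manipulation may help fix ideas. The exact projection and scaling rules follow the paper's Section 3.3.2, which we do not reproduce here; this version assumes a PCGrad-style conflict projection for the direction step and a simple norm clip (controlled by a relax factor) for the magnitude step, and it leaves small auxiliary gradients untouched as described above.

```python
import numpy as np

def manipulate(g_aux, g_tar, relax=1.0):
    """Hybrid gradient manipulation (sketch): only auxiliary gradients
    larger than the target gradient are altered, first in direction,
    then in magnitude."""
    g_aux = np.asarray(g_aux, dtype=float)
    g_tar = np.asarray(g_tar, dtype=float)
    if np.linalg.norm(g_aux) < np.linalg.norm(g_tar):
        return g_aux  # keep small conflicting gradients for generalization
    if g_aux @ g_tar < 0:  # direction step: drop the conflicting component
        g_aux = g_aux - (g_aux @ g_tar) / (g_tar @ g_tar) * g_tar
    n_aux, n_tar = np.linalg.norm(g_aux), np.linalg.norm(g_tar)
    if n_aux > relax * n_tar and n_aux > 0:  # magnitude step: clip the norm
        g_aux = g_aux * (relax * n_tar / n_aux)
    return g_aux
```

Strategy A corresponds to applying only the direction step to all gradients, Strategy B to applying only the magnitude step, and Strategy C to applying both without the small-magnitude exemption.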

Hyperparameter Study (RQ5)
In this subsection, we conduct extensive experiments to examine the effects of several key hyperparameters: the number of propagation layers, the temperature, the relax factor, and the number of eliminated false negatives. Figure 5 shows the results.
4.6.1 Effect of Propagation Layer Numbers. From Figure 5, we observe that more embedding propagation layers yield better performance due to the strengthened capability of capturing high-hop signals. However, when the number of layers exceeds 4, the performance suffers a sharp decline, which results from the over-smoothing effect.
4.6.2 Effect of Temperature. The temperature is tuned carefully in {0.1, 0.2, 0.4, 0.8}. According to the curves shown in Figures 5(b) and 5(e), the best selection of the temperature varies by dataset. However, a value that is either too small (e.g., 0.1) or too large (e.g., 0.8) is not appropriate: a large temperature undermines the ability to distinguish between negative samples, whereas a too-small value excessively exaggerates the effects of some negative samples while making others useless.
4.6.3 Effect of Relax Factor. Since the relax factor controls the magnitude proximity between auxiliary gradients and the target gradient, its selection varies according to the task. A big relax factor is appropriate on Videos, Beibei prefers a smaller one, and the other datasets perform best with a moderate value.
4.6.4 Effect of Eliminated False Negatives Numbers. Based on the average interaction numbers on each dataset, we determine the number of eliminated false negatives on the user side and the item side from {10, 50, 100, 500} and {5, 10, 50, 100}, respectively. Figures 5(g) and 5(h) show the results on Beibei and Taobao, where a darker color means better performance. We conclude that either a too-small or a too-big value degrades the performance, which proves the necessity of sufficient negative samples and the effectiveness of the selective inter-behavior SSL.
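The role of the temperature in Section 4.6.2 can be illustrated with a generic InfoNCE objective; this is a sketch of the standard contrastive loss, not necessarily the exact loss used by MBSSL, and `infonce_loss` is a hypothetical helper.

```python
import numpy as np

def infonce_loss(sim_pos, sim_negs, tau):
    """InfoNCE for one anchor: sim_pos is the positive similarity,
    sim_negs the negative similarities, tau the temperature."""
    logits = np.array([sim_pos] + list(sim_negs)) / tau
    logits -= logits.max()  # numerical stability before exponentiating
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

Dividing the similarities by a small tau sharpens the softmax, so a few hard negatives dominate the gradient; a large tau flattens it, weakening the ability to distinguish negatives, which matches the U-shaped curves in Figures 5(b) and 5(e).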

RELATED WORK
Graph-based Recommendation
In terms of graph-based models for recommendation, recent years have witnessed the effectiveness of Graph Neural Networks (GNNs) due to their powerful capability of modeling high-order connectivity. For example, NGCF [29] is a collaborative architecture that conditions message propagation on both the graph structure and the affinity with the central node. LightGCN [13] distills a more concise and accurate GCN model for recommendation by omitting two burdensome operations, i.e., feature transformation and nonlinear activation. On top of methods focusing on model design, another line of work is dedicated to enriching the user-item bipartite graph by fusing various side information, ranging from social influences [5,33] and item-item relatedness [27,28,30,41] to user and item attributes [2,19].

Multi-behavior Recommendation
Recent studies have attempted to explore behavior multiplicity via various deep learning techniques. NMTR [9] extends the neural collaborative filtering (NCF [14]) framework to multi-behavior settings and performs a joint optimization over cascading prediction tasks. To avoid the sampling bias issue, EHCF [6] efficiently correlates the predictions of each user behavior in a transfer manner without negative sampling. Later on, researchers took advantage of GNNs to explore high-hop user-item interactions: MBGCN [18] and GHCF [4] learn discriminative behavior representations using GCNs, while MBGMN [36] utilizes a graph meta network to capture interaction diversity and behavior heterogeneity. To alleviate the data sparsity of the target behavior, CML [31] and S-MBRec [10] incorporate self-supervised learning into multi-behavior recommendation.

Self-supervised Learning on Graphs
Inspired by advances in the CV and NLP domains, self-supervised learning (SSL) on graphs has recently been explored. InfoGraph [25] and DGI [26] learn node representations by maximizing the mutual information between a node and its local structure, while GRACE [42], GCA [43] and GraphCL [12] conduct node-level same-scale contrast. In the recommendation scenario, recent studies have adopted SSL to achieve better recommendation performance. Yao et al. [38] propose a two-tower DNN architecture with uniform feature masking and dropout for self-supervised item recommendation. Wu et al. [32] further devise a unified SSL framework with three augmentation operators for graph-based recommendation.

CONCLUSION
In this work, we develop a novel self-supervised learning framework with an adaptive optimization method for enhancing multi-behavior recommendation. Our framework effectively captures behavior semantics and correlations via a graph neural network incorporating the self-attention mechanism. To alleviate the data sparsity and noisy interaction issues, we contrast nodes via inter-behavior SSL and intra-behavior SSL, respectively. In addition, we take the initial step to study the optimization imbalance between the SSL task and the recommendation task, and accordingly design a hybrid gradient manipulation method, which has proved effective on five real-world datasets. In the future, we plan to design a more elaborate SSL paradigm that fully captures the structural information in the various interaction data.
$$\mathbf{e}_{u}^{(l+1),k} = \mathrm{LeakyReLU}\Big(\mathbf{W}^{(l)} \cdot \mathrm{mean}\big(\{\mathbf{e}_{v}^{(l),k} \odot \mathbf{e}_{k}^{(l)} : v \in \mathcal{N}_{u,k}\}\big)\Big), \quad (1)$$
where $\mathbf{e}_{u}^{(l+1),k}$ denotes the embedding of node $u$ under behavior $k$ in the $(l+1)$-th propagation layer; $\mathcal{N}_{u,k}$ is the set of immediate neighbors of $u$ under behavior type $k$; $\mathbf{W}^{(l)}$ is layer-specific and $\odot$ denotes the element-wise product of two vectors; $\mathbf{e}_{k}^{(l)}$ represents the embedding of behavior $k$ in the $l$-th layer, which is updated by multiplying another layer-specific parameter $\mathbf{W}_{k}$:
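One behavior-aware propagation layer from Eq. (1) can be sketched as follows; `propagate` is an illustrative helper operating on dense NumPy arrays, assuming the neighbor sets and layer weight are given, and is not the authors' implementation.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def propagate(node_emb, behavior_emb, neighbors, W):
    """One layer of Eq. (1): node_emb[v] is e_v^{(l),k}, behavior_emb
    is e_k^{(l)}, neighbors maps u to N_{u,k}, W is the layer weight."""
    out = {}
    for u, nbrs in neighbors.items():
        # element-wise product of each neighbor embedding with the
        # behavior embedding, then mean-pool over the neighborhood
        msgs = np.stack([node_emb[v] * behavior_emb for v in nbrs])
        out[u] = leaky_relu(W @ msgs.mean(axis=0))
    return out
```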

Figure 2: The illustration of large differences in gradient direction and magnitude between auxiliary tasks and the target task.

Figure 4: Performance w.r.t. noise ratio. The bar denotes the decrease percentage of the performance reported in Table 2.

Figure 5: Hyperparameter study of MBSSL. The estimated performance decrease/increase percentage is reported w.r.t. a randomly selected datum point.
The combination of the inter-behavior SSL losses and the intra-behavior SSL loss can be viewed as an auxiliary task loss denoted as $\mathcal{L}_{aux}$, while the main supervised task loss can be rewritten as the target task loss $\mathcal{L}_{tar}$. Let $\theta_s$ denote the set of bottom shared parameters and $t$ denote the $t$-th training iteration within one epoch; $\mathbf{G}_{tar}^{t}$ and $\mathbf{G}_{aux}^{t}$ respectively denote the gradients of the target task and the auxiliary task with respect to $\theta_s$, i.e., $\mathbf{G}_{tar}^{t} = \nabla_{\theta_s} \mathcal{L}_{tar}$ and $\mathbf{G}_{aux}^{t} = \nabla_{\theta_s} \mathcal{L}_{aux}$. Hence, our goal is to balance $\mathbf{G}_{aux}^{t}$ and $\mathbf{G}_{tar}^{t}$ during each iteration by adaptively modifying the direction and magnitude of $\mathbf{G}_{aux}^{t}$. Intuitively, the gradient with the larger magnitude will dominate the optimization trend, so in the case where $\|\mathbf{G}_{aux}^{t}\| \geq \|\mathbf{G}_{tar}^{t}\|$

Table 1: The statistics of the datasets.

Table 3: The experimental results of the ablation study.

Table 4: The comparison results of different strategies.