MultiCBR: Multi-view Contrastive Learning for Bundle Recommendation

Bundle recommendation seeks to recommend a bundle of related items to users to improve both the user experience and the profits of the platform. Existing bundle recommendation models have progressed from capturing only user-bundle interactions to modeling the multiple relations among users, bundles, and items. CrossCBR, in particular, incorporates cross-view contrastive learning into a two-view preference learning framework, significantly improving SOTA performance. It does, however, have two limitations: (1) the two-view formulation does not fully exploit all the heterogeneous relations among users, bundles, and items; and (2) the "early contrast and late fusion" framework is less effective in capturing user preference and difficult to generalize to multiple views. In this article, we present MultiCBR, a novel Multi-view Contrastive learning framework for Bundle Recommendation. First, we devise a multi-view representation learning framework capable of capturing all the user-bundle, user-item, and bundle-item relations, in particular making better use of the bundle-item affiliations to enhance sparse bundles' representations. Second, we innovatively adopt an "early fusion and late contrast" design that first fuses the multi-view representations before performing self-supervised contrastive learning. In comparison to existing approaches, our framework reverses the order of fusion and contrast, introducing the following advantages: (1) our framework is capable of modeling both cross-view and ego-view preferences, allowing us to achieve enhanced user preference modeling; and (2) instead of requiring a quadratic number of cross-view contrastive losses, we only require two self-supervised contrastive losses, resulting in minimal extra costs. Experimental results on three public datasets indicate that our method outperforms SOTA methods. The code and datasets are available in the GitHub repository https://github.com/HappyPointer/MultiCBR.


INTRODUCTION
As a popular marketing strategy, item bundling has been gradually adopted by various online services, such as composing a set of compatible fashion items into an outfit or creating a music playlist from songs of similar styles. Each individual item gains more opportunities to be exposed in diverse bundles as a result of item bundling. Consequently, indecisive consumers may be persuaded to purchase a specific item due to the justification provided by the innovative bundles. Because of these benefits, bundle recommendation has recently been proposed and has received a lot of attention.
Early studies on bundle recommendation [34] simplify it as a special type of user-item recommendation and directly adopt Collaborative Filtering (CF)-based methods to capture the user-bundle interaction patterns. However, such simplification ignores the fact that a bundle is not an atomic unit but rather a collection of fine-grained elements, i.e., individual items. Follow-up works [4,5,8,12,28,52] take into account the user-item interaction and bundle-item affiliation information in order to take advantage of the valuable information carried by the items. Factorization models [4,8], multi-task learning [4,8,12], and Graph Neural Networks (GNNs) [5,12,28,52] are typical techniques and have been shown to be effective in dealing with the multiple relations among users, bundles, and items. BGCN [5], in particular, sorts user preference over bundles into two views: the bundle view (representing preference for the entire bundle) and the item view (representing preference for the items within bundles), which is insightful in guiding advanced model design. By explicitly modeling the cooperative association between the two views with cross-view contrastive learning, CrossCBR [28] further improves the performance. Despite the impressive performance improvement, there are still a few limitations.
First, the formulation of two-view preference does not fully exploit the multiple heterogeneous relations among users, items, and bundles. As shown in Figure 1, CrossCBR performs bi-directional information propagation over the user-bundle (UB) and user-item (UI) graphs, while applying uni-directional aggregation on the bundle-item (BI) graph to gather information from items to their affiliated bundles. However, we argue that this uni-directional aggregation is insufficient to make full use of the BI graph, e.g., the crucial bundle composition patterns. More importantly, for items and bundles with fewer connections, the collaborative effects in the BI graph can significantly enhance these sparse nodes' representations. Consequently, fully exploiting the BI graph is an indispensable part of bundle recommendation (see the evidence in Section 3.4).
The second, and perhaps more important, limitation is the "early contrast and late fusion" design of CrossCBR, which neglects the cross-view user preference and restricts the method from generalizing to more complicated scenarios with multiple views (e.g., auxiliary information such as content or other relational data). As shown in Figure 1, CrossCBR just models the user preference within each view, i.e., the ego-view preferences of the bundle and item views, while overlooking the potential cross-view preference. In addition, it requires constructing contrastive losses between every pair of views. Hence, when the number of views increases, the number of contrastive loss terms grows quadratically. As a result, in terms of either effectiveness or efficiency, it is challenging for the model to capture the cooperative effect of multiple views, which would be beneficial to the performance but cannot be learned from isolated individual views.

To address the above limitations, we propose a novel Multi-view Contrastive learning framework for Bundle Recommendation, abbreviated as MultiCBR. First, we reorganize the multiple heterogeneous relations into three views: the user-bundle interaction view, the user-item interaction view, and the bundle-item affiliation view, each of which focuses on one of the UB, UI, and BI relations, as shown in Figure 1. This multi-view formulation ensures that all three graphs are fully exploited via bi-directional graph propagation, especially the BI graph, which is sub-optimally modeled by previous works. At the same time, the three-view decomposition can better differentiate each type of relation. Second, to leverage powerful contrastive learning within the multi-view framework, we adopt the converse strategy to CrossCBR, i.e., "early fusion and late contrast". As shown in Figure 1, we first fuse the multi-view representations into a unified one, and then apply self-supervised contrastive learning. Such a simple revision brings in two major advantages: 1) "early fusion" inherently captures user preference from both the ego-view and the cross-view (more details are presented in Section 2.3); and 2) "early fusion and late contrast" eliminates the quadratic number of cross-view contrastive losses and requires only two contrastive losses. This enables the incorporation of multiple views with few extra computational costs or optimization difficulties.
To evaluate MultiCBR, we conduct extensive experiments on three public benchmark datasets, i.e., Youshu, NetEase, and iFashion. Overall, MultiCBR outperforms SOTA methods (e.g., CrossCBR [28]) on all three datasets. In particular, on iFashion we achieve over 20% relative improvements. Further ablation and model studies verify the effectiveness of our model as well as our hypotheses. The key contributions of this work are summarized as follows:
• To the best of our knowledge, we are the first to propose a multi-view contrastive learning framework for bundle recommendation that fully exploits all the relations among users, bundles, and items, especially addressing the BI sparsity problem by introducing the BI view.
• We propose the strategy of "early fusion and late contrast" to better capture user preference from both the ego-view and the cross-view, while introducing few extra expenses.
• Extensive experiments on three benchmark datasets indicate that our method outperforms SOTA baselines.

METHODOLOGY
In this section, we present the overall framework of our proposed method MultiCBR, as shown in Figure 2. In particular, we first give a brief introduction of the problem formulation, followed by the three key modules of MultiCBR: 1) multi-view representation learning, 2) multi-view fusion and prediction, and 3) joint optimization.

Problem Formulation
The problem of bundle recommendation is formulated as follows. Given a set of users $\mathcal{U}$, a set of bundles $\mathcal{B}$, and a set of items $\mathcal{I}$, where each bundle $b \in \mathcal{B}$ is composed of a set of items from $\mathcal{I}$, the goal is to recommend bundles to users. Even though various information can be used to train a bundle recommender system, in this paper we are particularly interested in the user-bundle interactions $\mathbf{X}_{M \times N} = \{x_{ub} \mid u \in \mathcal{U}, b \in \mathcal{B}\}$, the user-item interactions $\mathbf{Y}_{M \times O} = \{y_{ui} \mid u \in \mathcal{U}, i \in \mathcal{I}\}$, and the bundle-item affiliation information $\mathbf{Z}_{N \times O} = \{z_{bi} \mid b \in \mathcal{B}, i \in \mathcal{I}\}$, where $M$, $N$, and $O$ are the numbers of users, bundles, and items, respectively. Given the three binary-valued matrices $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$, where $\mathbf{X}$ and $\mathbf{Y}$ are the historical interaction data, the target is to predict unseen user-bundle interactions.

Multi-view Representation Learning
In this work, we devise a multi-view representation learning framework that consists of three views: the user-bundle interaction view, the user-item interaction view, and the bundle-item affiliation view, each of which specifically concentrates on one type of relation and learns view-tailored user and bundle representations. We learn user and bundle representations separately from the three views, instead of blending the relations together (e.g., building a tripartite graph [12]), since mixed relations disturb each other and result in poor overall performance [28]. The representation learning of the three views is as follows.

User-Bundle Interaction View.
In this view, we aim to learn user and bundle representations that capture user preference based on the user-bundle CF signals. Based on the user-bundle interaction matrix $\mathbf{X}$, we construct a user-bundle bipartite graph (UB graph) and utilize the SOTA GNN-based CF model LightGCN [22] to learn the representations of users and bundles. In particular, the graph propagation is depicted as:

$$e_u^{UB,(k)} = \sum_{b \in \mathcal{N}_u^{UB}} \frac{1}{\sqrt{|\mathcal{N}_u^{UB}|}\sqrt{|\mathcal{N}_b^{UB}|}}\, e_b^{UB,(k-1)}, \qquad e_b^{UB,(k)} = \sum_{u \in \mathcal{N}_b^{UB}} \frac{1}{\sqrt{|\mathcal{N}_b^{UB}|}\sqrt{|\mathcal{N}_u^{UB}|}}\, e_u^{UB,(k-1)},$$

where $e_u^{UB,(k)}, e_b^{UB,(k)} \in \mathbb{R}^d$ are the $k$-th layer's embeddings for user $u$ and bundle $b$; $d$ is the embedding dimension; and $\mathcal{N}_u^{UB}$ and $\mathcal{N}_b^{UB}$ are the neighbors of user $u$ and bundle $b$ in the UB graph. The superscript $UB$ denotes that the representations belong to the user-bundle interaction view. After obtaining the embeddings from different layers, we pool them into unified bundle-level representations:

$$e_u^{UB} = \frac{1}{K+1}\sum_{k=0}^{K} e_u^{UB,(k)}, \qquad e_b^{UB} = \frac{1}{K+1}\sum_{k=0}^{K} e_b^{UB,(k)}.$$

User-Item Interaction View.
The user-item interaction view targets modeling the user-item CF signals and generating the corresponding user and bundle representations. In particular, we first employ a graph learning module over the user-item interaction graph to obtain the user and item representations. This not only captures the user preference over individual items but also yields the item embeddings that serve to represent the bundle at a fine-grained level. To derive bundle representations in this view, we perform a uni-directional aggregation from the items to their corresponding bundles, guided by the bundle-item affiliation graph.
Based on the user-item interaction matrix $\mathbf{Y}$, we build a bipartite UI graph. Analogous to the bundle-level representation learning, a LightGCN is used to model the user-item CF signals. The graph propagation is as follows:

$$e_u^{UI,(k)} = \sum_{i \in \mathcal{N}_u^{UI}} \frac{1}{\sqrt{|\mathcal{N}_u^{UI}|}\sqrt{|\mathcal{N}_i^{UI}|}}\, e_i^{UI,(k-1)}, \qquad e_i^{UI,(k)} = \sum_{u \in \mathcal{N}_i^{UI}} \frac{1}{\sqrt{|\mathcal{N}_i^{UI}|}\sqrt{|\mathcal{N}_u^{UI}|}}\, e_u^{UI,(k-1)},$$

where $e_u^{UI,(k)}, e_i^{UI,(k)} \in \mathbb{R}^d$ are the $k$-th layer's embeddings for user $u$ and item $i$ in the UI graph; the layer-pooled representations $e_u^{UI}$ and $e_i^{UI}$ are obtained in the same way as in the UB view. We then aggregate the item representations to derive the bundle representation:

$$e_b^{UI} = \frac{1}{|\mathcal{N}_b^{BI}|}\sum_{i \in \mathcal{N}_b^{BI}} e_i^{UI},$$

where $\mathcal{N}_b^{BI}$ is the set of first-order neighbors of bundle $b$ in the BI graph. Even though we utilize both the UI and BI graphs in this view, only the UI CF signals are well captured via bi-directional graph learning, while the BI graph is merely used to pool the item embeddings into bundle representations and remains under-explored. Therefore, we incorporate a third view to specifically mine the BI graph patterns, presented in the following section.
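The view-specific learning above (LightGCN propagation with layer pooling, then item-to-bundle average pooling over the BI graph) can be sketched in a few lines. This is a minimal NumPy illustration of the equations in this section, not the authors' released implementation; dense matrices are used only for readability:

```python
import numpy as np

def norm_adj(X):
    """Symmetrically normalized adjacency D^{-1/2} A D^{-1/2} of the
    bipartite graph induced by an interaction matrix X (n_rows x n_cols)."""
    n_r, n_c = X.shape
    A = np.zeros((n_r + n_c, n_r + n_c))
    A[:n_r, n_r:] = X
    A[n_r:, :n_r] = X.T
    deg = A.sum(axis=1)
    d = np.where(deg > 0, deg ** -0.5, 0.0)  # guard isolated nodes
    return d[:, None] * A * d[None, :]

def lightgcn(X, E0, num_layers=2):
    """LightGCN propagation with mean pooling over layers 0..K."""
    A_hat = norm_adj(X)
    embs = [E0]
    for _ in range(num_layers):
        embs.append(A_hat @ embs[-1])
    return np.mean(embs, axis=0)

def bundles_from_items(Z, item_emb):
    """Average the affiliated items' embeddings into bundle representations,
    guided by the bundle-item affiliation matrix Z (n_b x n_i)."""
    sizes = np.maximum(Z.sum(axis=1, keepdims=True), 1.0)  # guard empty bundles
    return (Z @ item_emb) / sizes
```

For the UB view, `lightgcn` is applied to $\mathbf{X}$ with the stacked user/bundle embedding table; the UI and BI views reuse the same kernel on $\mathbf{Y}$ and $\mathbf{Z}$, and `bundles_from_items` implements the uni-directional item-to-bundle pooling.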

Bundle-Item Affiliation View.
The bundle-item affiliation view concentrates on the bundle composition information, which should be injected into both user and bundle representations. We conduct bi-directional information propagation over the bundle-item affiliation graph while employing a uni-directional information aggregation from items to users over the user-item interaction graph. To be noted, this view does not incorporate any additional data compared with the user-item interaction view; nevertheless, it shifts the focus from the user-item CF patterns to the bundle-item composition patterns. Therefore, the bundle composition information affects both user and bundle representations in this view, endowing the overall model with more information from bundle composition patterns.
Based on the bundle-item affiliation matrix $\mathbf{Z}$, we curate a bipartite BI graph, and then similarly adopt a LightGCN kernel to learn on the graph as follows:

$$e_b^{BI,(k)} = \sum_{i \in \mathcal{N}_b^{BI}} \frac{1}{\sqrt{|\mathcal{N}_b^{BI}|}\sqrt{|\mathcal{N}_i^{BI}|}}\, e_i^{BI,(k-1)}, \qquad e_i^{BI,(k)} = \sum_{b \in \mathcal{N}_i^{BI}} \frac{1}{\sqrt{|\mathcal{N}_i^{BI}|}\sqrt{|\mathcal{N}_b^{BI}|}}\, e_b^{BI,(k-1)},$$

where $e_b^{BI,(k)}, e_i^{BI,(k)} \in \mathbb{R}^d$ are the $k$-th layer's embeddings for bundle $b$ and item $i$; $\mathcal{N}_b^{BI}$ and $\mathcal{N}_i^{BI}$ are the neighbors of bundle $b$ and item $i$ in the BI graph. The superscript $BI$ denotes that the representations belong to the bundle-item affiliation view. As in the other two views, we aggregate the embeddings from different layers and obtain the layer-pooled representations $e_b^{BI}$ and $e_i^{BI}$. To obtain the user representation according to the bundle-item compositional representations, we aggregate the representations of the items that each user has interacted with:

$$e_u^{BI} = \frac{1}{|\mathcal{N}_u^{UI}|}\sum_{i \in \mathcal{N}_u^{UI}} e_i^{BI},$$

where $\mathcal{N}_u^{UI}$ is the set of items that user $u$ has interacted with in the UI graph.

Multi-view Fusion and Prediction
By performing the graph convolutional operations on the three view-specific graphs (i.e., the UB, UI, and BI graphs), three representations are obtained for each user and bundle, i.e., $\{e_u^{UB}, e_u^{UI}, e_u^{BI}\}$ and $\{e_b^{UB}, e_b^{UI}, e_b^{BI}\}$. In this section, we introduce a simple strategy to fuse the multi-view representations with a set of view coefficients and yield the final representations for prediction. More importantly, we deem that the inner-product prediction function applied to the early-fused representations is able to capture both ego-view and cross-view user preference. Therefore, the explicit cross-view alignment proposed by CrossCBR [28], which is cumbersome for computation and optimization, is no longer required.

Multi-view Fusion.
We aim to fuse the heterogeneous views into a unified one at the representation level. Previous approaches [5,28] just combine the predictions of two views (as shown in Figure 1), neglecting the feature-level cooperation. Moreover, they just use equal weights to combine the views, which is sub-optimal for cross-view cooperation since each view contributes differently to the overall preference modeling. Hence, in order to capture the heterogeneity of the views and achieve better cooperative effects across them, we early-fuse the multi-view representations with a set of view coefficients, obtaining the final user and bundle representations $e_u$ and $e_b$:

$$e_u = \lambda_1 e_u^{UB} + \lambda_2 e_u^{UI} + \lambda_3 e_u^{BI}, \qquad e_b = \lambda_1 e_b^{UB} + \lambda_2 e_b^{UI} + \lambda_3 e_b^{BI}, \quad (9)$$

where $\{\lambda_1, \lambda_2, \lambda_3\}$ are the view coefficients that balance the contributions of the three views.

Prediction.
We generate the preference score $\hat{y}_{u,b}$ for user $u$ on bundle $b$ by applying the inner-product between the overall user and bundle representations, i.e., $\hat{y}_{u,b} = e_u \cdot e_b$. Although it is common to employ the inner-product as the preference scoring function in recommender systems, we argue that the inner-product on the fused representations inherently models both the cross-view and ego-view preference. By substituting the fused representations of Equation 9 into the inner-product, we have:

$$\hat{y}_{u,b} = \Big(\sum_{p} \lambda_p e_u^{p}\Big) \cdot \Big(\sum_{q} \lambda_q e_b^{q}\Big) = \underbrace{\sum_{p} \lambda_p^2\, e_u^{p} \cdot e_b^{p}}_{\text{ego-view preference}} + \underbrace{\sum_{p \neq q} \lambda_p \lambda_q\, e_u^{p} \cdot e_b^{q}}_{\text{cross-view preference}}, \quad p, q \in \{UB, UI, BI\}, \quad (10)$$

where the cross-view preference is the inner-product of the user and bundle representations from different views, while the ego-view preference is the inner-product of the user and bundle representations from the same view. Contemporary methods [5,28] follow a late-fusion strategy, in which they separately calculate the preference for each view and then sum the prediction scores. In other words, they just model the ego-view preference. In contrast, our early fusion strategy inherently enables the model to capture both ego-view and cross-view preference. To be noted, CrossCBR can also achieve cross-view cooperation through cross-view contrastive learning; however, it merely increases the cosine similarity of the representations from different views. We argue that our way of cross-view preference modeling is more direct and effective.
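The ego/cross decomposition of Equation 10 can be verified numerically: expanding the inner-product of the fused representations recovers exactly the $\lambda$-weighted ego-view and cross-view terms. A small NumPy check (the embeddings and coefficients below are random placeholders, not learned values):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
views = ["UB", "UI", "BI"]
lam = [0.5, 0.3, 0.2]                         # view coefficients
e_u = {v: rng.normal(size=d) for v in views}  # per-view user embeddings
e_b = {v: rng.normal(size=d) for v in views}  # per-view bundle embeddings

# early fusion, then a single inner-product prediction
u = sum(l * e_u[v] for l, v in zip(lam, views))
b = sum(l * e_b[v] for l, v in zip(lam, views))
fused_score = u @ b

# expand into ego-view (p == q) and cross-view (p != q) terms
ego = sum(lp * lq * (e_u[p] @ e_b[q])
          for lp, p in zip(lam, views) for lq, q in zip(lam, views) if p == q)
cross = sum(lp * lq * (e_u[p] @ e_b[q])
            for lp, p in zip(lam, views) for lq, q in zip(lam, views) if p != q)
assert np.isclose(fused_score, ego + cross)
```

The identity holds for any number of views, which is why early fusion yields cross-view preference modeling at no extra cost.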

Joint Optimization
We employ a joint optimization protocol by combining the self-supervised contrastive loss with the typical BPR (Bayesian Personalized Ranking [33]) loss.

Contrastive Loss.
Contrastive learning has witnessed great success in computer vision [9,11], natural language processing [11], and graph learning [21,30,37,54,55]. Soon after, recommender systems adapted contrastive learning to various recommendation scenarios, including CF-based user-item [43,51], sequential (session-based) [53], cold-start [41], multi-behavior [45], and cross-domain [47] recommendations. CrossCBR [28] and MIDGN [52] are the first to incorporate contrastive learning into bundle recommendation. CrossCBR constructs cross-view contrastive losses to enhance the representation affinity of the same user (bundle) from different views, while decreasing that of different users (bundles). Analogously, MIDGN builds contrastive losses between global and local views, where different views of the same user (bundle) form positive pairs. The main idea behind these two works is to use cross-view contrastive loss to achieve cross-view cooperation.
In this paper, we discard the explicit cross-view contrastive loss and just adopt a simple self-supervised contrastive loss on the unified representations. Since the "early fusion" strategy already endows MultiCBR with the capability of multi-view cooperation modeling, it is no longer necessary to build cross-view contrastive losses, and the self-supervised contrastive loss is sufficient to enhance the model with capabilities such as countering sparsity or noise, as reported in previous works [43,51].
To construct the positive pairs for contrastive learning and to enhance the robustness against potential noise, we adopt data augmentation to generate different representations of the same user or bundle. In detail, for each user $u$, we obtain two different representations $e'_u$ and $e''_u$ of the same user under different data augmentations; $e'_u$ and $e''_u$ are considered a positive user pair. Similarly, for each bundle $b$, $e'_b$ and $e''_b$ are generated as the positive bundle pair. For data augmentation, we follow previous works [28,43,49,51] and adopt three different methods: edge dropout (ED), message dropout (MD), and noise augmentation (Noise). Edge dropout is based on the graph structure: we randomly drop a small percentage of edges in the input graph to generate an augmented graph, and different representations are obtained by learning on the augmented graph structures. Both message dropout and noise augmentation are embedding-based. Message dropout randomly masks some elements of the propagated embeddings with a certain dropout ratio. Noise augmentation directly adds a small noise vector $\Delta$ subject to $\|\Delta\|_2 = \epsilon$, where $\epsilon$ is a small constant. Stronger data augmentation methods could further enhance the model performance, and data augmentation techniques developed in general self-supervised graph contrastive learning can be easily adapted to our framework without any specialized consideration; we leave this for future work to explore.
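The three augmentations can be sketched as follows; this is a NumPy illustration in which the dropout ratios and $\epsilon$ are hypothetical values, with embeddings stored as rows:

```python
import numpy as np

def edge_dropout(X, ratio, rng):
    """Randomly drop a fraction `ratio` of the edges of an interaction matrix."""
    return X * (rng.random(X.shape) >= ratio)

def message_dropout(E, ratio, rng):
    """Randomly mask elements of the propagated embeddings."""
    return E * (rng.random(E.shape) >= ratio)

def noise_augment(E, eps, rng):
    """Add a per-row random perturbation Delta with ||Delta||_2 = eps."""
    delta = rng.normal(size=E.shape)
    delta = eps * delta / np.linalg.norm(delta, axis=-1, keepdims=True)
    return E + delta
```

Running any one of these twice with different random draws yields the two augmented representations that form a positive pair.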
For the contrastive loss, we adopt InfoNCE [19], built upon the generated contrastive pairs $(e'_u, e''_u)$ and $(e'_b, e''_b)$:

$$L_{CL}^{U} = \frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} -\log \frac{\exp(\cos(e'_u, e''_u)/\tau)}{\sum_{v \in \mathcal{U}} \exp(\cos(e'_u, e''_v)/\tau)}, \qquad L_{CL}^{B} = \frac{1}{|\mathcal{B}|}\sum_{b \in \mathcal{B}} -\log \frac{\exp(\cos(e'_b, e''_b)/\tau)}{\sum_{p \in \mathcal{B}} \exp(\cos(e'_b, e''_p)/\tau)},$$

where $\cos(\cdot,\cdot)$ is the cosine similarity function and $\tau$ is the temperature hyper-parameter of the softmax. We adopt the popular implementation of in-batch negative sampling [43] to construct the negative pairs. To be noted, we only need two contrastive loss terms no matter how many views are included. Meanwhile, for the previous cross-view contrastive learning frameworks, when the number of views increases (e.g., the introduction of a third view in our work), the number of cross-view contrastive loss terms grows quadratically, causing extra computational and optimization overhead. In summary, MultiCBR not only retains the advantages of contrastive learning but also reduces the computational and optimization overhead.
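A minimal NumPy version of this objective with in-batch negatives (row $i$ of the two augmented batches forms the positive pair); this is an illustrative sketch rather than the released implementation:

```python
import numpy as np

def info_nce(E1, E2, tau=0.25):
    """InfoNCE over two augmented embedding batches E1, E2 (n x d):
    row i of E1 and E2 is a positive pair; the other in-batch rows of E2
    act as negatives for row i of E1."""
    E1 = E1 / np.linalg.norm(E1, axis=1, keepdims=True)
    E2 = E2 / np.linalg.norm(E2, axis=1, keepdims=True)
    sims = (E1 @ E2.T) / tau  # pairwise cosine similarity / temperature
    pos = np.diag(sims)
    return float(np.mean(np.log(np.exp(sims).sum(axis=1)) - pos))
```

The loss is small when each row is most similar to its own augmentation and large when positives are misaligned, which is exactly the affinity structure the section describes.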
Optimization.
The BPR (Bayesian Personalized Ranking [33]) loss has been widely used as the main learning objective for bundle recommendation models:

$$L_{BPR} = \sum_{(u,b,b') \in Q} -\ln \sigma(\hat{y}_{u,b} - \hat{y}_{u,b'}),$$

where $b'$ is a negative sample randomly chosen from the bundles that user $u$ has not interacted with, $Q = \{(u, b, b') \mid u \in \mathcal{U},\; b, b' \in \mathcal{B},\; x_{ub} = 1,\; x_{ub'} = 0\}$, and $\sigma(\cdot)$ is the sigmoid function. Finally, we optimize the whole framework by the joint loss:

$$L = L_{BPR} + \beta_1 (L_{CL}^{U} + L_{CL}^{B}) + \beta_2 \|\Theta\|_2^2,$$

where $\beta_1$ and $\beta_2$ are the weights balancing the loss terms and $\|\Theta\|_2^2$ is the L2 regularization term. To be noted, same as CrossCBR, our model only requires three sets of embeddings for users, bundles, and items, denoted as $\{E_U^{(0)}, E_B^{(0)}, E_I^{(0)}\}$.
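The BPR term reduces to a softplus of the score margin, which can be computed stably; a short NumPy sketch (combining it with the contrastive and regularization terms follows the joint loss above):

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """BPR: mean of -ln(sigmoid(y_pos - y_neg)) over sampled (u, b, b')
    triplets; logaddexp(0, -x) is a stable form of ln(1 + exp(-x))."""
    margin = pos_scores - neg_scores
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

A large positive margin drives the loss toward zero, while a zero margin gives exactly $\ln 2$.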

Complexity Analysis
For space complexity, the parameters of MultiCBR only include the three sets of embeddings $\{E_U^{(0)}, E_B^{(0)}, E_I^{(0)}\}$, so the total space complexity of MultiCBR is $O((M + N + O)d)$. Compared with CrossCBR, MultiCBR introduces the additional bundle-item affiliation view without adding extra parameters.
For time complexity, the computational cost of MultiCBR mainly comes from the graph learning of the three views and the contrastive learning. The time complexity of graph learning is proportional to the number of edges in the UB, UI, and BI graphs, the number of propagation layers $K$, and the embedding dimension $d$. The time complexity for calculating the contrastive loss is the same as that of CrossCBR, since only two loss terms are computed over in-batch pairs. Owing to the "early fusion and late contrast" structure, the time consumption of the contrastive loss does not increase as the number of views grows.

EXPERIMENTS
To evaluate our proposed approach, we conduct experiments on three public bundle recommendation datasets: Youshu, NetEase, and iFashion. In particular, we aim to answer the following research questions:
• RQ1: Can MultiCBR outperform the SOTA baseline models?
• RQ2: Are all the views and the "early fusion and late contrast" framework helpful for the overall performance?
• RQ3: What are the key characteristics of MultiCBR w.r.t. user preference modeling and efficiency?

Experimental Settings
Datasets: We follow the three public bundle datasets used by CrossCBR, i.e., Youshu [8], NetEase [4], and iFashion [10,27], corresponding to the scenarios of book lists, music lists, and fashion outfits, respectively. We adopt the same training, validation, and testing splits as CrossCBR. The dataset statistics are depicted in Table 1. NDCG@K and Recall@K are leveraged as the evaluation metrics, where $K \in \{20, 40\}$ and the all-ranking protocol is used.

Baselines: The user-item recommender models just utilize the user-bundle interaction data without considering the affiliated items within each bundle. The following methods are considered:
• MFBPR [33]: Matrix Factorization optimized by the BPR loss.
• LightGCN [22]: it utilizes a light-weight graph learning kernel to model the CF signals.
• SGL [43]: it applies self-supervised contrastive learning to the LightGCN model and achieves SOTA performance.
• XSimGCL [50]: it also leverages self-supervised contrastive graph learning, while novelly proposing to add small-scale random noise as data augmentation.
• LightGCL [3]: this method utilizes SVD to generate augmented views for self-supervised contrastive learning.
The bundle-specific recommender models take into account all the relations among users, bundles, and items. We incorporate the following baselines:
• DAM [8]: it generates the bundle representation from its included items through an attention mechanism and employs multi-task learning to learn both user-item and user-bundle preferences.
• BundleNet [12]: a tripartite graph among users, bundles, and items is leveraged to learn the representations via a GNN.
• BGCN [5]: it organizes the user preference into bundle and item views, each of which is modeled by a GCN module. It first makes predictions on each view and then sums the prediction scores.
• MIDGN [52]: it formulates user preference into local and global views and leverages intent disentanglement to learn users' intents for bundle recommendation.
• CrossCBR [28]: the SOTA method, which utilizes cross-view contrastive learning to achieve cross-view cooperative association based on BGCN's view construction.
Among all the baselines, SGL, XSimGCL, LightGCL, MIDGN, and CrossCBR incorporate contrastive learning and are stronger than the non-contrastive baselines. We re-implement several of the most representative baselines, i.e., SGL, XSimGCL, LightGCL, BGCN, and CrossCBR, and directly take the reported results of the other models from the original papers of CrossCBR and MIDGN, since we use the identical datasets and splits. Since MIDGN does not consider the iFashion dataset, we use its released code to obtain the result. For all the models mentioned above and our proposed MultiCBR, the performance could vary across training runs due to randomness, which is relatively more significant on small datasets like Youshu. For a fair comparison, we repeat the model training three times with the same hyper-parameters and report the average performance.

Performance Comparison (RQ1)
The overall performance is presented in Table 2. First, MultiCBR outperforms all the baseline methods on all three datasets. Especially on iFashion, we improve the SOTA performance by a large margin. This may be because the bundle-item graph is very sparse in iFashion (there are only 3.86 items per bundle on average, as shown in Table 1) and the newly introduced BI view is crucial for capturing the valuable BI composition patterns. For similar reasons, the performance improvement on Youshu is relatively weaker: bundles in Youshu generally have more items, and more importantly, most items in those bundles possess rich user-item interactions. As a result, introducing the BI view, which is designed to counter the BI sparsity problem, helps less on Youshu. We further discuss the effects of graph learning on the BI view in Section 3.4.1. Second, the contrastive learning-based methods achieve better performance. Among the user-item recommender models, SGL, XSimGCL, and LightGCL perform similarly and are better than the methods without contrastive learning. Analogously, MIDGN and CrossCBR beat all the other bundle-specific methods. This phenomenon shows that contrastive learning is a powerful and general approach to boosting bundle recommendation. Third, CrossCBR is the strongest baseline, justifying that the two-view formulation (a.k.a. bundle and item views) plus cross-view contrastive learning is better than the view formulation in MIDGN (a.k.a. local and global views). Finally, MultiCBR outperforms CrossCBR, indicating the effectiveness of the three-view representation learning and the novel framework of "early fusion and late contrast".

Ablation Study (RQ2)
To further identify the contribution of each module of MultiCBR, we conduct a series of ablation studies. Since the performance improvement on Youshu is marginal, the ablation study and the model study (in the following subsection) are mainly conducted on the NetEase and iFashion datasets. First, in order to verify the effectiveness of each view, we individually remove the UI and BI views and construct two ablated models, MultiCBR-UI and MultiCBR-BI. According to the results in Table 3, removing either the UI or the BI view degrades the performance. Nevertheless, MultiCBR-UI and MultiCBR-BI are still much better than SGL, which only has the UB view. We can conclude that both the UI and BI graphs provide valuable information for user-bundle preference modeling. Second, we aim to investigate whether the "early fusion and late contrast" framework is useful. Even though it is cumbersome to generalize CrossCBR from two views to multiple views, it is still doable to adapt the pair-wise cross-view contrastive loss to three views. In particular, we add the BI view to CrossCBR and build an extension, named CrossCBR+BI, which includes six cross-view contrastive losses for users and bundles on three pairs of views: UB-UI, UB-BI, and UI-BI. To be noted, CrossCBR+BI is a faithful implementation of the "early contrast and late fusion" strategy adopted by CrossCBR, instead of purely incorporating an additional BI view. From the results in Table 3, we have the following observations. First, CrossCBR+BI outperforms CrossCBR on iFashion but under-performs CrossCBR on NetEase, implying that modeling multiple views is more challenging than modeling two views, where the design for effective multi-view cooperation is the crux. The cross-view contrastive learning framework may fail to capture the cooperation across multiple views or may even worsen the performance with the introduction of the BI view. Second, MultiCBR beats CrossCBR+BI on both datasets, showing that the "early fusion and late contrast" framework is more effective than the "early contrast and late fusion" framework in exploiting the multi-view inputs.
Third, to quantitatively validate the effectiveness of the contrastive loss, we remove it from our model and build a variant, MultiCBR-CL. Compared with MultiCBR, MultiCBR-CL dramatically decreases the performance and is even worse than CrossCBR-CL (the variant constructed by removing the contrastive loss module from CrossCBR). That is to say, "late contrast" is a prerequisite for the success of "early fusion", and this specific ordered combination is the key to MultiCBR.

Model Study (RQ3)
In this section, we aim to investigate the key characteristics of MultiCBR: 1) how does the auxiliary BI view help to improve the model performance, especially in terms of the BI sparsity issue? 2) how does MultiCBR capture the cross-view and ego-view preference under the "early fusion and late contrast" framework? 3) how do the key hyper-parameters, i.e., the view coefficients, the contrastive loss weight, and the temperature in the contrastive loss, affect the model performance? 4) how about the computational efficiency of MultiCBR?

Effects of the BI View.
Due to the lack of graph propagation on the BI view, the performance of CrossCBR is largely restricted by the sparsity of the bundle-item graph, especially on the iFashion dataset, where the average bundle size is only 3.86. Additionally, some items in the bundles never interact with any user, and the embeddings of such items cannot be learned without the BI view. MultiCBR alleviates this weakness and enhances the recommendation performance by learning from all of the UB, UI, and BI views.
B-I Sparsity Issue. Among the three datasets, the performance improvement on iFashion is the largest, which we attribute to MultiCBR addressing the BI sparsity issue, since the BI connections in iFashion are the sparsest. However, according to the dataset statistics in Table 1, Youshu and NetEase are relatively dense in terms of the BI affiliation graph, since each bundle has many more items than in iFashion. A natural question arises: does MultiCBR still perform well on sparser versions of Youshu and NetEase? To verify this hypothesis, we randomly drop a certain ratio of BI connections in Youshu and NetEase, generating multiple sparse datasets with different levels of BI sparsity. The results are shown in Table 4 and Table 5. On the Youshu dataset, the relative performance improvements on the sparse datasets are more significant than on the original dataset (a.k.a. No Drop), demonstrating that MultiCBR performs much better when the BI connections are sparser. On both the Youshu and NetEase datasets, the overall performance of both CrossCBR and MultiCBR generally decreases as the drop ratio increases, showing that the sparser the BI graph, the more challenging the task. However, the relative performance improvement of MultiCBR over CrossCBR generally increases as the sparsity ratio grows, showing that MultiCBR is better than CrossCBR at tackling the BI sparsity issue.
B-I-U Sparsity Issue. MultiCBR can alleviate the sparsity issue also because it is able to learn better representations of items without user-item interactions. To verify this hypothesis, we split the bundles in the iFashion and NetEase datasets into several groups according to the percentage of items without user interactions in each bundle (denoted as the sparsity rate in Figure 3). We call this the B-I-U sparsity issue. The hit rates of the top-20 recommendation results (denoted as hit@20) of MultiCBR and CrossCBR are evaluated on each group of bundles. The results are shown in Figure 3, where the horizontal axis indicates the range of the sparsity rate of bundles in the same group. The lines in Figure 3 represent the evaluated performance of CrossCBR and MultiCBR, and the bars measure the relative improvement of MultiCBR in each group. As shown in Figure 3, when the percentage of items without user interactions in the bundle increases, the performance of both CrossCBR and MultiCBR decreases, implying the difficulty of representation learning for items without user interactions. Regarding the relative improvement of MultiCBR, in the first bundle group of NetEase, where all items in the bundle possess at least one user interaction, CrossCBR even slightly outperforms MultiCBR. But as the
percentage of items without user interaction increases, MultiCBR soon outperforms CrossCBR, and the relative improvement grows rapidly.This phenomenon indicates that representation learning on BI view in MultiCBR effectively enhances the representation learning of items without user interactions, causing better overall performance.In iFashion dataset, MultiCBR outperforms CrossCBR significantly in all three bundles groups.We attribute this to the small average bundle size in iFashion dataset, where BI interactions are always sparse and graph learning on BI view greatly enhances the model performance.This further demonstrates the superiority of MultiCBR in modeling sparse BI signals.
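The grouping of bundles by B-I-U sparsity rate described above can be sketched as follows; the function names and bucket boundaries are illustrative assumptions, not the paper's exact implementation.

```python
def biu_sparsity_rate(bundle_items, interacted_items):
    """Fraction of a bundle's items that appear in no user-item interaction."""
    cold = sum(1 for item in bundle_items if item not in interacted_items)
    return cold / len(bundle_items)

def group_by_sparsity(bundles, interacted_items):
    """Bucket bundle ids into three sparsity-rate groups: exactly 0,
    (0, 0.5], and (0.5, 1.0]; hit@20 is then evaluated per group."""
    groups = {"0%": [], "(0,50%]": [], "(50%,100%]": []}
    for bid, items in bundles.items():
        r = biu_sparsity_rate(items, interacted_items)
        if r == 0.0:
            groups["0%"].append(bid)
        elif r <= 0.5:
            groups["(0,50%]"].append(bid)
        else:
            groups["(50%,100%]"].append(bid)
    return groups
```

With such a grouping, per-group hit@20 of two models can be compared directly, as in Figure 3.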

Cross- and ego-view preference modeling.
The key advantage of MultiCBR lies in its ability to inherently model both cross-view and ego-view preferences, as presented in Equation 10. To justify this hypothesis, we separate the cross-view and ego-view terms of Equation 10 and sum them up as the prediction scores for the cross-view and ego-view preferences, respectively. We rank all the candidate bundles according to the prediction scores and obtain the performance, which directly reflects the corresponding preference modeling capacity. In Figure 4, we plot the cross-view, ego-view, and overall performance for three different models: CrossCBR, CrossCBR+BI, and MultiCBR. Overall, MultiCBR performs best on all three types of prediction, indicating that our model is able to explicitly optimize both cross- and ego-view preferences and thus achieves the best overall performance. Comparing CrossCBR with MultiCBR on NetEase, the gap on cross-view prediction is larger than that on ego-view prediction, showing that our method excels at modeling cross-view preference. Once again, CrossCBR+BI under-performs CrossCBR on NetEase for both ego-view and cross-view prediction, showing that, without an effective framework, the multiple views may fail to cooperate.
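The separation into ego-view and cross-view terms can be illustrated with a small sketch: the fused score is the inner product of the summed per-view representations, the same-view products form the ego-view part, and the remainder is the cross-view part (view coefficients are omitted for brevity; this is a simplified assumption of the actual formulation in Equation 10).

```python
import numpy as np

def decompose_score(user_views, bundle_views):
    """Split the fused inner-product score into ego-view (same view index)
    and cross-view (different view indices) contributions."""
    # Ego-view: products between representations of the same view.
    ego = sum(float(u @ b) for u, b in zip(user_views, bundle_views))
    # Overall score: inner product of the fused (summed) representations.
    total = float(sum(user_views) @ sum(bundle_views))
    # Cross-view: everything that is not an ego-view term.
    cross = total - ego
    return ego, cross, total
```

Ranking candidates by `ego` or `cross` alone yields the ego-view and cross-view performance, respectively.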
We also analyze the alignment-dispersion [15,28,38] properties of the learned representations to characterize our model, as shown in Table 6. In particular, we calculate the cosine similarity between the representations of the same user (bundle) from different views to characterize the alignment properties of the model, denoted as A_U and A_B for users and bundles, respectively (where the subscript indicates the alignment pair, e.g., A_{UB,UI} for users represents the user representation alignment between the UB and UI views). We calculate the cosine similarity between the representations of different users (bundles) to characterize the dispersion properties of the model, denoted as D_U and D_B for users and bundles, respectively (for CrossCBR+BI, which does not fuse the representations into a unified one, we first calculate the dispersion of each view and then average them to obtain the overall dispersion).

Table 6. The cross-view alignment and dispersion analysis of the representations. For alignment, larger is better; for dispersion, lower is better.

Surprisingly, for almost all of the alignment and dispersion metrics, CrossCBR+BI is better than MultiCBR. This phenomenon reminds us that, in multi-view bundle recommendation, enhancing cross-view affinity or ego-view dispersion may not be exactly consistent with the learning objective, i.e., the user-bundle preference. Cross-view alignment is based on intuition and empirical results, and there is no solid proof of why explicitly aligning the views should enhance performance. We discard the cross-view alignment and directly model the cross-view and ego-view preferences, which proves to be a better solution.
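The alignment and dispersion metrics above reduce to plain cosine similarities; the following is a minimal sketch assuming row-wise entity embeddings, not the exact evaluation script.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment(view_x, view_y):
    """Mean cosine similarity between two views' representations of the
    same entity (rows are paired); larger means better alignment."""
    return float(np.mean([cosine(x, y) for x, y in zip(view_x, view_y)]))

def dispersion(reps):
    """Mean pairwise cosine similarity between different entities;
    lower means more dispersed (more uniform) representations."""
    n = len(reps)
    sims = [cosine(reps[i], reps[j]) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(sims))
```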

Effects of data augmentation.
We try various data augmentation methods to generate positive pairs for contrastive learning and present the performance of MultiCBR with different data augmentation methods in Table 7, where MultiCBR_ED denotes Edge Dropout, MultiCBR_MD denotes Message Dropout, and MultiCBR_Noise denotes Noise augmentation. To alleviate the influence of randomness, for each data augmentation we repeat the model training three times using the fixed best hyper-parameters we found and report the average performance in Table 7. We further run an ANOVA significance test [16] over the performance of MultiCBR under the different data augmentations. All of the evaluated p-values are below 0.05,

Fig. 5. BI view coefficient analysis on NetEase and iFashion. The horizontal axis of each sub-figure is the view coefficient of the BI view, and the vertical axis is the NDCG@20 score of the model.
which indicates that data augmentation has a statistically significant influence on MultiCBR's performance. The results in the table also demonstrate that MultiCBR retains outstanding performance, far beyond the baselines, under all three data augmentations.
Comparing the three data augmentation methods, noise augmentation attains the best overall performance, suggesting that proper and stronger data augmentations can further enhance model performance.
To make the comparison between MultiCBR and the SOTA method CrossCBR fair, we also apply the advanced Noise augmentation to CrossCBR, denoted as CrossCBR_Noise. The results on the NetEase and iFashion datasets are shown in Table 7. We observe that CrossCBR_Noise achieves slightly better or comparable performance relative to the original CrossCBR model, while still clearly underperforming our proposed MultiCBR. This observation implies that different data augmentation approaches may affect the performance of cross- or multi-view contrastive learning, but only to a limited extent.
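The three augmentations compared above can be sketched as follows; the shapes and the exact noise form are assumptions in the spirit of SGL-style dropout and SimGCL-style noise, not the paper's precise implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_dropout(edge_index, p=0.2):
    """Randomly drop a fraction p of graph edges (edge_index: 2 x |E|)."""
    keep = rng.random(edge_index.shape[1]) >= p
    return edge_index[:, keep]

def message_dropout(messages, p=0.2):
    """Randomly zero propagated messages, with inverted-dropout scaling."""
    mask = rng.random(messages.shape) >= p
    return messages * mask / (1.0 - p)

def noise_augmentation(emb, eps=0.1):
    """Perturb each embedding by a random vector of norm eps that keeps
    the sign pattern of the embedding (SimGCL-style noise)."""
    noise = rng.uniform(0.0, 1.0, emb.shape) * np.sign(emb)
    noise /= np.linalg.norm(noise, axis=-1, keepdims=True)
    return emb + eps * noise
```

Each augmentation produces a second "view" of the same node, forming the positive pair for the contrastive loss.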

Hyper-parameter Analysis.
Several key hyper-parameters are crucial for MultiCBR and require careful tuning to achieve optimal performance. We analyze how our model performs as these hyper-parameters change, including the BI view coefficient, the contrastive loss weight, and the temperature of the contrastive loss.
Effects of view coefficients. In this paper, we apply different view coefficients to each graph learning module to capture the heterogeneity of the graphs. In particular, the BI view is newly introduced in MultiCBR, so we pay special attention to the corresponding BI view coefficient, which is crucial to the performance. Figure 5 illustrates the performance change with respect to the BI view coefficient. To further investigate its effect on ego-view and cross-view preference modeling, we also present the ego-view and cross-view model performance: in Equation 10, we keep only the ego-view preference terms and train the model, yielding the ego-view performance; analogously, keeping only the cross-view preference terms yields the cross-view performance. Compared with the case where the BI view is discarded (i.e., the BI view coefficient is set to 0), the performance curves on both datasets demonstrate that the BI view is helpful for the ego-view, cross-view, and overall performance. The BI view contribution that achieves the best performance differs across datasets, indicating that a good balance among the multiple views is crucial for optimal multi-view cooperation.
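The coefficient-weighted multi-view fusion described above can be sketched in a few lines (coefficient values are illustrative; setting the BI coefficient to 0 recovers the two-view model):

```python
import numpy as np

def fuse_views(view_reps, coeffs):
    """Weighted sum of per-view representations (e.g., UB, UI, BI)."""
    assert len(view_reps) == len(coeffs)
    return sum(c * r for c, r in zip(coeffs, view_reps))
```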
Fig. 6. Analysis of the contrastive loss weight λ1. The horizontal axis represents various values of λ1 and the vertical axis shows the corresponding NDCG@20 performance.

Fig. 7. Analysis of the temperature τ. The horizontal axis represents various values of τ and the vertical axis shows the corresponding NDCG@20 performance.
Effects of the contrastive loss weight λ1. The contrastive loss is one of the key components of MultiCBR, and its weight may severely affect performance. We keep all the other hyper-parameters at their optimal values and vary λ1 to observe the performance change. The results on the NetEase and iFashion datasets are shown in Figure 6. We have the following observations: 1) on both datasets, the performance varies substantially as λ1 changes, indicating that an improper setting of the contrastive loss weight may lead to failure of the model; and 2) the best setting of λ1 can be quickly identified via grid search within a small range, demonstrating that our model is easy to tune in practice.
Effects of the temperature τ. Another hyper-parameter that most contrastive learning-based models are sensitive to is the temperature τ in Equation 11. To investigate how τ impacts the model, we fix all the other hyper-parameters at their optimal settings and change only τ. The results illustrated in Figure 7 demonstrate that MultiCBR is quite sensitive to τ, and an improper setting may lead to catastrophic results. Fortunately, analogous to λ1, it is not difficult to obtain the best setting of τ by grid searching over a restricted candidate set. Notably, the behavior of τ that we observe is similar to the conclusion drawn by CrossCBR [28].
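The role of the temperature can be seen in a standard InfoNCE-style contrastive loss; this is a generic sketch, and Equation 11's exact form may differ in details such as negative sampling.

```python
import numpy as np

def info_nce(anchor, positive, tau=0.2):
    """Batch InfoNCE: row i of `anchor` is pulled toward row i of
    `positive` and pushed away from all other rows. A small tau sharpens
    the softmax, strongly penalizing hard negatives; a large tau flattens it."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / tau                        # cosine similarity / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

Because tau rescales all similarities before the softmax, even small changes can reshape the loss landscape, consistent with the sensitivity observed in Figure 7.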

Computational Efficiency.
To demonstrate the computational efficiency of our model, we record the average running time of one training epoch for our model and two baselines on two devices (i.e., Titan V and A5000), as depicted in Table 8. From the results, we have the following observations. First, MultiCBR is more efficient than CrossCBR+BI, since the "early fusion and late contrast" framework includes only two contrastive loss terms while CrossCBR+BI has six. This verifies our hypothesis that the "early fusion and late contrast" design is more efficient than the "early contrast and late fusion" strategy.

RELATED WORK
In this section, we review the related work from three branches: 1) bundle recommendation; 2) graph neural networks for recommendation; and 3) contrastive learning for recommendation.

Bundle Recommendation
Bundle recommendation [35] has evolved from modeling the single user-bundle relation to modeling multiple relations among users, bundles, and items. FPMC [34] treats each bundle as an atomic object and employs factorization models to capture the user-bundle interactions. Later on, researchers realized that a bundle is not an atomic unit and that its constituent items are crucial for user-bundle preference modeling. Early works [4,8] utilize factorization models or attention mechanisms to capture the rich compositional patterns of bundles. In addition, multi-task learning is utilized to concurrently model both user-item and user-bundle preferences, which enhances preference modeling at both the item and bundle levels. Despite the performance improvements of these works [4,8], their backbones are still based on shallow factorization models or attention mechanisms, which cannot model complicated higher-order interactions and thus perform less effectively.
Recently, GNNs have been employed to capture the complicated relations between users, items, and bundles, substantially improving the SOTA performance. Representative methods include BundleNet [12] and BGCN [5]. BundleNet builds a tripartite graph among users, bundles, and items to unify all the relations into one graph, and models the high-order and multi-path interactions over this user-item-bundle tripartite graph via message passing. Different from BundleNet, Chang et al. propose BGCN [5], which separates the relations among users, bundles, and items into two parts and constructs two levels of graphs, each corresponding to one level of user preference. In particular, it constructs the bundle-level graph based on the user-bundle interactions and the item-level graph based on the user-item interactions and bundle-item affiliations. Graph neural networks are then utilized to learn user and bundle representations over the two levels of graphs. BGCN not only achieves competitive performance on benchmark datasets but also contributes a novel and compelling approach to bundle recommendation, i.e., the two-view formulation that captures both bundle-level and item-level user preferences.
With the emergence of contrastive graph learning, bundle recommendation has obtained further enhancements. For example, MIDGN [52] separates the user-bundle preference into a local and a global view and then applies a contrastive loss between the two views. In addition, it adopts the idea of intent disentanglement to model multiple latent intents for each interaction. MIDGN extensively improves the performance of bundle recommendation on the benchmark datasets, demonstrating the effectiveness of contrastive learning. Concurrently, another contrastive learning-based work, CrossCBR [28], was proposed. It builds the bundle and item views following the formulation of BGCN [5], and then leverages cross-view contrastive learning to increase the representation affinity of the same user (bundle) while decreasing that of different users (bundles). CrossCBR significantly boosts the performance of BGCN and largely improves the SOTA with a simple and efficient graph contrastive learning framework. The success of CrossCBR verifies two main ideas: 1) the bundle- and item-view formulation is essential for capturing different levels of user preference; and 2) cross-view contrastive learning, which explicitly models the cooperative association while enlarging the uniformity of representations, is crucial for the performance boost. In this work, MultiCBR inherits the view formulation of CrossCBR and extends the two views to multiple views; it also improves the contrastive learning framework to cope with the multi-view scenario.
In parallel to our formulation of bundle recommendation, some works are highly related to the topic. For example, He et al. extend bundle recommendation to the conversational scenario, formulating conversational bundle recommendation [23]. Different from recommendation, bundle generation is another key problem, which aims to automatically construct bundles based on existing bundles or user preferences [1,6,14,29]. Some other works, such as basket or package recommendation [2,24,31], relax the strict constraint of a pre-defined bundle but have a very similar formulation to bundle recommendation, i.e., recommending a set of items to users. These works are also popular and meaningful; however, their settings and formulations differ from ours. Basket recommendation recommends a set of items to users, but this set is not necessarily a pre-defined bundle. For bundle recommendation, platforms usually construct bundles and treat them as a special type of item. Such bundles can be built by sellers (bundles in e-commerce platforms) or by users in a crowd-sourcing manner (playlists in music platforms). Technically, we can therefore assign an id to each bundle, while it is hard to assign an id to a basket, since a basket does not recur as frequently as a bundle.

Graph Neural Network for Recommendation
Graph neural networks (GNNs) have experienced explosive development in recent years and have become the de facto methods for modeling graph-structured data in various tasks, such as node classification, link prediction, and graph classification; representative methods include GCN [26], GAT [36], and GraphSage [20]. As one type of graph-structured data, user-item interactions are naturally suitable for GNNs, and various GNN-based recommendation methods [22,40,48] have been proposed. He et al. build a highly scalable GCN framework, which is based on random walks and can operate on billions of nodes and edges; it is not only a pioneering work that brings GNNs to recommender models but also provides practical solutions for the deployment of GNN-based recommender systems. Wang et al. formulate the user-item interactions as a bipartite graph and utilize GCN kernels to learn user and item representations [40]; the resulting NGCF achieves great performance in various CF-based recommendation scenarios and applications. Following NGCF, a lighter but more effective CF-based GNN model, LightGCN [22], was proposed. By removing the feature transformation and non-linear activation function, LightGCN captures the high-order CF signals more effectively and efficiently. With the proliferation of contrastive learning, incorporating contrastive learning into GNN-based recommender models further improves the SOTA performance, as introduced in the following subsection.
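LightGCN's simplification can be sketched in a few lines: propagation is just repeated multiplication with the symmetrically normalized adjacency, with no transformation matrices or activations, and the layer outputs are averaged. This is a toy dense-matrix sketch; real implementations use sparse operations.

```python
import numpy as np

def normalize_adj(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def lightgcn_propagate(adj_norm, emb0, num_layers=2):
    """No feature transformation, no non-linearity: propagate and
    average the embeddings of all layers (including layer 0)."""
    embs, e = [emb0], emb0
    for _ in range(num_layers):
        e = adj_norm @ e
        embs.append(e)
    return np.mean(embs, axis=0)
```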
Besides typical user-item recommendation, GNNs have also been widely adapted to various sub-problems of recommendation, such as sequential [7,13], session-based [44], multimedia [42], KG-enhanced [39], and bundle recommendation [5,12]. For example, SR-GNN [44] constructs a graph for each session and employs a GCN to capture the transitional patterns of items within the session. To incorporate knowledge graphs into recommender models, KGAT [39] integrates the knowledge graph into the user-item interaction graph and then employs a GNN to learn representations of users and items over the holistic graph. Analogously, GNNs are suitable for bundle recommendation, where diverse relations exist among users, items, and bundles. By using GNNs to capture the user-item, user-bundle, and bundle-item relations, a series of works has been proposed, such as BundleNet [12], BGCN [5], and CrossCBR [28], as described in the above subsection. In this paper, our model inherits the current merits of GNNs in bundle recommendation and further promotes this research direction.

Contrastive Learning for Recommendation
Recently, contrastive learning has revamped self-supervised learning and achieved enormous progress in areas spanning computer vision [9,11,18], natural language processing (NLP) [15], vision-language tasks [32], and graph learning [21,30,37]. For example, SimCLR [9] presents a simple self-supervised framework to learn visual representations via contrastive learning; it significantly outperforms previous SOTA self-supervised and semi-supervised methods, even matching the performance of a supervised ResNet-50. Inspired by the great success of SimCLR in computer vision, contrastive learning has been quickly adapted to other domains, such as NLP, graph learning, and recommender systems.
Specifically, in the domain of recommender systems, contrastive learning has been incorporated into multiple recommendation scenarios, including typical user-item recommendation [43,51], sequential (session) recommendation [46,53], cold-start recommendation [41], cross-domain recommendation [47], and bundle recommendation [28,52]. The main idea of contrastive learning in recommendation lies in increasing the affinity of positive pairs while decreasing that of negative pairs. The positive pairs are constructed either via stochastic data augmentation [3,43,50,51,53] or based on the inherent multiple views of the same node [28,47,52].
The key to contrastive learning-based recommender models lies in constructing the contrastive pairs. For user-item recommender models, SGL [43] proposes random node dropout, edge dropout, and random walk to generate an augmented view and build a positive pair. SimGCL [51] and XSimGCL [50] find that small-scale random noise is an effective data augmentation method, while LightGCL [3] introduces SVD as a novel data augmentation method. In sequential recommendation, various sequence augmentation methods, such as dropping, replacing, and shuffling items in a sequence, have been applied and have achieved great performance. In bundle recommendation, MIDGN [52] and CrossCBR [28] first construct two views of representations and then apply a contrastive loss over the two views. Even though data augmentation is helpful for performance improvement, the enhancement is marginal according to the report of CrossCBR [28]. In this work, we simply borrow the data augmentation methods of previous works [43,50] and specifically focus on the order of "fusion and contrast", which is an interesting problem in multi-view contrastive learning.

CONCLUSION AND FUTURE WORK
In this paper, we addressed the problem of bundle recommendation with a novel multi-view contrastive learning framework, MultiCBR. We constructed a multi-view representation learning framework that fully exploits all the relational information among users, bundles, and items. In particular, to alleviate the influence of sparse bundle-item affiliation signals, we introduced graph learning on the bundle-item affiliations, resulting in better representation learning and increased performance. Moreover, we adopted an "early fusion and late contrast" strategy, which enhances user preference modeling by directly modeling both ego-view and cross-view user preferences. In addition, compared with previous "early contrast and late fusion" approaches, MultiCBR is more efficient w.r.t. both computational and optimization costs. We conducted extensive experiments on three benchmark datasets, and the results indicate that MultiCBR outperforms SOTA methods; on the iFashion dataset in particular, our method improves over the SOTA by more than 30%. We also conducted diverse ablation and model studies to illustrate the working mechanisms of MultiCBR, i.e., the introduced BI view counters the BI sparsity issue, and "early fusion and late contrast" is an effective and efficient framework.
Despite the effectiveness and efficiency of MultiCBR, multiple key problems deserve further investigation in the future. First, the current multi-view fusion still relies on coarse view-level hyper-parameters (i.e., view coefficients); automatically learning fine-grained user (bundle)-level coefficients for multi-view fusion may achieve better cooperation. Second, our results illustrate that the cosine similarity-based cross-view contrastive loss is not always in line with the recommendation objective; hence, further investigation of the co-effects of the contrastive loss and the BPR loss is promising to demystify this phenomenon. Third, even though the "early fusion and late contrast" framework has been justified in our setting of three views, it is interesting to consider more views, such as multimodal features of the items. Moreover, this multi-view contrastive learning mechanism can be generalized to other problems, such as cross-domain recommendation and multimedia recommendation. Lastly, MultiCBR is a collaborative filtering-based bundle recommendation model that utilizes only relational data. One promising future direction is to introduce multimodal features (e.g., images and descriptions of items) to further boost recommendation quality; modeling user preference with multimodal features could also alleviate the cold-start problem, where relational data is missing for new items in the system.

Fig. 1. Brief comparison between MultiCBR and CrossCBR: 1) MultiCBR extends two views to three views, where the additional BI view aims to capture the bundle composition patterns implied by the BI graph; and 2) MultiCBR adopts an "early fusion and late contrast" framework, which can better model user preference while introducing few extra expenses.

Fig. 2. The overall framework of MultiCBR consists of three parts: 1) multi-view representation learning, including the user-bundle interaction view (UB Int. View), user-item interaction view (UI Int. View), and bundle-item affiliation view (BI Aff. View); 2) multi-view fusion and prediction; and 3) joint optimization of the BPR loss and the self-supervised contrastive loss.
where |E_UB|, |E_UI|, and |E_BI| are the numbers of edges in the UB, UI, and BI graphs, respectively; L is the number of propagation layers, d is the embedding size, s is the number of epochs, and B is the batch size. In comparison, the time complexity of graph learning in CrossCBR is O((2|E_UB| + 2|E_UI| + |E_BI|)Ld). The extra time complexity in MultiCBR caused by introducing the BI view is O((|E_UB| + 2|E_BI|)Ld).

3.1.1 Compared Methods. Both user-item and bundle-specific recommender models are leveraged as baselines.

Fig. 3. The performance comparison among bundle groups with different B-I-U sparsity rates.

Fig. 4. The cross-view and overall performance comparison for several ablated models.
where λ1 and λ2 are hyper-parameters that balance the different loss terms, and ∥Θ∥_2^2 is the L2 regularization of the model parameters.

Table 2. The overall performance comparison, where R represents Recall and N represents NDCG.

Table 4. Performance comparison on the Youshu dataset with the bundle-item graph sparsified at various dropout rates.

Table 5. Performance comparison on the NetEase dataset with the bundle-item graph sparsified at various dropout rates.

Table 7. Performance of MultiCBR under different data augmentations.

Table 8. The efficiency analysis of MultiCBR. Each value is the average running time (in seconds) of one training epoch on the corresponding dataset.

Table 9. The convergence speed comparison of CrossCBR and MultiCBR. Each number is the number of training epochs until convergence of the corresponding model on a certain dataset.

Second, MultiCBR still takes more training time per epoch than CrossCBR. As analyzed in Section 2.5, MultiCBR incorporates one more view than CrossCBR, and the graph learning of the additional BI view results in extra costs. Nevertheless, the BI view is necessary to address the BI sparsity issue and is the main contributor to the performance improvement. In addition to the per-epoch training time, we also record the number of epochs until model convergence. As shown in Table 9, MultiCBR takes the same number of training epochs to converge on the NetEase dataset, while taking fewer on iFashion. Considering both the convergence speed and the per-epoch training time, we conclude that MultiCBR is approximately as efficient as the SOTA method CrossCBR.