Leveraging Multimodal Features and Item-level User Feedback for Bundle Construction

Automatic bundle construction is a crucial prerequisite step in various bundle-aware online services. Previous approaches are mostly designed to model the bundling strategy of existing bundles. However, it is hard to acquire large-scale, well-curated bundle datasets, especially for platforms that have not offered bundle services before. Even for platforms with mature bundle services, many items are included in few or even zero bundles, which gives rise to sparsity and cold-start challenges for bundle construction models. To tackle these issues, we leverage multimodal features, item-level user feedback signals, and bundle composition information to achieve a comprehensive formulation of bundle construction. Nevertheless, such a formulation poses two new technical challenges: 1) how to learn effective representations by optimally unifying multiple features, and 2) how to address the modality missing, noise, and sparsity problems induced by incomplete query bundles. In this work, to address these technical challenges, we propose a Contrastive Learning-enhanced Hierarchical Encoder method (CLHE). Specifically, we use self-attention modules to combine the multimodal and multi-item features, and then leverage both item- and bundle-level contrastive learning to enhance representation learning, thereby countering the modality missing, noise, and sparsity problems. Extensive experiments on four datasets in two application domains demonstrate that our method outperforms a list of SOTA methods. The code and dataset are available at https://github.com/Xiaohao-Liu/CLHE.

[Figure 1 here: motivating examples contrasting the challenges ("I don't have many bundles to train a bundle construction model", "How to bundle sparse or cold-start items?") with our ideas (mix-and-match and semantic similarity from multimodal features; co-purchase and co-interaction signals from item-level user feedback), illustrating how various data correspond to diverse bundling strategies.]

INTRODUCTION
Product bundling has been a popular and effective marketing strategy, tracing back to ancient commercial times and persisting through to today's rapidly growing e-commerce and online services. By combining a set of individual items into a bundle, both sellers (or service providers) and consumers can benefit in multiple ways, from the reduced cost of packaging, shipment, and installation, to promoting sales of old or new items by combining them with popular or essential items at a discount. To implement product bundling, the first and foremost step is constructing bundles from individual items, aka. bundle construction, which is traditionally carried out by human experts. However, the explosive growth of item sets poses significant challenges to such high-cost manual approaches. Hence, automatic approaches to bundle construction are imperative and have garnered more and more attention in recent years. By analyzing prior studies, we find that they mostly build bundles based on the co-occurrence relationships of items in existing training bundles. However, two key problems have not been well studied: 1) previous approaches heavily rely on large-scale, high-quality bundle datasets for training, and 2) they cannot properly handle sparsity and cold-start issues. First, most previous bundle construction methods require high-quality supervision signals from a large set of well-curated bundles. This poses a dilemma, especially for platforms that have not offered a bundle service before or have only deployed one for a short period of time: it is difficult for such platforms to collect sufficient bundle data for training. Second, even for platforms with mature bundle services, the situation is far from ideal due to various cold-start problems. On the one hand, quite a number of items are involved in only a few bundles; consequently, it is challenging to obtain informative representations for these sparse items to construct new bundles. Worse still, many new items that have never been part of previous bundles are continuously pushed online, and how to swiftly bundle these cold-start items with existing warm items is crucial for platforms to promote new products and sustain growth.
To address these challenges, instead of seeking a silver bullet, we are keen on practical solutions that make full use of two large-scale, easy-to-access resources: multimodal features and item-level user feedback. The motivation behind this solution is that these data align well with diverse bundling strategies. First, multimodal features, such as text, image, and audio, contain rich semantic information that helps find either similar or compatible items to form bundles, as shown in Figure 1. More importantly, most items, even sparse and newly introduced ones, usually have one or more such features. A plethora of previous efforts, such as personalized recommendation [47], have demonstrated the efficacy of multimodal features in handling sparse and cold-start items. Second, item-level user feedback provides precious crowd-sourced knowledge that is crucial to bundle construction. Intuitively, items that users frequently co-interact with are strong candidates for bundling. More importantly, a large amount of such user feedback is available even to platforms that do not offer bundle services. Compared with previous works [20], we pioneer the integration of multimodal features and item-level user feedback for bundle construction.
Given these motivations, we aim to leverage both multimodal features and item-level user feedback, along with existing bundles, to develop a comprehensive model for bundle construction. However, it is non-trivial to design a model that captures all three types of information and achieves optimal bundle construction performance. First, how to learn effective representations in each modality and properly capture the cooperative associations among the three modalities is a key challenge. Second, some items might not be associated with user feedback or affiliated with any bundle, so the resulting modality-missing issue may degrade the modeling capability. Moreover, during the inference stage of bundle construction, we usually need to provide several seed items as a partial bundle to initiate the construction process. The incompleteness of the partial bundle imposes noise and sparsity challenges on bundle representation learning, which impedes bundle construction performance.
In this work, to address the aforementioned challenges, we propose a Contrastive Learning-enhanced Hierarchical Encoder (CLHE) for bundle construction. To obtain item representations, we make use of recently proposed large-scale multimodal foundation models (i.e., BLIP [27] and CLAP [50]) to extract the items' multimodal features. Concurrently, we pretrain a collaborative filtering (CF)-based model (i.e., LightGCN [23]) to obtain item representations that preserve the user feedback information. Then, we employ a hierarchical encoder to learn the bundle representation, where self-attention mechanisms fuse the multimodal information and the multi-item representations. To tackle the modality missing problem and the sparsity/noise issues induced by incomplete partial bundles, we employ two levels of contrastive learning [37,49], i.e., item-level and bundle-level, to take full advantage of self-supervision signals. We conduct experiments on four datasets from two domains, and the results demonstrate that our method outperforms multiple leading methods. Various ablation and model studies further justify the effectiveness of the key modules and demonstrate multiple crucial properties of our proposed model. We summarize the key contributions of this work as follows:
• We introduce a pioneering approach to bundle construction that holistically combines multimodal features, item-level user feedback, and existing bundles. This integration addresses prevailing challenges such as data insufficiency and the cold-start problem.

METHODOLOGY
We first formally define the problem of bundle construction by considering all three types of data. Then we describe the details of our proposed method CLHE (as shown in Figure 2).

Problem Formulation
Given a set of items $\mathcal{I} = \{i_1, i_2, \cdots, i_N\}$, each item $i$ has a textual input $t_i$, which can be its title, description, or metadata, and a media input $m_i$, which can be an image, audio, or video of the item. In addition, for the items that have been online for a while, we have collected some item-level user feedback data, denoted as a user-item interaction matrix $\mathbf{X}_{U \times I} = \{x_{ui} \mid u \in \mathcal{U}, i \in \mathcal{I}\}$, where $x_{ui} = 1$ if user $u$ has interacted with item $i$, and $x_{ui} = 0$ otherwise.

Figure 2: The overall framework of our proposed method CLHE, which consists of two main components: a hierarchical encoder (i.e., multimodal feature extraction, item and bundle representation learning) and contrastive learning.

Item Representation
Learning. We first detail the feature extraction process and then present the self-attention encoder.

Multimodal Feature Extraction. We use large-scale multimodal foundation models to extract the textual and media features of items. Compared with previous uni-modal feature extractors, such as those in Computer Vision (CV) [22,36], Natural Language Processing (NLP) [15,34], or audio [8,26], multimodal foundation models are more powerful in capturing the multimodal semantics of the input data, and have been demonstrated to transfer and generalize effectively to various downstream tasks. Concretely, for image data, we use the BLIP [27] model to extract both textual and visual features. For audio data, we use the CLAP [50] model to extract textual and audio features. After feature extraction, we obtain the textual feature $t_i \in \mathbb{R}^{768}$ and media feature $m_i \in \mathbb{R}^{768}$. Given their shared representation space, we perform a simple average pooling over them, resulting in the content feature of the item, denoted as $c_i = \operatorname{average}(t_i, m_i)$.
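As a concrete illustration, the feature-level fusion is a plain average of the two 768-dimensional vectors. A minimal sketch, assuming the BLIP/CLAP features have already been extracted into numpy arrays (the function name is ours):

```python
import numpy as np

def content_feature(t_i: np.ndarray, m_i: np.ndarray) -> np.ndarray:
    """c_i = average(t_i, m_i): fuse the 768-d textual and media features,
    which live in a shared space thanks to the multimodal foundation model."""
    assert t_i.shape == m_i.shape == (768,)
    return (t_i + m_i) / 2.0
```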
Item-level User Feedback Feature Extraction. We employ a well-performing CF-based model, i.e., LightGCN [23], to obtain item representations from user feedback. Specifically, we build a bipartite graph from the user-item interaction matrix, then train a LightGCN¹ model over the bipartite graph, denoted as:

$$p_u^{(l+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}} p_i^{(l)}, \qquad p_i^{(l+1)} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i|}\sqrt{|\mathcal{N}_u|}} p_u^{(l)},$$

where $p_u^{(l)}, p_i^{(l)} \in \mathbb{R}^{d}$ are the embeddings of user $u$ and item $i$ at the $l$-th layer, and $d$ is the dimensionality of the hidden representation; $\mathcal{N}_u$ and $\mathcal{N}_i$ are the neighbors of user $u$ and item $i$ in the user-item interaction graph. We only make use of the item representation $p_i$, which captures the item-level user feedback information. It is obtained by aggregating the item representations over the $L$ layers' propagation, denoted as:

$$p_i = \frac{1}{L+1} \sum_{l=0}^{L} p_i^{(l)}.$$

ID Embedding Initialization. We also initialize an id embedding $v_i \in \mathbb{R}^{d}$ for each item to capture its bundle-item affiliation patterns. Please note that for those items (both during training and testing) that do not have user feedback features, we copy the content feature into the user feedback feature slot. Analogously, for cold-start items that do not have an id embedding, we copy the corresponding content feature into that slot.

¹ Other CF-based models can also be used.
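The propagation and layer-averaged readout above can be sketched in dense matrix form. This is a toy numpy sketch under our own naming; a real implementation would use sparse operations and learn the embeddings by training:

```python
import numpy as np

def lightgcn_item_embeddings(X, P_u, P_i, n_layers):
    """LightGCN-style propagation over the user-item bipartite graph.
    X: (n_users, n_items) binary interaction matrix.
    P_u, P_i: layer-0 user/item embeddings.
    Returns item embeddings averaged over all n_layers+1 layers."""
    d_u = np.maximum(X.sum(axis=1, keepdims=True), 1)   # user degrees |N_u|
    d_i = np.maximum(X.sum(axis=0, keepdims=True), 1)   # item degrees |N_i|
    A = X / (np.sqrt(d_u) * np.sqrt(d_i))               # symmetric normalization
    users, items = [P_u], [P_i]
    for _ in range(n_layers):
        users.append(A @ items[-1])       # p_u^(l+1) aggregates item neighbors
        items.append(A.T @ users[-2])     # p_i^(l+1) aggregates user neighbors
    return np.mean(items, axis=0)         # p_i = (1/(L+1)) * sum_l p_i^(l)
```

With `n_layers=0` this degenerates to the initial item embeddings, matching the readout formula.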
Modality Fusion via Self-attention. Given the three types of features, i.e., $c_i$, $p_i$, and $v_i$, we first apply a feature transformation layer to project the multimodal and user-feedback features into the same latent space as the id embeddings, and then concatenate all three features into a feature matrix $F_i^{(0)} \in \mathbb{R}^{3 \times d}$, denoted as:

$$F_i^{(0)} = \operatorname{concat}(W_c c_i, W_p p_i, v_i),$$

where $W_c \in \mathbb{R}^{768 \times d}$ and $W_p \in \mathbb{R}^{768 \times d}$ are the transformation matrices for the multimodal and user-feedback features, respectively; $\operatorname{concat}(\cdot)$ is the concatenation function. Then, we devise a self-attention layer to model the correlations among the multiple features, denoted as:

$$F_i^{(l+1)} = \sigma\!\left(\frac{F_i^{(l)} W_K^{I} \big(F_i^{(l)}\big)^{\top}}{\sqrt{d}}\right) F_i^{(l)} W_V^{I},$$

where $W_K^{I} \in \mathbb{R}^{d \times d}$ and $W_V^{I} \in \mathbb{R}^{d \times d}$ are the trainable parameters of this item-level encoder, projecting the input feature embeddings into the key and value spaces; $F_i^{(l)} \in \mathbb{R}^{3 \times d}$ is the hidden feature representation at the intermediate layer $l$; $\sigma(\cdot)$ is the softmax function; and $F_i^{(L)}$ denotes the features' representations after $L$ layers of self-attention. We then average the multiple features to obtain the item representation $f_i \in \mathbb{R}^{d}$ after multimodal fusion, formally defined as:

$$f_i = \operatorname{average}\big(F_i^{(L)}\big).$$

2.2.2 Bundle Representation Learning. After obtaining the item representations, we build a second self-attention module to learn the representation of the given partial bundle. For a certain partial bundle $b$, its representation $e_b$ is learned by:

$$\hat{E}_b^{(l+1)} = \sigma\!\left(\frac{\hat{E}_b^{(l)} W_K^{B} \big(\hat{E}_b^{(l)}\big)^{\top}}{\sqrt{d}}\right) \hat{E}_b^{(l)} W_V^{B},$$

where $W_K^{B} \in \mathbb{R}^{d \times d}$ and $W_V^{B} \in \mathbb{R}^{d \times d}$ are the trainable parameters at the bundle level, projecting the input item embeddings into the key and value spaces; $\hat{E}_b^{(l)}$ is the hidden representation at the middle layer $l$, with $\hat{E}_b^{(0)} = \operatorname{concat}(\{f_i\}_{i \in b})$; and $\hat{E}_b^{(L')}$ denotes the items' representations after $L'$ layers of self-attention. We then average over the items to obtain the bundle representation $e_b$, formally defined as:

$$e_b = \operatorname{average}\big(\hat{E}_b^{(L')}\big).$$
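Both encoder levels share the same computational pattern, so one helper suffices. A compact numpy sketch under our assumptions (single-head attention with key/value projections only, random weights and small shapes chosen purely for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encode(F, W_K, W_V, n_layers):
    """Self-attention stack over a small set of d-dim rows (the 3 features
    of one item, or the items of one partial bundle), then mean pooling."""
    d = F.shape[1]
    for _ in range(n_layers):
        F = softmax(F @ W_K @ F.T / np.sqrt(d)) @ F @ W_V
    return F.mean(axis=0)

# Hierarchical use: item level fuses [c_i, p_i, v_i]; bundle level fuses items.
rng = np.random.default_rng(0)
d = 8
W_K, W_V = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
f_items = np.stack([encode(rng.normal(size=(3, d)), W_K, W_V, 2)
                    for _ in range(4)])        # four items in the bundle
e_b = encode(f_items, W_K, W_V, 2)             # partial-bundle representation
```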

Contrastive Learning
Even though the hierarchical encoder can effectively capture the correlations among multiple features and multiple items, it still suffers from noise, sparsity, and even cold-start problems at both the item and bundle levels. Specifically, at the item level, items that have less user feedback or are involved in fewer training bundles are prone to deteriorated representations, i.e., the sparsity issue. Even worse, some cold-start items may have never interacted with any users or been included in any bundles, so the cold-start problem severely deteriorates representation quality. Second, at the bundle level, the partial bundle's representation is susceptible to noise and sparsity issues. Unlike a complete bundle, which is sufficient to depict all the functionalities or properties of the bundle, a given partial bundle only encompasses some of the items. Consequently, the bundle representation may be biased by the arbitrary choice of seed items.
To tackle these problems, we harness contrastive learning at both the item and bundle levels to mine self-supervision signals. Recently, contrastive learning has achieved great success in various tasks, including CV [10], NLP [18], and recommender systems [49]. The main idea is to first corrupt the original data to generate augmented views of the same data point, and then leverage an InfoNCE loss to pull together the representations of the augmented views of the same data point while pushing apart the representations of different data points. The resulting representations are therefore more robust to noise and sparsity.

Item-level Contrastive Learning.
For each item $i$, we take its representation $f_i$ defined in Equation 5 and leverage various data augmentations to generate an augmented view $f_i'$. The item-level data augmentation methods we use include: 1) No Augmentation (NA) [37]: use the original representation as the augmented feature without any augmentation; 2) Feature Noise (FN) [53]: add a small-scale random noise vector to the item's features; 3) Feature Dropout (FD) [49]: randomly drop some values of the feature vectors; and 4) Modality Dropout (MD): drop the whole feature of a randomly selected modality of a randomly selected item. Then, we use InfoNCE [37] to compute the item-level contrastive loss, denoted as:

$$L_{CL}^{item} = -\frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \log \frac{\exp\big(\cos(f_i, f_i')/\tau\big)}{\sum_{j \in \mathcal{I}} \exp\big(\cos(f_i, f_j')/\tau\big)},$$

where $\cos(\cdot)$ is the cosine similarity and $\tau$ is the temperature.
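The loss above, together with one of the augmentations (Feature Dropout), can be sketched as follows, with in-batch negatives; the helper names are ours:

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarity between the rows of A and B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def info_nce(F, F_aug, tau=0.2):
    """InfoNCE over items: f_i vs. its augmented view f'_i (positive),
    all other augmented views in the batch act as negatives."""
    sim = cosine_matrix(F, F_aug) / tau
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def feature_dropout(F, rho, rng):
    """FD augmentation: zero out each feature entry with probability rho."""
    return F * (rng.random(F.shape) >= rho)
```

Note that perfectly aligned, mutually orthogonal views drive the loss toward zero, while misaligned views increase it.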

Bundle-level Contrastive Learning.
For each bundle $b$ and its original representation $e_b$, we also apply various data augmentations to generate an augmented view $e_b'$. The data augmentation methods we leverage include: 1) Item Dropout (ID): randomly drop some items in the bundle; and 2) Item Replacement (IR): randomly select some items in the bundle and replace them with items that do not appear in the bundle. The bundle-level contrastive loss is then computed as:

$$L_{CL}^{bundle} = -\frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} \log \frac{\exp\big(\cos(e_b, e_b')/\tau\big)}{\sum_{q \in \mathcal{B}} \exp\big(\cos(e_b, e_q')/\tau\big)}.$$
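The two bundle-level augmentations operate on the item list before re-encoding; a small sketch with our own function names:

```python
import random

def item_dropout(bundle, rho, rng):
    """ID: randomly drop items with probability rho, keeping at least one."""
    kept = [i for i in bundle if rng.random() >= rho]
    return kept if kept else [rng.choice(bundle)]

def item_replacement(bundle, all_items, rho, rng):
    """IR: replace each item, with probability rho, by an item outside the bundle."""
    outside = [i for i in all_items if i not in bundle]
    return [rng.choice(outside) if outside and rng.random() < rho else i
            for i in bundle]
```

The augmented item list is then passed through the bundle-level encoder to obtain $e_b'$.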

Prediction and Optimization
After obtaining the partial bundle representation $e_b$ and the item representations $f_i$, we leverage the inner product to compute the score $\hat{y}_{bi}$, which indicates the likelihood of item $i$ being included in bundle $b$ to complete it, defined as:

$$\hat{y}_{bi} = e_b^{\top} f_i.$$

To optimize our model, we follow previous approaches [31,51] and leverage the negative log-likelihood loss; the loss for bundle $b$ is thus denoted as:

$$L_{BC}^{b} = -\sum_{i \in b^{+}} \log \sigma(\hat{y}_{b})_{i},$$

where $\sigma(\cdot)$ is the softmax function, which produces probabilities over the entire item set, $\hat{y}_b$ is the vector of scores over all items, and $b^{+}$ denotes the ground-truth items that complete bundle $b$. In collaboration with the contrastive losses and regularization, the final loss is denoted as:

$$L = \sum_{b \in \mathcal{B}} L_{BC}^{b} + \lambda_1 L_{CL}^{item} + \lambda_2 L_{CL}^{bundle} + \lambda \|\Theta\|_2^2,$$

where $\lambda_1$, $\lambda_2$, and $\lambda$ are hyper-parameters balancing the different loss terms, $\|\Theta\|_2^2$ is the L2 regularization term, and $\Theta$ denotes all the trainable parameters of our model.
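Scoring and the softmax-based negative log-likelihood can be sketched as follows (names are ours; `pos_items` stands for the ground-truth completion items):

```python
import numpy as np

def bundle_scores(e_b, F_items):
    """y_hat[i] = <e_b, f_i> for every candidate item."""
    return F_items @ e_b

def nll_loss(e_b, F_items, pos_items):
    """Negative log-likelihood under a softmax over the entire item set."""
    y = bundle_scores(e_b, F_items)
    log_p = y - (y.max() + np.log(np.exp(y - y.max()).sum()))  # stable log-softmax
    return -np.mean(log_p[pos_items])
```

The full training objective would add the two contrastive losses and L2 regularization with their respective weights, as in the final loss above.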

EXPERIMENTS
We evaluate our proposed method on two application domains of product bundling, i.e., fashion outfits and music playlists. We are particularly interested in answering the following research questions:
• RQ1: Does the proposed CLHE method beat the leading methods?
• RQ2: Are the key modules, i.e., the hierarchical transformer and contrastive learning, effective?
• RQ3: How does our method counter the problems of cold-start items, modality missing, and noise and sparsity in the partial bundle? How do the detailed configurations affect its performance, and what is the computational complexity?

Experimental Settings
There are various application scenarios suitable for product bundling, such as e-commerce, travel packages, meals, etc., each of which has one or more public datasets. However, only datasets that include multimodal item features, user feedback data, and bundle data can be used to evaluate our method. Therefore, we choose two representative domains, i.e., fashion outfit and music playlist. We use POG [11] for the fashion outfit domain. For the music playlist domain, we use the Spotify [7] dataset for the bundle-item affiliations and acquire the user feedback data from the Last.fm dataset [2]. Since the average bundle size in POG is quite small (which makes sense for fashion outfits), we re-sample a second version, POG_dense, which has denser user feedback connections for each item. In contrast, the average bundle size in the Spotify dataset is large, so we sample a sparser version, Spotify_sparse, which has a smaller average bundle size. Note that we keep the integrity of all bundles in all versions, i.e., we do not corrupt any bundles during sampling. For each dataset, we randomly split the bundles into training/validation/testing sets with a ratio of 8:1:1. The statistics of the datasets are shown in Table 1. We use the popular ranking protocols Recall@K and NDCG@K as evaluation metrics, where K=20.
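The 8:1:1 split is done at the bundle level, so each bundle stays intact; a minimal sketch:

```python
import random

def split_bundles(bundle_ids, seed=0):
    """Randomly split bundle ids 8:1:1 into train/validation/test.
    Splitting happens at the bundle level, so no bundle is ever corrupted."""
    ids = list(bundle_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```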

Compared Methods.
Due to the new formulation of our work, no previous works share exactly the same setting as ours. Therefore, we pick several leading methods and adapt them to our setting. For a fair comparison, all baseline methods use the same three types of extracted features as our method. In addition, they all use the same negative log-likelihood loss function.
• MultiDAE [51] is an auto-encoder model that uses average pooling to aggregate the items' representations into the bundle representation.
• MultiVAE [31] is a variational auto-encoder model that employs variational inference on top of the MultiDAE method.
• Bi-LSTM [21] treats each bundle as a sequence and uses a bidirectional LSTM to learn the bundle representation.
• Hypergraph [54] formulates each bundle as a hypergraph and devises a GCN model to learn the bundle representation.
• Transformer [3,46] tailors a transformer to capture the item interactions and generate the bundle representation.
• TransformerCL is the version where we add the bundle-level contrastive loss to the above Transformer model.
For the contrastive learning, we search $\lambda_1$ and $\lambda_2$ in the range {0.1, 0.2, 0.5, 1, 2} and $\tau$ in the range {0.1, 0.2, 0.5, 1, 2, 5}. Besides, in the augmentation step we randomly drop features and modalities with a ratio in {0, 0.1, 0.2, 0.5} and add noise with a weight in {0.01, 0.02, 0.05, 0.1}. We search the numbers of propagation and self-attention layers from {1, 2, 3}. For the baselines, we follow the designs in their articles to achieve their best performance, while keeping the shared settings identical to ensure a fair comparison.

Overall Performance Comparison (RQ1)
Table 2 shows the overall performance comparison between our model CLHE and the baseline methods. We have the following observations. First, our method beats all baselines on all datasets. Second, the performance on Spotify_sparse is lower than that on Spotify due to the sparser bundle-item affiliation data, justifying our hypothesis that a large-scale, high-quality bundle dataset is vital to bundle construction. Finally, we make the interesting observation that the performance improvement on the four datasets is negatively correlated with "#Avr.B/I" in Table 1; in other words, in scenarios where items are included in fewer bundles (i.e., the dataset includes more sparse items), our method performs even better. This phenomenon further justifies the advantage of our method in countering the issue of sparse items.

Ablation Study of Key Modules (RQ2)
To further evaluate the effectiveness of the key modules of our model, we conduct a series of ablation studies, with results shown in Table 3. First and foremost, to justify the effectiveness of the user feedback features, we remove them from our model (i.e., remove $p_i$ from $f_i$), yielding the ablated version w/o user feedback. According to Table 3, removing the user feedback features clearly reduces performance, verifying that user feedback is significant for bundle construction. Second, to evaluate whether each component of the hierarchical encoder is useful, we progressively remove the two self-attention modules and replace them with vanilla average pooling, yielding three ablated models, i.e., w/o item, w/o bundle, and w/o both. The results in Table 3 show that removing either self-attention module causes a performance drop, further verifying the efficacy of our self-attention-based hierarchical encoder. Third, to justify the contribution of contrastive learning, we progressively remove the two levels of contrastive loss, again generating three ablations, i.e., w/o item, w/o bundle, and w/o both. Table 3 shows that both contrastive losses are helpful, especially on the sparser versions of the datasets.

Model Study (RQ3)
To explicate more details and various properties of our method, we further conduct a list of model studies.
3.4.1 Cold-start Items. One of the main challenges for bundle construction is cold-start items that have never been included in previous bundles. It is difficult to evaluate the methods solely on cold-start items, since there are few testing bundles in which both the input and result partial bundles consist purely of cold-start items. Nevertheless, we devise an alternative way to indirectly test how these methods perform on cold-start items. Specifically, we remove all cold-start items and keep only the warm items in the testing set, i.e., the warm setting, and test our method and all baseline models in this setting. The results in Table 4 illustrate that: 1) the performance of all models in the warm setting is much better than in the warm-cold hybrid setting (the normal setting in Table 2), showing that the existence of cold-start items significantly deteriorates performance; and 2) the performance gap between CLHE and the strongest baseline in the hybrid setting is clearly larger than in the warm setting, implying our method's strength in dealing with cold-start items.

Sparsity and Noise in Bundle.
Another merit of our approach is that the contrastive learning counters the sparsity and noise issues when the input partial bundle is incomplete. To elicit this property, we corrupt the testing dataset to make the input partial bundles sparse and noisy. Specifically, we randomly remove a certain portion of items from the input partial bundle to make it sparser.
To make the partial bundle noisier, we randomly sample items from the whole item set and add them to the bundle. We then test our model against the version without either level of contrastive loss; the performance curves are shown in Figure 3, where the x-axis is the ratio of the bundle size after corruption to the original size, and ratio=1 corresponds to the original clean data. From this figure, we conclude that: 1) as the degree of sparsity and noise increases, the performance of both our method and the baselines drops; 2) our method still outperforms the baselines even under quite significant sparsity or noise, such as removing 50% of the seed items or adding 50% more noisy items; and 3) the contrastive losses in our model combat the sparse and noisy bundle issue to some extent.
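The two corruption protocols described here can be sketched as follows (our own helper names; ratios are relative to the original bundle size):

```python
import random

def sparsify(bundle, keep_ratio, rng):
    """Keep a random keep_ratio portion of the seed items (at least one)."""
    k = max(1, round(keep_ratio * len(bundle)))
    return rng.sample(bundle, k)

def add_noise(bundle, noise_ratio, all_items, rng):
    """Append random non-member items; noise_ratio is relative to bundle size."""
    outside = [i for i in all_items if i not in bundle]
    k = min(len(outside), round(noise_ratio * len(bundle)))
    return bundle + rng.sample(outside, k)
```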

Data Augmentations.
Data augmentation is the crux of contrastive learning. We search over multiple data augmentation strategies for both item- and bundle-level contrastive learning to find the best-performing setting. In Table 5, we present the performance of CLHE under various data augmentations at the item and bundle levels. Generally speaking, the choice of data augmentation method affects performance, and proper selection is important for good results.

As for computational cost, the training-time records in Figure 4 show that, on the one hand, our method is computationally heavy, since it takes the longest time per training epoch; on the other hand, it takes the least training time to reach convergence on three of the datasets. In conclusion, our method is effective and efficient during training, while the inherent complexity of hierarchical self-attention may slow inference. We note that various self-attention acceleration approaches could be applied in practice, which is out of the scope of this work.
3.4.5 Case Study. To further illustrate how the hierarchical encoder learns the associations among multimodal features and among multiple items' representations, we examine some cases. Specifically, for both the item- and bundle-level self-attention modules, we take the last layer's output representation of each feature (item) and compute its cosine similarity with the whole item (bundle) representation. We cherry-pick some example items and bundles, as shown in Figure 5. The feature-item similarity results show that the three types of features play distinctive roles in different items, underscoring the importance of all three. The item-bundle similarity results show that items do not contribute equally to their affiliated bundles, so it is crucial to model the bundle composition patterns. Here we only give intuitive hints about bundle representation learning; more sophisticated analysis, such as feature-pair or item-pair co-effects, is left for future work.

Hyper-parameter Analysis.
We also present the model's performance change w.r.t. the key hyper-parameters, i.e., the temperature $\tau$ in the contrastive loss and the weights $\lambda_1$, $\lambda_2$ of the two contrastive losses. The curves in Figure 6 reveal that the model is sensitive to these hyper-parameters, and proper tuning is required to achieve optimal performance.

RELATED WORK
We review the literature about bundles, including: 1) bundle recommendation and construction, and 2) bundle representation learning.

Bundle Recommendation and Construction
Product bundling is a mature marketing strategy that has been applied in various application scenarios, including fashion outfits [11,29], e-commerce [42], online music playlists [4,7], online games [14], travel packages [32], and meals [28]. Personalized bundle recommendation [5,9,37] is the pioneering line of work that first focused on bundle-oriented problems in the data science community. Soon after, researchers realized that merely picking from predefined bundles cannot satisfy people's diverse and personalized needs. Thereby, the task of personalized bundle generation [1,6,13,17,24,44] was naturally proposed, where the model aims to automatically generate a bundle from a large set of items catering to a given user. It has to simultaneously handle both users' personalization and item-item compatibility patterns, where the user-item interactions are specifically utilized for personalization modeling. In this paper, we focus on bundle construction, which aims to generate more bundles to enrich the bundle catalog of the platform. In addition, most bundle-oriented research in the general domain still follows the id-based paradigm; very few domains, such as fashion, have explored multimodality. We extend multimodal learning to one more domain, music playlists. Moreover, we also leverage user feedback for multimodal bundle construction.

Bundle Representation Learning
Bundle representation learning is the crux of all bundle-oriented problems. Initial studies [39] treat a bundle as a special type of item and simply use the bundle id to represent it. Naturally, later work considers the items encapsulated within a bundle to generate more detailed representations. The simplest method performs average pooling over the included items [51]. Later on, sequential models, such as Bi-LSTM [21], were utilized to capture the relations between consecutive items. However, the items within a bundle are essentially unordered, and sequential models cannot capture all pairwise correlations well. To address this limitation, attention models [9,24,33], Transformers [3,30,35,40,43,46], and graph neural networks (GNNs) [5,16,38,54,55] have been leveraged to model not only every pair of items within a bundle, but also higher-order relations by stacking multiple layers.
Even though much effort has been devoted to item correlation learning for good bundle representations, multimodal information has been less explored. Multimodal information, such as the textual, visual, or knowledge graph information of items, has been demonstrated to be effective in general recommendation [45,47,48]. In the fashion domain, visual and textual features have been extensively investigated for pairwise mix-and-match [20,52] and outfit compatibility modeling [12,41]. However, these works have not been extended to other domains, such as music playlists, where the audio modality has rarely been studied for bundle recommendation or construction. More importantly, we argue that user-item interaction information, which is widely utilized in personalized recommendation, can serve as an additional modality in bundle construction. Sun et al. [42] leverage a pre-trained CF model to obtain item representations that enhance the bundle completion task, but they do not fully justify the rationale and motivation. To the best of our knowledge, none of the previous works puts together user-item interactions, bundle-item affiliations, and item content information for bundle construction.

CONCLUSION AND FUTURE WORK
In this work, we systematically study the problem of bundle construction and define a more comprehensive formulation by considering all three types of data, i.e., multimodal features, item-level user feedback, and existing bundles. Based on this formulation, we highlight two challenges: 1) how to learn expressive bundle representations given multiple features; and 2) how to counter the modality missing, noise, and sparsity problems. To tackle these challenges, we propose a novel Contrastive Learning-enhanced Hierarchical Encoder (CLHE) for bundle construction. Our method beats a list of leading methods on four datasets from two application domains. Extensive ablation and model studies justify the effectiveness of the key modules.
Despite the strong performance achieved by this work, there is still large space to be explored for bundle construction. First, the current evaluation setting is somewhat rigid and inflexible; it would be interesting to extend it to more flexible settings that better align with real applications, e.g., asking the model to construct a bundle from an arbitrary number of seed items. Second, some of the feature extractors, i.e., the multimodal feature extraction and user-item interaction models, are pre-trained and fixed. Is it possible to optimize these feature extractors in an end-to-end fashion so that the extracted features are better aligned with the bundle construction task? Finally, this work only targets non-personalized bundle construction; pushing it forward to personalized bundle construction is a natural and interesting direction.

Figure 1: The motivations of leveraging multimodal features and item-level user feedback for bundle construction.

Figure 3: Performance analysis with varying rates of sparsity and noise in the partial bundle.

3.4.4 Computational Complexity. Self-attention attends over every pair of instances in a set, i.e., the features of an item or the items of a bundle, so it usually suffers from high computational complexity. We record the time used for every training epoch and the time used from the beginning of training until convergence; the records of our method and two baselines, i.e., MultiDAE and Transformer, are illustrated in Figure 4. The bar chart reveals that, on the one hand, [...]

[Figure 5 panels: (a) Feature-Item Similarity, where the circles correspond to the multimodal, ID, and user feedback features, respectively, and each row is an item; (b) Item-Bundle Similarity, where each circle corresponds to one item and each row corresponds to one query partial bundle.]
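To illustrate where the quadratic cost comes from, the following is a minimal NumPy sketch of single-head self-attention over a set of n embeddings (the features of one item, or the items of one bundle): the n x n pairwise score matrix is what drives the O(n^2) complexity. All names and dimensions here are illustrative, not the paper's implementation.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a set of n vectors.

    X: (n, d) stacked instance embeddings. The (n, n) score matrix
    below is the source of the quadratic cost in the set size n.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V                              # (n, d) contextualized set

rng = np.random.default_rng(0)
n, d = 8, 16                                        # 8 instances, 16-dim each
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (8, 16): one contextualized vector per instance
```

Because the score matrix grows as n^2, runtime per layer scales quadratically with bundle size, which is the trade-off Figure 4 probes against MultiDAE and Transformer.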

Figure 5: Illustration of similarity a) between each feature and the whole item representation; and b) between each item and the whole bundle representation. The size of each circle is positively correlated with its corresponding cosine similarity.
• We highlight multiple technical challenges of this new formulation and propose a novel method, CLHE, to tackle them.
• Our method outperforms various leading methods on four datasets from two application domains under different settings, and further diverse studies demonstrate various merits of our method.

[Figure 2: Overview of CLHE. Multimodal feature extraction obtains content features (via BLIP and CLAP) and user feedback features (via LightGCN), alongside ID embeddings. Item representation learning fuses the modalities with self-attention and applies data augmentation (NA, FD, FN) with an item-level contrastive loss. Bundle representation learning encodes the input partial bundle with self-attention, applies item-drop and item-replace augmentations with a bundle-level contrastive loss, and scores candidate items via matrix multiplication under a negative log-likelihood loss.]
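The two-level structure in the overview can be sketched as follows: a set encoder first fuses each item's modality features into an item representation, the same encoder then fuses item representations into a bundle representation, and an InfoNCE-style contrastive loss pulls two augmented views of a bundle together. This is a toy sketch under assumed simplifications (a shared weight matrix, mean pooling after attention, item-drop as the only augmentation), not the paper's exact architecture.

```python
import numpy as np

def attention_pool(X, W):
    """Toy set encoder: self-attention scores, then mean pooling.
    X: (n, d) set of vectors; returns a single (d,) set representation."""
    scores = X @ W @ X.T / np.sqrt(X.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return (w @ X).mean(axis=0)

def info_nce(anchor, positive, negatives, tau=0.2):
    """InfoNCE-style loss: pull two views together, push negatives away."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, v) / tau) for v in negatives)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(1)
d, n_feat, n_items = 16, 3, 5
W = rng.normal(size=(d, d))
# Level 1: fuse each item's modality features into an item representation.
items = np.stack([attention_pool(rng.normal(size=(n_feat, d)), W)
                  for _ in range(n_items)])
# Level 2: fuse item representations into a bundle representation.
bundle = attention_pool(items, W)
# Bundle-level contrast: two item-drop views of the same bundle.
view_a = attention_pool(items[:-1], W)   # drop the last item
view_b = attention_pool(items[1:], W)    # drop the first item
others = [attention_pool(rng.normal(size=(n_items, d)), W) for _ in range(4)]
loss = info_nce(view_a, view_b, others)
print(bundle.shape, float(loss) > 0)
```

The same contrastive recipe applies one level down, contrasting augmented feature sets of a single item, which is how the method counters modality missing and noise.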
U = {u_1, u_2, ..., u_M} is the user set. We define a bundle as a set of items, denoted as b = {x_1, x_2, ..., x_n}, where n = |b| is the size of the bundle. Given a partial bundle b̃ ⊂ b (i.e., a set of seed items), where |b̃| < |b|, the bundle construction model targets at predicting the missing items x ∈ b \ b̃. We have a set of known bundles for training, denoted as B = {b_1, b_2, ..., b_N}, and a set of unseen bundles for testing, denoted as B̄ = {b_{N+1}, b_{N+2}, ..., b_{N+Ō}}, where N is the number of training bundles and Ō is the number of testing bundles. We would like to train a model on the training set B such that, for an unseen bundle b ∈ B̄, when given a few seed items b̃, aka. the partial bundle, the model can predict the missing items b \ b̃ and thus construct the entire bundle.
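The prediction step of this formulation can be sketched as scoring every candidate item against a representation of the partial bundle and ranking the best-scoring missing items. The mean-pooled query and dot-product scorer below are illustrative assumptions standing in for the learned encoder.

```python
import numpy as np

def complete_bundle(partial_ids, item_embs, k=2):
    """Given seed item indices (the partial bundle), return the
    top-k candidate items to complete the bundle.
    Dot-product scoring against a mean-pooled query, for illustration."""
    query = item_embs[partial_ids].mean(axis=0)   # partial-bundle representation
    scores = item_embs @ query                    # one score per candidate item
    scores[partial_ids] = -np.inf                 # never re-predict seed items
    return np.argsort(-scores)[:k]                # highest-scoring items first

# Tiny hand-built catalog: item 1 is deliberately closest to item 0.
item_embs = np.array([
    [1.0, 0.0],     # item 0 (the seed)
    [0.9, 0.1],     # item 1: most similar to the seed
    [0.0, 1.0],     # item 2: orthogonal
    [-1.0, 0.0],    # item 3: opposite
])
picked = complete_bundle([0], item_embs, k=1)
print(picked)  # prints [1]: the item most compatible with the seed
```

At evaluation time, the ranked list is compared against the held-out items b \ b̃ of each testing bundle, which is what metrics like NDCG@20 in the tables below measure.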

Table 1: The statistics of the four datasets on two different domains.

Table 2: The overall performance of our CLHE and baselines. "%Improv." denotes the relative improvement over the strongest baseline. The best baselines are underlined.

Table 3: Ablation study of the hierarchical encoder and contrastive learning (the performance is NDCG@20).

Transformer and Transformer-CL achieve the best performance, showing that the self-attention mechanism and contrastive learning can well preserve the correlations among items within the bundle, thus yielding good bundle representations. Third, comparing the results across different versions of the dataset, we find that: 1) the performance on POG_dense is much better than that on POG due to denser user-item interactions, demonstrating that user feedback information is quite helpful to the performance;

Table 4: The overall performance (NDCG@20) of our CLHE and baselines in the warm setting. "%Improv." denotes the relative improvement over the strongest baseline.