Bootstrap Latent Representations for Multi-modal Recommendation

This paper studies the multi-modal recommendation problem, where the item multi-modality information (e.g., images and textual descriptions) is exploited to improve the recommendation accuracy. Besides the user-item interaction graph, existing state-of-the-art methods usually use auxiliary graphs (e.g., user-user or item-item relation graphs) to augment the learned representations of users and/or items. These representations are often propagated and aggregated on auxiliary graphs using graph convolutional networks, which can be prohibitively expensive in computation and memory, especially for large graphs. Moreover, existing multi-modal recommendation methods usually leverage randomly sampled negative examples in the Bayesian Personalized Ranking (BPR) loss to guide the learning of user/item representations, which increases the computational cost on large graphs and may also bring noisy supervision signals into the training process. To tackle the above issues, we propose a novel self-supervised multi-modal recommendation model, dubbed BM3, which requires neither augmentations from auxiliary graphs nor negative samples. Specifically, BM3 first bootstraps latent contrastive views from the representations of users and items with a simple dropout augmentation. It then jointly optimizes three multi-modal objectives to learn the representations of users and items by reconstructing the user-item interaction graph and aligning modality features under both inter- and intra-modality perspectives. BM3 alleviates both the need for contrasting with negative examples and the complex graph augmentation from an additional target network for contrastive view generation. We show that BM3 outperforms prior recommendation models on three datasets with the number of nodes ranging from 20K to 200K, while achieving a 2-9× reduction in training time. Our code is available at: https://github.com/enoche/BM3.


INTRODUCTION
In fast-growing e-commerce businesses, recommender systems play a critical role in helping users discover products or services they may like among millions of offerings. In practice, deep learning techniques have been widely applied in recommendation systems, mainly for exploiting historical user-item interactions, to model users' preferences on items and produce item recommendations to users [41]. However, the rich multi-modal content information (e.g., texts, images, and videos) of items has still not been fully explored.
To improve the recommendation accuracy, recent work on multi-modal recommendation has studied effective means to integrate item multi-modal information into the traditional user-item recommendation paradigm. For example, some methods concatenate multi-modal features with the latent representations of items [9] or leverage attention mechanisms [4, 17] to capture users' preferences on items' multi-modal features. With a surge of research on graph-based recommendations [32, 37, 47], another line of research uses Graph Neural Networks (GNNs) to exploit item multi-modal information and enhance the learning of user and item representations [31, 33, 34]. For instance, [34] uses graph convolutional networks to separately propagate and aggregate different item multi-modal information on the user-item interaction graph. To further improve recommendation performance, other auxiliary graph structures, e.g., the user-user relation graph [31] and the item-item relation graph [39], have also been exploited to enhance the learning of user and item representations from the multi-modal information.
Although existing GNN-based multi-modal methods [31, 33, 39, 44, 45] can achieve state-of-the-art recommendation accuracy, the following issues may hinder their application in scenarios involving large-scale graphs. First, they often learn the user and item representations based on pair-wise ranking losses, e.g., the Bayesian Personalized Ranking (BPR) loss [24], which treat observed user-item interaction pairs as positive samples and randomly sampled user-item pairs as negative samples. Such a negative sampling strategy may incur a prohibitive cost on large graphs [30] and bring noisy supervision signals into the training process. For example, previous research [48] has verified that the default uniform sampling [42] in LightGCN [10] consumes more than 25% of the training time per epoch. Second, methods utilizing auxiliary graph structures may incur a prohibitive memory cost when building and/or training on large-scale auxiliary graphs. More analyses on the computational complexity of existing graph-based multi-modal methods can be found in Table 1 and Table 4.
Self-Supervised Learning (SSL) [5, 7] provides a possible solution for learning the representations of users and items without negative samples. Research in various domains, ranging from Computer Vision (CV) to Natural Language Processing (NLP), has shown that SSL can achieve competitive or even better results than supervised learning [3, 7, 38]. The main idea of SSL is to maximize the similarity of representations obtained from different distorted versions of a sample using two asymmetric networks, i.e., the online network and the target network. However, training with only positive samples will lead the model into a trivial constant solution [5].
To tackle this collapsing problem, BYOL [7] and SimSiam [5] introduce an additional "predictor" network to the online network and a special "stop gradient" operation on the target network. Recently, BUIR [15] transfers BYOL into the recommendation domain and shows competitive performance on the evaluation datasets.
In this paper, we propose a Bootstrapped Multi-Modal Model, dubbed BM3, for multi-modal recommendation. It first simplifies the current SSL framework by removing the target network, which halves the number of model parameters. Moreover, to retain the similarity between different augmentations, BM3 incorporates a simple dropout mechanism to perturb the latent embeddings generated by the online network. This differs from the current SSL paradigm, which perturbs the inputs via graph augmentation [7, 15] or image augmentation [26]. The design eases both the memory and computational cost of conventional graph augmentation techniques [15, 35], as it does not introduce any auxiliary graphs. Last but not least, we design a loss function that is specialized for multi-modal recommendation. It minimizes the reconstruction loss of the user-item interaction graph and aligns the learned features under both inter- and intra-modality perspectives.
We summarize our main contributions as follows. First, we propose BM3, a novel self-supervised learning method for multi-modal recommendation. In BM3, we use a simple latent representation dropout mechanism instead of graph augmentation to generate the target view of a user or an item for contrastive learning without negative samples. Second, to train BM3 without negative samples, we design a Multi-Modal Contrastive Loss (MMCL) function that jointly optimizes three objectives. In addition to minimizing the classic user-item interaction graph reconstruction loss, MMCL further aligns the learned features between different modalities and reduces the dissimilarity between representations of different augmented views within a specific modality. Finally, we validate the effectiveness and efficiency of BM3 on three datasets with the number of nodes ranging from 20K to 200K. The experimental results show that BM3 achieves significant improvements over the state-of-the-art multi-modal recommendation methods, while training 2-9× faster than the baseline methods.

RELATED WORK

2.1 Multi-modal Recommendation
2.1.1 Deep Learning-based Models. Due to the success of the Collaborative Filtering (CF) method, most early multi-modal recommendation models utilize deep learning techniques to explore users' preferences on top of the CF paradigm. For example, VBPR [9], which builds on top of the BPR method, leverages the visual features of items. It utilizes a pre-trained convolutional neural network to obtain the visual features of items and linearly transforms them into a latent visual space. To make predictions, VBPR represents an item by concatenating the latent visual features with its ID embedding. Moreover, DeepStyle [19] augments the representations of items with both visual and style features within the BPR framework. To capture users' preferences on multi-modal information, the attention mechanism has also been adopted in recommendation models. For instance, VECF [4] utilizes the VGG model [27] to perform pre-segmentation on images and captures the user's attention on different image regions. MAML [17] uses a two-layer neural network to capture the user's preference on the textual and visual features of an item.

2.1.2 Graph-based Multi-modal Models. More recently, another line of research introduces GNNs into recommendation systems [37], which can greatly enhance the user and item representations by incorporating the structural information in the user-item interaction graph and auxiliary graphs.
To exploit the item multi-modal information, MMGCN [34] adopts the message-passing mechanism of Graph Convolutional Networks (GCNs) and constructs a modality-specific user-item bipartite graph, which can capture the information from multi-hop neighbors to enhance the user and item representations. Based on MMGCN, DualGNN [31] introduces a user co-occurrence graph with a modality preference learning module to capture the user's preference for features from different modalities of an item. As the user-item graph may encompass unintentional interactions, GRCN [33] introduces a graph refining layer to refine the structure of the user-item interaction graph by identifying the noisy edges and corrupting the false-positive edges. To explicitly mine the semantic information between items, LATTICE [39] constructs an item-item relation graph for each modality and fuses them together to obtain a latent item graph. It dynamically updates the graph after items' information is propagated and aggregated from their highly connected affinities using GCNs. FREEDOM [45] further finds that the learning of the item-item graph contributes little and freezes the graph for effective and efficient recommendation. [43] provides a comprehensive survey of multi-modal recommender systems with a taxonomy, evaluation protocols, and future directions.
Although graph-based multi-modal models achieve new state-of-the-art recommendation accuracy, they often require auxiliary graphs for user and item augmentations as well as a large number of negative samples for representation learning with the BPR loss. Both requirements can lead to high computational complexity and prohibitive memory cost as the graph size increases, limiting the efficiency of these models in scenarios involving large-scale graphs.

2.2 Self-supervised Learning (SSL)
SSL-based methods have achieved competitive results in various CV and NLP tasks [12, 20]. As our model is based on SSL that only uses observed data, our review of SSL methods focuses on those that do not require negative sampling.
Current SSL frameworks are derived from Siamese networks [1], which are generic models for comparing entities [5]. BYOL [7] and SimSiam [5] use asymmetric Siamese networks to achieve remarkable results. Specifically, BYOL proposes two coupled encoders (i.e., the online encoder and the target encoder) that are optimized and updated iteratively. The online encoder is optimized towards the target encoder, while the target encoder is a momentum encoder: its parameters are updated as an exponential moving average of the online encoder's. BYOL uses both a predictor on the online encoder and a "stop gradient" operator on the target encoder to avoid network collapse. SimSiam verifies that the "stop gradient" operator is crucial for preventing collapse; however, it shares the parameters between the online and target encoders. In contrast, Barlow Twins [38] uses a symmetric architecture with an innovative objective function that pushes the cross-correlation matrix computed from two contrastive representations as close to the identity matrix as possible.
Derived from BYOL, the recently proposed self-supervised framework BUIR [15] learns the representations of users and items solely from positive interactions. It introduces different views and leverages a slow-moving average network to update the parameters of the target encoder from the online encoder. The same inputs are fed into different but related encoders to generate the contrastive views.
Despite the boom of SSL in CV and NLP, whether and how multi-modal features can enhance the representations of users and items under the SSL paradigm in recommendation remains unexplored. In this paper, we propose a simplified yet highly efficient SSL model for multi-modal recommendation. It achieves outstanding accuracy while also alleviating the computational complexity and memory cost on large graphs.

BOOTSTRAPPED MULTI-MODAL MODEL
In this section, we elaborate on our bootstrapped multi-modal model, which encompasses three components as illustrated in Fig. 1: a) multi-modal latent space convertor, b) contrastive view generator, and c) multi-modal contrastive loss.

Multi-modal Latent Space Convertor
Let $\mathbf{e}_u, \mathbf{e}_i \in \mathbb{R}^d$ denote the input ID embeddings of user $u \in \mathcal{U}$ and item $i \in \mathcal{I}$, where $d$ is the embedding dimension, and $\mathcal{U}$, $\mathcal{I}$ are the sets of users and items, respectively. Their cardinalities are denoted by $|\mathcal{U}|$ and $|\mathcal{I}|$, respectively. We denote the modality-specific features obtained from the pre-trained model as $\mathbf{e}_m \in \mathbb{R}^{d_m}$, where $m \in \mathcal{M}$ denotes a specific modality from the full set of modalities $\mathcal{M}$, and $d_m$ denotes the dimension of the features. The cardinality of $\mathcal{M}$ is denoted by $|\mathcal{M}|$. In this paper, we consider two modalities: vision $v$ and text $t$. However, the model can be easily extended to scenarios with more than two modalities. As the multi-modal feature spaces differ from each other, we first convert the multi-modal features and ID embeddings into the same latent space.

Multi-modal Features.
The features of an item obtained from different modalities are of different dimensions and lie in different feature spaces. For a multi-modal feature vector $\mathbf{e}_m$, we first project it into a low-dimensional latent space using a projection function $f_m$ based on a Multi-Layer Perceptron (MLP). Then, we have:
$$\mathbf{h}_m = f_m(\mathbf{e}_m) = \mathbf{W}_m \mathbf{e}_m + \mathbf{b}_m,$$
where $\mathbf{W}_m \in \mathbb{R}^{d \times d_m}$ and $\mathbf{b}_m \in \mathbb{R}^d$ denote the linear transformation matrix and bias in the MLP of $f_m$. In this way, each uni-modal latent representation $\mathbf{h}_m$ shares the same space as the ID embeddings.
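As a concrete illustration, the projection can be sketched as below. The feature dimensions follow the experimental setup (4,096-d visual, 384-d textual); the random inputs and the single linear layer are purely illustrative stand-ins for pre-trained features and the learned MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64                 # shared latent dimension
d_v, d_t = 4096, 384   # raw visual / textual feature dimensions

# Hypothetical raw modality features for one item.
e_v = rng.standard_normal(d_v)
e_t = rng.standard_normal(d_t)

def make_projector(d_in, d_out, rng):
    """One-layer MLP f_m: a linear map W_m x + b_m into the shared latent space."""
    W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
    b = np.zeros(d_out)
    return lambda x: W @ x + b

f_v = make_projector(d_v, d, rng)
f_t = make_projector(d_t, d, rng)

h_v = f_v(e_v)   # uni-modal latent representations, now comparable
h_t = f_t(e_t)   # with the d-dimensional ID embeddings
assert h_v.shape == h_t.shape == (d,)
```

After projection, both modalities live in the same $d$-dimensional space as the ID embeddings, so they can be compared with cosine similarities in the losses that follow.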

ID Embeddings.
Previous work [39] has verified the crucial role of ID embeddings in multi-modal recommendation. Although the ID embeddings of users and items can be directly initialized within the latent space, they do not encode any structural information about the user-item interaction graph. Inspired by the recent success of applying GCNs to recommendation, we use a backbone network of LightGCN [10] with a residual connection to encode the structure of the user-item interaction graph. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a given graph with node set $\mathcal{V} = \mathcal{U} \cup \mathcal{I}$ and edge set $\mathcal{E}$. The number of nodes is denoted by $|\mathcal{V}|$, and the number of edges is denoted by $|\mathcal{E}|$. The adjacency matrix is denoted by $\mathbf{A} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$, and the diagonal degree matrix is denoted by $\mathbf{D}$. In $\mathcal{G}$, the edges describe observed user-item interactions: if a user has interacted with an item, we build an edge between the user node and the item node. Moreover, we use $\mathbf{H}^l \in \mathbb{R}^{|\mathcal{V}| \times d}$ to denote the ID embeddings at the $l$-th layer, obtained by stacking all the embeddings of users and items at layer $l$. Specifically, the initial ID embedding matrix $\mathbf{H}^0$ is a collection of the embeddings $\mathbf{e}_u$ and $\mathbf{e}_i$ from all users and items. A typical feed-forward GCN [14] calculates the hidden ID embeddings $\mathbf{H}^{l+1}$ at layer $l+1$ recursively as:
$$\mathbf{H}^{l+1} = \sigma\big(\hat{\mathbf{A}} \mathbf{H}^{l} \mathbf{W}^{l}\big),$$
where $\sigma(\cdot)$ is a non-linear function, e.g., the ReLU function, $\hat{\mathbf{A}} = \mathbf{D}^{-1/2}(\mathbf{A} + \mathbf{I})\mathbf{D}^{-1/2}$ is the re-normalization of the adjacency matrix $\mathbf{A}$, and $\mathbf{D}$ here is the diagonal degree matrix of $\mathbf{A} + \mathbf{I}$. For node classification, the last layer of a GCN is used to predict the label of a node via a softmax classifier. On top of the vanilla GCN, LightGCN simplifies the structure by removing the feature transformation $\mathbf{W}^l$ and the non-linear activation $\sigma(\cdot)$ for recommendation, as these two components were found to impose adverse effects on recommendation performance. The simplified graph convolutional layer in LightGCN is defined as:

$$\mathbf{H}^{l+1} = \tilde{\mathbf{A}} \mathbf{H}^{l}, \qquad \tilde{\mathbf{A}} = \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2},$$
where the node embeddings of the $(l+1)$-th hidden layer are only linearly aggregated from the $l$-th layer with the transition matrix $\mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$, which is exactly the normalized adjacency matrix mentioned above. We use a readout function to aggregate the representations of all hidden layers into the final user and item representations. However, GCNs may suffer from the over-smoothing problem [2, 16, 18]. Following LATTICE [39], we therefore add a residual connection [2, 14] to the item initial embeddings $\mathbf{H}^0_i$ to obtain the final representations of items. That is:
$$\mathbf{H}_i = \mathrm{READOUT}\big(\mathbf{H}^0_i, \mathbf{H}^1_i, \ldots, \mathbf{H}^L_i\big) + \mathbf{H}^0_i,$$
where the READOUT function can be any differentiable function. We use the default mean function of LightGCN for the final ID embedding update.
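To make the propagation concrete, the following sketch runs the simplified LightGCN layer and the mean readout with the item-side residual connection on a hypothetical toy bipartite graph with random initial embeddings; it illustrates only the linear algebra, not the trained model.

```python
import numpy as np

# Toy bipartite graph: 3 users, 2 items -> |V| = 5 nodes, latent dim d = 4.
n_users, n_items, d = 3, 2, 4
edges = [(0, 0), (1, 0), (1, 1), (2, 1)]   # observed (user, item) interactions

A = np.zeros((n_users + n_items, n_users + n_items))
for u, i in edges:
    A[u, n_users + i] = A[n_users + i, u] = 1.0

# Symmetric normalization D^{-1/2} A D^{-1/2} (no self-loops in LightGCN).
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
A_tilde = D_inv_sqrt @ A @ D_inv_sqrt

rng = np.random.default_rng(0)
H = [rng.standard_normal((n_users + n_items, d))]   # H^0: initial ID embeddings
for _ in range(2):                                   # L = 2 purely linear layers
    H.append(A_tilde @ H[-1])                        # H^{l+1} = A_tilde H^l

H_final = np.mean(H, axis=0)            # mean readout over layers, as in LightGCN
H_final[n_users:] += H[0][n_users:]     # residual connection to item initial embeddings

assert H_final.shape == (n_users + n_items, d)
```

In practice $\tilde{\mathbf{A}}$ is stored as a sparse matrix, so each layer costs time proportional to the number of edges rather than $|\mathcal{V}|^2$.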
With the multi-modal latent space convertor, we obtain three types of latent embeddings: user ID embeddings, item ID embeddings, and uni-modal item embeddings. In the following section, we illustrate the design of the losses in BM3 for efficient parameter optimization without negative samples.

Multi-modal Contrastive Loss
Previous studies on SSL use the stop-gradient strategy to prevent the model from collapsing into a trivial constant solution [5, 7]. Besides, they use online and target networks so that the model parameters are learned in a teacher-student manner [29]. BM3 simplifies the current SSL paradigm [5, 7] by postponing the data augmentation until after the encoding of the online network. We first illustrate the data augmentation in BM3.

3.2.1 Contrastive View Generator.
Prior studies [15, 30, 36] use graph augmentations to generate two alternate views of the original graph for self-supervised learning; the input features are encoded through both graphs to generate the contrastive views. To reduce the computational complexity and the memory cost, BM3 removes the requirement of graph augmentations with a simple latent embedding dropout technique that is analogous to node dropout [28]. The contrastive latent embedding $\tilde{\mathbf{h}}$ of $\mathbf{h}$ under a dropout ratio $p$ is calculated as:
$$\tilde{\mathbf{h}} = \mathrm{Dropout}(\mathbf{h}, p). \qquad (5)$$
Following [5, 7], we also place a stop-gradient on the contrastive view $\tilde{\mathbf{h}}$, whilst we feed the original embedding $\mathbf{h}$ into an MLP predictor $q(\cdot)$.

3.2.2 Graph Reconstruction Loss. Given a user-item pair $(u, i)$, we measure how well the predicted online view of one side matches the perturbed target view of the other side with the negative cosine similarity
$$C(\mathbf{h}_1, \mathbf{h}_2) = -\frac{\mathbf{h}_1^{\top} \mathbf{h}_2}{\|\mathbf{h}_1\|_2 \, \|\mathbf{h}_2\|_2}, \qquad (6)$$
and define the graph reconstruction loss as:
$$\mathcal{L}_{rec} = \frac{1}{2} C\big(q(\mathbf{h}_u), \tilde{\mathbf{h}}_i\big) + \frac{1}{2} C\big(q(\mathbf{h}_i), \tilde{\mathbf{h}}_u\big), \qquad (7)$$
where $\|\cdot\|_2$ is the $\ell_2$-norm. The total loss is averaged over all observed user-item pairs. The intuition behind this is that we intend to maximize the prediction of the positively perturbed item $i$ given a user $u$, and vice versa. The minimum possible value of this loss is $-1$.
Finally, we stop the gradient on the target network and force the backpropagation of the loss over the online network only. We follow the stop-gradient ($\mathrm{sg}$) operator as in [5, 7], and implement it by updating Eq. (7) as:
$$\mathcal{L}_{rec} = \frac{1}{2} C\big(q(\mathbf{h}_u), \mathrm{sg}(\tilde{\mathbf{h}}_i)\big) + \frac{1}{2} C\big(q(\mathbf{h}_i), \mathrm{sg}(\tilde{\mathbf{h}}_u)\big),$$
where $C(\cdot, \cdot)$ denotes the negative cosine similarity and $q(\cdot)$ the MLP predictor. With the stop-gradient operator, the target network receives no gradient from $(\tilde{\mathbf{h}}_u, \tilde{\mathbf{h}}_i)$.
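A minimal NumPy sketch of the view generation and the symmetric reconstruction loss follows. The identity "predictor" is a placeholder for the learned MLP $q(\cdot)$, and the stop-gradient has no NumPy analogue: in PyTorch the target views would simply be wrapped in `.detach()` so no gradient flows through them.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_view(h, p, rng):
    """Perturb a latent embedding by inverted dropout, yielding the target view."""
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

def neg_cos(a, b):
    """Negative cosine similarity C(a, b); its minimum possible value is -1."""
    return -float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

d = 8
h_u, h_i = rng.standard_normal(d), rng.standard_normal(d)
W_pred = np.eye(d)                          # identity stand-in for the MLP predictor q(.)

h_u_t = dropout_view(h_u, p=0.5, rng=rng)   # target views; in PyTorch these would be
h_i_t = dropout_view(h_i, p=0.5, rng=rng)   # .detach()-ed (stop-gradient)

loss = 0.5 * neg_cos(W_pred @ h_u, h_i_t) + 0.5 * neg_cos(W_pred @ h_i, h_u_t)
assert -1.0 <= loss <= 1.0
```

Because the two cosine terms are each bounded by $[-1, 1]$, their average is as well, matching the $-1$ lower bound stated above.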

3.2.3 Inter-modality Feature Alignment Loss. In addition, we further align the multi-modal features of items with their target ID embeddings. The alignment encourages the ID embeddings of items with similar multi-modal features to be close to each other. For each uni-modal latent embedding $\mathbf{h}_m$ of an item $i$, the contrastive view generator outputs its contrastive pair $(\tilde{\mathbf{h}}_m, \tilde{\mathbf{h}}_i)$. We use the negative cosine similarity to perform the alignment between $\mathbf{h}_m$ and $\tilde{\mathbf{h}}_i$:
$$\mathcal{L}_{align} = C\big(\mathbf{h}_m, \mathrm{sg}(\tilde{\mathbf{h}}_i)\big),$$
where $C(\cdot, \cdot)$ denotes the negative cosine similarity and $\mathrm{sg}(\cdot)$ the stop-gradient operator.

3.2.4 Intra-modality Feature Masked Loss. Finally, BM3 uses an intra-modality feature masked loss to further encourage the predictor to learn from sparse representations of the latent embeddings; sparsity has been verified to scale efficiently in large transformers [11, 25]. We randomly mask out a subset of the latent embedding $\mathbf{h}_m$ by dropout with the contrastive view generator and denote the resulting sparse embedding as $\tilde{\mathbf{h}}'_m$. The intra-modality feature masked loss is defined as:
$$\mathcal{L}_{mask} = C\big(\tilde{\mathbf{h}}'_m, \mathrm{sg}(\tilde{\mathbf{h}}_m)\big).$$
Additionally, we add a regularization penalty on the online embeddings (i.e., $\mathbf{h}_u$ and $\mathbf{h}_i$). Our final loss function is:
$$\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{align} + \mathcal{L}_{mask} + \lambda \big(\|\mathbf{h}_u\|^2_2 + \|\mathbf{h}_i\|^2_2\big),$$
where $\lambda$ is the regularization coefficient.
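Putting the pieces together, a hedged NumPy sketch of the full objective on a single user-item pair might look as follows. The predictor is omitted for brevity, the regularization weight 0.01 is an arbitrary choice, and the random embeddings are illustrative, so this shows only the shape of the computation, not the exact training code.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

def neg_cos(a, b):
    """Negative cosine similarity C(a, b)."""
    return -float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def dropout_view(h, p, rng):
    """Dropout perturbation producing a contrastive (target) view."""
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h_u, h_i = rng.standard_normal(d), rng.standard_normal(d)   # online ID embeddings
h_m = rng.standard_normal(d)                                 # one uni-modal item embedding

h_u_t, h_i_t = dropout_view(h_u, 0.5, rng), dropout_view(h_i, 0.5, rng)
h_m_t = dropout_view(h_m, 0.5, rng)          # dropout view of the modality embedding
h_m_sparse = dropout_view(h_m, 0.5, rng)     # masked (sparse) modality embedding

loss_rec   = 0.5 * neg_cos(h_u, h_i_t) + 0.5 * neg_cos(h_i, h_u_t)
loss_align = neg_cos(h_m, h_i_t)             # pull modality feature toward target ID view
loss_mask  = neg_cos(h_m_sparse, h_m_t)      # predictor omitted for brevity
reg        = 0.01 * (h_u @ h_u + h_i @ h_i)  # L2 penalty on the online embeddings

total = loss_rec + loss_align + loss_mask + reg
assert np.isfinite(total)
```

In the real model the target-view arguments would carry a stop-gradient, so only the online embeddings and the predictor receive updates.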

Top-𝐾 Recommendation
To generate item recommendations for a user, we first predict the interaction scores between the user and all candidate items. Then, we rank the candidate items by their predicted interaction scores in descending order, and choose the $K$ top-ranked items as recommendations for the user. Classical CF methods recommend the top-$K$ items by ranking the inner-product scores of a user embedding with all item embeddings; a high score suggests that the user prefers the item.
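Under the all-ranking protocol, this reduces to scoring every non-interacted item by an inner product and keeping the $K$ best, as in this sketch; the toy embeddings and interaction sets are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 3, 6, 4
H_u = rng.standard_normal((n_users, d))        # final user representations
H_i = rng.standard_normal((n_items, d))        # final item representations
train_items = {0: {1, 4}, 1: {0}, 2: {2, 3}}   # items each user already interacted with

def recommend(u, K=2):
    scores = H_i @ H_u[u]                            # inner-product scores vs. all items
    scores[list(train_items.get(u, ()))] = -np.inf   # exclude already-seen items
    return list(np.argsort(-scores)[:K])             # K highest-scoring candidates

recs = recommend(0, K=2)
assert len(recs) == 2 and not set(recs) & train_items[0]
```

For large catalogues the full score matrix is usually computed in batches, but the ranking logic is the same.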

Computational Complexity
The computational cost of BM3 mainly occurs in the linear propagation with the normalized adjacency matrix Â, which takes $O(|\mathcal{E}|d)$ per graph convolutional layer when implemented with sparse matrix multiplication.

EXPERIMENTS
We perform comprehensive experiments to evaluate the effectiveness and efficiency of BM3 to answer the following research questions.
• RQ1: Can the self-supervised model leveraging only positive user-item interactions outperform or match the performance of the supervised baselines?
• RQ2: How efficient is the proposed BM3 model in multi-modal recommendation with regard to computational complexity and memory cost?
• RQ3: To what extent do the multi-modal features affect the recommendation performance of BM3?
• RQ4: How do the different losses in BM3 affect its recommendation accuracy?

Experimental Datasets
Following previous studies [9, 39], we use the Amazon review dataset [8] for experimental evaluation. This dataset provides both product descriptions and images, is publicly available, and varies in size across product categories. To ensure that as many baselines as possible can be evaluated on large-scale datasets, we choose three per-category datasets, i.e., Baby, Sports and Outdoors (denoted by Sports), and Electronics, for performance evaluation 1. In these datasets, each review rating is treated as a record of positive user-item interaction. This setting has been widely used in previous studies [9, 10, 39, 40]. The raw data of each dataset are pre-processed with a 5-core setting on both items and users, and the 5-core filtered results are presented in Table 2, where the data sparsity is measured as the number of interactions divided by the product of the number of users and the number of items. The pre-processed datasets include both visual and textual modalities. Following [39], we use the 4,096-dimensional visual features that have been extracted and published in [21]. For the textual modality, we extract textual embeddings by concatenating the title, descriptions, categories, and brand of each item and utilize sentence-transformers [23] to obtain 384-dimensional sentence embeddings.
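The sparsity measure above is simply the interaction density of the user-item matrix; for example (with made-up counts, not the values in Table 2):

```python
# Hypothetical 5-core statistics, for illustration only (not the numbers in Table 2).
n_users, n_items, n_interactions = 19_445, 7_050, 160_792

# Data sparsity as defined above: interactions / (#users * #items).
sparsity = n_interactions / (n_users * n_items)
print(f"sparsity: {sparsity:.5%}")
```

Values on the order of 0.1% or below are typical for 5-core-filtered Amazon categories, which is why the all-ranking evaluation below matters.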

Baseline Methods
To demonstrate the effectiveness of BM3, we compare it with the following state-of-the-art recommendation methods, including general CF recommendation models and multi-modal recommendation models.
• BPR [24]: This is a matrix factorization model optimized by a pair-wise ranking loss in a Bayesian way.
• LightGCN [10]: This is a simplified graph convolution network that only performs linear propagation and aggregation between neighbors. The hidden-layer embeddings are averaged to calculate the final user and item embeddings for prediction.
• BUIR [15]: This self-supervised framework uses an asymmetric network architecture to update its backbone network parameters. In BUIR, LightGCN is used as the backbone network. It is worth noting that BUIR does not rely on negative samples for learning.

• VBPR [9]: This model incorporates visual features for user preference learning with the BPR loss. Following [31, 39], we concatenate the multi-modal features of an item as its visual feature for user preference learning.
• MMGCN [34]: This method constructs a modality-specific graph to learn user preferences on each modality leveraging GCNs. The final user and item representations are generated by combining the learned representations from each modality.
• GRCN [33]: This method improves previous GCN-based models by refining the user-item bipartite graph with the removal of false-positive edges. User and item representations are learned on the refined bipartite graph by performing information propagation and aggregation.
• DualGNN [31]: This method builds an additional user-user correlation graph from the user-item bipartite graph and uses it to fuse each user's representation with those of its neighbors in the correlation graph.

1 Datasets are available at http://jmcauley.ucsd.edu/data/amazon/links.html
• LATTICE [39]: This method mines the latent structure between items by learning an item-item graph from their multimodal features.Graph convolutional operations are performed on both item-item graph and user-item interaction graph to learn user and item representations.
We group the first three baselines (i.e., BPR, LightGCN, and BUIR) as general models, because they only use implicit feedback (i.e., user-item interactions) for recommendation. The other multi-modal models utilize both implicit feedback and multi-modal features for recommendation. Analogously, we categorize BUIR as a self-supervised model and the others as supervised models, as they use negative samples for representation learning. The proposed BM3 model falls within the self-supervised multi-modal domain.

Setup and Evaluation Metrics
For a fair comparison, we follow the same evaluation setting as [31, 39], randomly splitting the interaction history of each user 8:1:1 for training, validation, and testing. Moreover, we use Recall@$K$ and NDCG@$K$ to evaluate the top-$K$ recommendation performance of the different recommendation methods. Specifically, we use the all-ranking protocol instead of the negative-sampling protocol to compute the evaluation metrics for recommendation accuracy comparison: in the recommendation phase, all items that the given user has not interacted with are regarded as candidate items. In the experiments, we report the results for $K$ at 10 and 20, and abbreviate the metrics Recall@$K$ and NDCG@$K$ as R@$K$ and N@$K$, respectively.
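With binary relevance, Recall@$K$ and NDCG@$K$ under the all-ranking protocol can be computed as in this sketch; the ranking and held-out items are made up for illustration.

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of a user's held-out items that appear in the top-k ranking."""
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / np.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / np.log2(pos + 2) for pos in range(min(len(relevant), k)))
    return dcg / idcg

ranked = [3, 7, 1, 9, 4]     # items sorted by predicted score, best first
relevant = {7, 9}            # held-out test items for this user
assert recall_at_k(ranked, relevant, 5) == 1.0
assert 0.0 < ndcg_at_k(ranked, relevant, 5) <= 1.0
```

Both metrics are averaged over all test users; NDCG additionally rewards placing the held-out items near the top of the ranking.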

Implementation Details
As in other existing work [10, 39], we fix the embedding size of both users and items to 64 for all models, initialize the embedding parameters with the Xavier method [6], and use Adam [13] as the optimizer with a learning rate of 0.001. For a fair comparison, we carefully tune the parameters of each model following its published paper. The proposed BM3 model is implemented in PyTorch [22]. We perform a grid search across all datasets to find its optimal settings. Specifically, the number of GCN layers is tuned in {1, 2}, the dropout rate for embedding perturbation is chosen from {0.3, 0.5}, and the regularization coefficient is searched in {0.1, 0.01}. For convergence consideration, the early-stopping patience and the total number of epochs are fixed at 20 and 1000, respectively. Following [39], we use R@20 on the validation data as the training stopping indicator. We have integrated our model and all baselines into the unified multi-modal recommendation platform MMRec [46].

Effectiveness of BM3 (RQ1)
The performance achieved by the different recommendation methods on all three datasets is summarized in Table 3. ('*': In pre-processing, DualGNN requires about 138 GB of memory and 6 hours to construct the user-user relationship graph on the Electronics data.) From the table, we have the following observations. First, the proposed BM3 model significantly outperforms both general recommendation methods and state-of-the-art multi-modal recommendation methods on each dataset. Specifically, BM3 improves over the best baselines by 3.68%, 6.15%, and 20.39% in terms of Recall@10 on Baby, Sports, and Electronics, respectively. The results not only verify the effectiveness of BM3 in recommendation, but also show that BM3 is superior to the baselines on the large graph (i.e., Electronics). Second, multi-modal recommendation models do not always outperform the general recommendation models that do not leverage modal features. Although the recommendation accuracy of VBPR, which builds upon BPR, dominates its counterpart (i.e., BPR) across all datasets, GRCN and DualGNN, which use LightGCN as their downstream CF model, do not gain much improvement over LightGCN. Differing from the multi-modal feature fusion mechanism of MMGCN, GRCN, and DualGNN, LATTICE uses the multi-modal features in an indirect manner by building an item-item relation graph and performing graph convolutional operations on it. We speculate there are two potential reasons for the suboptimal performance of MMGCN, GRCN, and DualGNN. i) They fuse the item ID embedding with its modal-specific features. Table 3 shows that LightGCN with ID embeddings alone can obtain good recommendation accuracy; mixing ID embeddings and modal features causes the items to lose their identities in recommendation, resulting in accuracy degradation. ii) They fail to differentiate the importance of the multi-modal features of items. MMGCN, GRCN, and DualGNN treat features from each modality equally. However, our ablation study in Section 4.7 shows that
the extracted visual features may contain noise and are less informative than the textual features. On the contrary, LATTICE learns the weights between multi-modal features when building the item-item graph. The proposed BM3 model alleviates the above issues by placing the contributions of the multi-modal features into the loss function. Finally, we compare the recommendation accuracy of the self-supervised learning models (i.e., BUIR and BM3). Although BUIR shows comparable performance with LightGCN, it is inferior to BM3. The performance of BUIR depends on the perturbed graph: a better contrastive view results in better recommendation accuracy. As a result, it obtains fluctuating performance over different datasets. BM3 removes the requirement of graph augmentation through latent embedding dropout, so it is easier for BM3 than for BUIR to obtain a contrastive view consistent with the original view. Moreover, BM3 is more efficient than BUIR, because it uses only one backbone network.

Efficiency of BM3 (RQ2)
Apart from the comparison of recommendation accuracy, we also report the efficiency of BM3 against the baselines in terms of memory usage and training time per epoch. It is worth noting that all models are first evaluated on a GeForce RTX 2080 Ti with 12 GB of memory, and a model is moved to a Tesla V100 GPU with 32 GB of memory if it cannot fit into the 12 GB of memory. The efficiencies of the different methods are summarized in Table 4.
From the table, we can make the following two observations. First, from both the general-model and multi-modal-model perspectives, graph-based models usually consume more memory than classic CF models (i.e., BPR and VBPR). Specifically, classic CF models require a minimal GPU memory cost for the representation learning of users and items, whilst graph-based models usually need to

Ablation Study (RQ3 & RQ4)
To fully understand the behaviors of BM3, we perform ablation studies on both the multi-modal features and different parts of the multi-modal contrastive loss.

Multi-modal Features (RQ3).
We evaluate the recommendation accuracy of BM3 by feeding individual modal features into the model.Specifically, we design the following variants of BM3.
• BM3 w/o v&t: In this variant, BM3 degrades to a general recommendation model that exploits only the user-item interactions for recommendation.
• BM3 w/o v: This variant removes the visual features and uses only the textual features.
• BM3 w/o t: This variant removes the textual features and uses only the visual features.

As shown in Table 5, the importance of textual and visual features varies with the datasets. BM3 w/o v, which leverages only the textual features, gains slightly better recommendation accuracy than BM3 w/o t on the Baby dataset. However, on the other datasets, the differences between these two variants are negligible. Moreover, we observe that the content features from either the textual or the visual modality can boost the performance of BM3 w/o v&t on the Baby and Sports datasets. However, this does not hold on the large dataset, i.e., Electronics. By combining both the textual and visual features, BM3 achieves the best recommendation accuracy on all three datasets. Comparing the experimental results in Table 5 with those in Table 3, we note that BM3 w/o v&t, without any multi-modal features, is competitive with most multi-modal baseline models and superior to the general baseline models. This demonstrates the effectiveness of the self-supervised learning paradigm for top-$K$ item recommendation. It is worth noting that BM3 uses LightGCN as its backbone network. The best performance of BM3 is achieved with 1, 1, and 2 GCN layers on the Baby, Sports, and Electronics datasets, respectively, whilst LightGCN itself requires 4 GCN layers to achieve its best performance on all datasets.

Multi-modal Contrastive Loss (RQ4).
Since the loss function plays a critical role in learning the model parameters of BM3 without negative samples, we further study the behaviors of BM3 by removing different parts of the multi-modal contrastive loss. Specifically, we consider the following variants of BM3 for experimental evaluation.
• BM3 w/o mm : This variant uses only the interaction graph reconstruction loss to train the model parameters, i.e., it is trained without the multi-modal losses. It is worth noting that this variant is equivalent to BM3 w/o v&t.
• BM3 w/o inter : In this variant, the representations of users and items are learned without the inter-modality alignment loss.
• BM3 w/o intra : This variant learns the representations of users and items without the intra-modality feature masked loss.

Table 6 shows the performance achieved by BM3, BM3 w/o mm, BM3 w/o inter, and BM3 w/o intra on all three datasets.
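The variants above simply drop terms from the joint objective. Schematically, this can be sketched as follows (a hedged sketch assuming an unweighted sum of the three loss terms; the function name and any weighting are my own assumptions, not the authors' implementation):

```python
def bm3_joint_loss(l_rec, l_inter, l_intra, use_inter=True, use_intra=True):
    """Joint multi-modal contrastive objective with ablation switches.

    BM3 w/o mm    -> use_inter=False, use_intra=False (graph reconstruction only)
    BM3 w/o inter -> use_inter=False
    BM3 w/o intra -> use_intra=False
    """
    loss = l_rec
    if use_inter:
        loss += l_inter  # inter-modality alignment term
    if use_intra:
        loss += l_intra  # intra-modality feature masked term
    return loss
```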
From Table 6, we find a similar pattern to that in the ablation study on multi-modal features: the multi-modal losses improve the recommendation accuracy of BM3 on the Baby and Sports datasets. However, leveraging either the inter-modality alignment loss or the intra-modality feature masked loss alone degrades the performance of BM3 on the Electronics dataset. Moreover, the importance of the inter- and intra-modality losses also varies across datasets.
From the ablation studies on the features and the loss function, we find that the recommendation accuracy on the large dataset (i.e., Electronics) shows a different pattern from that on the small-scale datasets (i.e., Baby and Sports). Using a single modality feature or a single multi-modal loss term in BM3 shows no improvement in recommendation accuracy on the Electronics dataset. We speculate that the supervised and self-supervised signals on a large dataset already enable BM3 w/o mm to learn good representations of users and items, so adding coarse multi-modal signals does not further improve the recommendation accuracy.

CONCLUSION
This paper proposes a novel self-supervised learning framework, named BM3, for multi-modal recommendation. BM3 removes the requirement of randomly sampled negative examples in modeling the interactions between users and items. To generate a contrastive view for self-supervised learning, BM3 utilizes a simple yet efficient latent embedding dropout mechanism to perturb the original embeddings of users and items. Moreover, a novel learning paradigm based on the multi-modal contrastive loss has been devised. Specifically, the contrastive loss jointly minimizes: a) the reconstruction loss of the user-item interaction graph, b) the alignment loss between the ID embeddings of items and their multi-modal features, and c) the masked loss within each modality-specific feature. We evaluate the proposed BM3 model on three real-world datasets, including one large-scale dataset, to demonstrate its effectiveness and efficiency in recommendation tasks. The experimental results show that BM3 achieves significant accuracy improvements over state-of-the-art multi-modal recommendation methods, while training 2-9× faster than the baseline methods.

APPENDICES
6.1 Hyper-parameter Sensitivity Study
To guide the selection of the hyper-parameters of BM3, we perform a sensitivity study with regard to recommendation accuracy in terms of Recall@20. We use at least two datasets to evaluate the performance of BM3 under different hyper-parameter settings. Specifically, we consider the following three hyper-parameters: the number of GCN layers, the embedding dropout ratio, and the regularization coefficient.
6.1.1 The Number of GCN Layers. The number of GCN layers in BM3 is varied in {1, 2, 3, 4}. Fig. 2 shows the performance trends of BM3 with respect to different settings of this hyper-parameter. As shown in Fig. 2, BM3 shows relatively slow performance degradation as the number of layers increases on the small-scale datasets (i.e., Baby and Sports). However, on the Electronics dataset, the recommendation accuracy can be improved with more than one GCN layer in the backbone network.
6.1.2 The Dropout Ratio and Regularization Coefficient. We vary the dropout ratio of BM3 from 0.1 to 0.5 with a step of 0.1, and vary the regularization coefficient in {0.0001, 0.001, 0.01, 0.1}. Fig. 3 shows the performance achieved by BM3 under different combinations of the embedding dropout ratio and regularization coefficient. We note that a larger dropout ratio usually helps BM3 achieve better recommendation accuracy on the relatively small-scale dataset (i.e., Sports). Moreover, the performance of BM3 is less sensitive to the setting of the regularization coefficient on the large dataset (i.e., Electronics). In Fig. 3(b), it is worth noting that BM3 achieves competitive recommendation accuracy whenever the dropout ratio is larger than 0.2. This verifies the stability of BM3 in the recommendation task, i.e., the recommendation performance of BM3 is not just a consequence of random seeds.
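The latent embedding dropout studied here can be sketched as follows (a minimal numpy sketch; the function name and the inverted-scaling choice are assumptions rather than the authors' exact implementation, which is in the linked repository):

```python
import numpy as np

def dropout_view(h, ratio, rng):
    """Generate a contrastive view of embeddings `h` by randomly zeroing a
    `ratio` fraction of entries; survivors are rescaled so the expected
    value of each entry is preserved (inverted dropout)."""
    mask = rng.random(h.shape) >= ratio
    return h * mask / (1.0 - ratio)
```

With a ratio of 0.0 the view equals the original embeddings; larger ratios yield stronger perturbations of the online representations.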

Figure 2: Top-20 recommendation accuracy of BM3 varies with the number of GCN layers in the backbone network.

Figure 3: The performance achieved by BM3 with respect to different combinations of the latent embedding dropout ratio and regularization coefficient on three datasets. A darker background indicates better recommendation accuracy.

Table 1: Comparison of computational complexity of graph-based multi-modal methods. To fit the page, we set  = 2|E|/; $d_h$ denotes the dimension of the hidden layer in a two-layer MLP, and the remaining symbols denote the linear transformation matrix and bias in the predictor function.

Graph Reconstruction Loss. BM3 takes a positive user-item pair $(u, i)$ as input. With the generated contrastive view $(\tilde{h}_u, \tilde{h}_i)$ of the online representations $(h_u, h_i)$, we define a symmetrized loss function as the negative cosine similarity between $(h_u, \tilde{h}_i)$ and $(\tilde{h}_u, h_i)$:

$\mathcal{L}_{rec} = C(h_u, \tilde{h}_i) + C(\tilde{h}_u, h_i), \quad (7)$

where $C(a, b) = -\frac{a^\top b}{\|a\|\,\|b\|}$ denotes the negative cosine similarity.
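For concreteness, the symmetrized graph reconstruction loss of Eq. (7) can be sketched as follows (a minimal numpy sketch; the function names and batch averaging are my own assumptions, not the authors' implementation, which is in the linked repository):

```python
import numpy as np

def neg_cosine(a, b):
    """C(a, b): negative cosine similarity, averaged over a batch of row vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float(-np.mean(np.sum(a * b, axis=-1)))

def graph_reconstruction_loss(h_u, h_i, h_u_tilde, h_i_tilde):
    """Symmetrized loss of Eq. (7): C(h_u, h~_i) + C(h~_u, h_i)."""
    return neg_cosine(h_u, h_i_tilde) + neg_cosine(h_u_tilde, h_i)
```

When the online representations and their contrastive views are perfectly aligned, each term reaches its minimum of -1, so the loss approaches -2; for orthogonal pairs the loss is 0.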

Table 2: Statistics of the experimental datasets.

As our MMCL can learn a good predictor on the user and item latent embeddings, we use the embeddings transformed by the predictor for the inner product.
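Assuming (per Table 1's caption) that the predictor is a linear map $p(h) = hW + b$, the scoring step described above might look like the following hypothetical sketch; whether the predictor is applied to the user side, the item side, or both is an assumption here:

```python
import numpy as np

def predict_scores(h_users, h_items, W, b):
    """Apply a linear predictor p(h) = hW + b to both sides, then score
    user-item pairs by inner product. W, b are the predictor parameters."""
    p_u = h_users @ W + b   # (num_users, d) transformed user embeddings
    p_i = h_items @ W + b   # (num_items, d) transformed item embeddings
    return p_u @ p_i.T      # (num_users, num_items) score matrix
```

Top-K recommendation then simply takes the K highest-scoring items per row of the returned matrix.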

Table 3: Overall performance achieved by different recommendation methods in terms of Recall and NDCG. We mark the global best results on each dataset under each metric in boldface, and the second best is underlined. ' indicates that the model cannot be fitted into a Tesla V100 GPU card with 32 GB of memory.

Table 4: Efficiency comparison of BM3 against the baselines. ' denotes that the model cannot be fitted into a Tesla V100 GPU card with 32 GB of memory.

Table 5: Ablation study of BM3 on multi-modal features.

more memory, as they use both the user-item graph and the multi-modal features in general. Second, among the graph-based multi-modal recommendation models, BM3 consumes less or comparable memory compared with the other baselines, while reducing the training time per epoch by 2-9×. Compared with the best baseline, LATTICE, BM3 requires only half of the training time and half of the memory. Although BM3 uses LightGCN as its backbone model, it introduces little additional cost over LightGCN beyond the multi-modal features, because BM3 removes the negative sampling time and uses fewer GCN layers.
• BM3 w/o v : This variant of BM3 learns the representations of users and items without the visual features of items.
• BM3 w/o t : This variant of BM3 is trained without the input from the textual features.

Table 5 summarizes the recommendation performance of BM3 and its variants, i.e., BM3 w/o v&t, BM3 w/o v, and BM3 w/o t, on all three experimental datasets.

Table 6: Ablation study on different parts of the multi-modal contrastive loss of BM3.