A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation

Multimodal recommender systems utilizing multimodal features (e.g., images and textual descriptions) typically show better recommendation accuracy than general recommendation models based solely on user-item interactions. Generally, prior work fuses multimodal features into item ID embeddings to enrich item representations and thus fails to capture the latent semantic item-item structures. In this context, LATTICE proposes to learn the latent structure between items explicitly and achieves state-of-the-art performance for multimodal recommendation. However, we argue that the latent graph structure learning of LATTICE is both inefficient and unnecessary. Experimentally, we demonstrate that freezing its item-item structure before training can achieve competitive performance. Based on this finding, we propose a simple yet effective model, dubbed FREEDOM, that FREEzes the item-item graph and DenOises the user-item interaction graph simultaneously for Multimodal recommendation. Theoretically, we examine the design of FREEDOM from a graph spectral perspective and demonstrate that it possesses a tighter upper bound on the graph spectrum. In denoising the user-item interaction graph, we devise a degree-sensitive edge pruning method, which rejects possibly noisy edges with high probability when sampling the graph. We evaluate the proposed model on three real-world datasets and show that FREEDOM can significantly outperform the strongest current baselines. Compared with LATTICE, FREEDOM achieves an average improvement of 19.07% in recommendation accuracy while reducing its memory cost by up to 6$\times$ on large graphs. The source code is available at: https://github.com/enoche/FREEDOM.


INTRODUCTION
The increasing availability of multimodal information (e.g., images, texts and videos) associated with items enables users to gain a more in-depth and comprehensive understanding of the items they are interested in. Consequently, multimodal recommender systems (MRSs) leveraging multimodal information are a recent trend for capturing the preferences of users for accurate item recommendations on online platforms, such as e-commerce and instant-video services. Studies [9, 28] demonstrate that MRSs usually perform better than general recommendation models, which only utilize the historical user-item interactions.
The primary challenge in MRSs is how to effectively integrate the multimodal information into the collaborative filtering (CF) framework. Conventional studies either fuse the projected low-dimensional multimodal features with the ID embeddings of items via concatenation or summation operations [9, 14], or leverage attention mechanisms to capture users' preferences on items [2, 4, 13]. The surge of research on graph-based recommendation [26, 29, 33, 35] inspires a line of work [22, 24, 28, 31] that exploits the power of graph neural networks (GNNs) to capture the high-order semantics between multimodal features and user-item interactions. Specifically, MMGCN [25] utilizes graph convolutional networks (GCNs) to propagate and aggregate information on every modality of the item. On top of MMGCN, GRCN [24] refines the user-item bipartite graph by weighing the user-item edges according to the affinity between user preference and item content. To achieve better recommendation performance, researchers exploit auxiliary graph structures to explicitly capture the relations between users or items. For instance, DualGNN [22] constructs a user-user relation graph to smoothen users' preferences with their neighbors via GNNs. Nonetheless, the authors of LATTICE [28] argue that the latent item-item structures underlying the multimodal contents of items could lead to better representation learning. Hence, LATTICE first dynamically constructs a latent item-item graph by considering raw and projected multimodal features learned from multi-layer perceptrons (MLPs). It then performs graph convolutions on the constructed latent item-item graph to explicitly incorporate item relationships into representation learning. As a result, LATTICE can exploit both the high-order interaction semantics from the user-item graph and the latent content semantics from the item-item structure. Although this paradigm turns out to be effective, it poses a prohibitive cost by requiring computation and memory quadratic in the number of items.
Table 1: LATTICE with frozen item-item graph structures (i.e., LATTICE-Frozen) shows slightly better performance than its original version on Baby and Sports in terms of Recall and NDCG (refer to Section 5 for detailed experiment settings).

Dataset    Metric   LATTICE   LATTICE-Frozen
Baby       R@10     0.0547    0.0551
           R@20     0.0850    0.0873
           N@10     0.0292    0.0291
           N@20     0.0370    0.0373
Sports     R@10     0.0620    0.0626
           R@20     0.0953    0.0964
           N@10     0.0335    0.0336
           N@20     0.0421    0.0423
Clothing   R@10     0.0492    0.0434
           R@20     0.0733    0.0635
           N@10     0.0268    0.0227
           N@20     0.0330    0.0279

In this paper, we first experimentally disclose that the item-item structure learning of LATTICE is dispensable. Specifically, we build an item-item graph directly from the raw multimodal contents of items and freeze it when training LATTICE. Under the same evaluation settings and datasets as LATTICE, our empirical experiments in Table 1 show that LATTICE with frozen item-item graph structures (i.e., LATTICE-Frozen) gains slightly better performance on two out of three datasets. To further refine the structures in both the user-item bipartite graph and the item-item graph, we propose a graph-structure FREEzing and DenOising Multimodal model for recommendation, dubbed FREEDOM. To be specific, the item-item graph constructed by LATTICE retains the affinities between items as edge weights, which may be noisy as the multimodal features are extracted from general pre-trained models (e.g., Convolutional Neural Networks or Transformers). FREEDOM discretizes the weighted item-item graph into an unweighted one so that information propagation in GCNs depends only on the graph structure. For the user-item graph, we further introduce a degree-sensitive edge pruning technique to denoise its structure by removing the noise caused by unintentional interactions or bribes [34]. Inspired by [3], we sample edges following a multinomial distribution with pre-calculated parameters to construct a sparsified subgraph. Finally, FREEDOM learns the representations of users and items by integrating the unweighted item-item graph and the sparsified user-item subgraph. We analyze FREEDOM from a spectral perspective to demonstrate that it is capable of acting as a tighter low-pass filter. We conduct comprehensive experiments on
three real-world datasets to show that our proposed model can significantly outperform the state-of-the-art methods in terms of recommendation accuracy.

RELATED WORK

Multimodal Recommendation
Multimodal recommendation models utilize deep or graph learning techniques to effectively incorporate the multimodal information of items into the classic CF paradigm for better recommendation performance. Previous work [9, 14] extends the BPR method [18] by fusing the visual and/or style content of items with their ID embeddings and obtains a considerable performance boost. To further differentiate users' preferences on multimodal information, attention mechanisms are exploited in multimodal recommendation models. For instance, VECF [4] utilizes the VGG model [20] to perform pre-segmentation on images and captures the user's attention on different image regions. MAML [13] uses a two-layer neural network to capture the user's preference for the textual and visual features of an item. With the surge of GNNs applied in recommender systems [26], researchers are inspired to inject high-order semantics into user/item representation learning via GNNs. MMGCN [25] adopts the message-passing mechanism of GCNs and constructs a modality-specific user-item bipartite graph, which can capture information from multi-hop neighbors to enhance the user and item representations. Following MMGCN, GRCN [24] introduces a graph refinement layer that refines the structure of the user-item interaction graph by identifying noisy edges and corrupting false-positive edges. In DualGNN [22], the authors argue that users' preferences may dynamically evolve with time. They introduce a user co-occurrence graph with a preference learning module to capture the user's preference for features from different modalities of an item. However, the aforementioned models mine the semantic information between items in an implicit manner, which may lead to inferior performance. To this end, LATTICE [28] explicitly constructs item-item relation graphs for each modality and fuses them together to obtain a latent item-item graph. It then dynamically updates the item-item graph with projected multimodal features from MLPs and achieves state-of-the-art recommendation accuracy. Recently, we also see an emergence of work applying self-supervised learning to multimodal recommendation [21, 36]. For an in-depth exploration of multimodal recommender systems, we recommend the comprehensive survey conducted by [30].

Denoising Graph Structures
Studies on denoising graph structures can be roughly categorized as either permanent or temporary edge pruning, where edges are pruned or retained based on calculated scores. The former line of work [15, 22, 23, 27] leverages attention mechanisms or pre-defined functions to calculate a node's attention or affinity scores with its neighbors. Based on the scores, they either permanently prune the edges satisfying a pre-defined condition or decrease the weights on those edges. The latter line of work [7, 19] iteratively prunes the edges of the graph to extract a smaller subgraph for graph learning [15]. It is worth mentioning that DropEdge [19] is an intuitive and widely-used method that prunes edges following a uniform distribution. Other studies [5, 11] place a parameterized distribution or function on the edges of the graph and learn the parameters in a supervised manner. A sparsified graph is then sampled from the original graph following the learned distributions or functions. For computational efficiency, in this paper we pre-calculate and fix the parameters of the distribution used for subgraph sampling.

FREEZING AND DENOISING GRAPH STRUCTURES
In this section, we elaborate on each component of FREEDOM, from graph construction to item recommendation. Fig. 1 shows the overall architecture in comparison with LATTICE [28].

Constructing Frozen Item-Item Graph
Following [28], FREEDOM also uses $k$NN to construct an initial modality-aware item-item graph $S^m$ using raw features from each modality $m$. Considering $N$ items, we calculate the similarity score $S^m_{ij}$ between item pair $i$ and $j$ with a cosine similarity function on their raw features ($x^m_i$ and $x^m_j$) of modality $m$. That is:
$$S^m_{ij} = \frac{(x^m_i)^\top x^m_j}{\lVert x^m_i \rVert \, \lVert x^m_j \rVert}, \qquad (1)$$
where $S^m_{ij}$ is the $i$-th row, $j$-th column element of matrix $S^m \in \mathbb{R}^{N \times N}$. We further employ $k$NN sparsification [1] and convert the weighted $S^m$ into an unweighted matrix. That is, for each item $i$, we only retain the connection relations of its top-$k$ similar edges:
$$\hat{S}^m_{ij} = \begin{cases} 1, & S^m_{ij} \in \text{top-}k(S^m_i), \\ 0, & \text{otherwise}. \end{cases}$$
Each element in $\hat{S}^m$ is either 0 or 1, with 1 denoting a latent connection between the two items. We empirically fix the value of $k$ at 10. Note that $\hat{S}^m$ is different from the weighted similarity matrix of LATTICE, which uses the affinity values between items as its elements. We normalize the discretized adjacency matrix $\hat{S}^m$ as $\tilde{S}^m = (D^m)^{-1/2} \hat{S}^m (D^m)^{-1/2}$, where $D^m \in \mathbb{R}^{N \times N}$ is the diagonal degree matrix of $\hat{S}^m$ and $D^m_{ii} = \sum_j \hat{S}^m_{ij}$. With the resulting modality-aware adjacency matrices, we construct the latent item-item graph by aggregating the structures from each modality:
$$S = \sum_{m \in \mathcal{M}} \alpha_m \tilde{S}^m,$$
where $S \in \mathbb{R}^{N \times N}$, $\alpha_m$ is the importance score of modality $m$ and $\mathcal{M}$ is the set of modalities. As in other studies [13, 28], we consider visual and textual modalities, denoted by $\mathcal{M} = \{v, t\}$, in this paper.
The importance score can be learned via parametric functions. Here, we reduce the model parameters by introducing a hyperparameter $\alpha_v$ denoting the importance of the visual modality in constructing $S$, and let $\alpha_t = 1 - \alpha_v$. Finally, we freeze the latent item-item graph, which greatly improves the efficiency of FREEDOM. To be specific, the construction of the similarity matrix under modality $m$ in Eq. (1) requires a computational complexity of $O(N^2 d_m)$, where $d_m$ is the dimension of the raw features. As shown in Fig. 1, FREEDOM pre-calculates (dotted lines) the item-item graph before training and freezes it during model training. As a result, it removes the computational burden of $O(N^2 d_m)$ for graph construction in training.
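To make the construction concrete, the following NumPy sketch builds the frozen item-item graph along the lines described above: per-modality cosine similarity, top-$k$ discretization to a 0/1 matrix, symmetric normalization, and weighted fusion over modalities. The function and variable names are ours, not from the released code.

```python
import numpy as np

def build_frozen_item_graph(feats_by_modality, weights, k=10):
    """Build the frozen item-item graph: per-modality cosine kNN,
    discretize to 0/1, symmetrically normalize, then fuse by weight."""
    n = next(iter(feats_by_modality.values())).shape[0]
    fused = np.zeros((n, n))
    for m, x in feats_by_modality.items():
        # cosine similarity on the raw features of modality m
        xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
        sim = xn @ xn.T
        # kNN sparsification: keep top-k neighbours per item, weight -> 1
        adj = np.zeros_like(sim)
        topk = np.argsort(-sim, axis=1)[:, :k]
        rows = np.repeat(np.arange(n), k)
        adj[rows, topk.ravel()] = 1.0
        # symmetric normalization D^{-1/2} A D^{-1/2}
        deg = adj.sum(axis=1)
        d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
        adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
        fused += weights[m] * adj
    return fused
```

With $\alpha_v = 0.1$, the call would be `build_frozen_item_graph({'v': v_feats, 't': t_feats}, {'v': 0.1, 't': 0.9}, k=10)`, matching the fixed $k = 10$ used in the paper.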

Denoising User-Item Bipartite Graph
In this section, we introduce a degree-sensitive edge pruning method to denoise the user-item bipartite graph. The idea is derived from recent research on model sparsification [19] and simplification [3]. Specifically, DropEdge [19] randomly drops a certain ratio of edges during training. In [3], the authors verify that popular nodes are more likely to suffer from over-smoothing. Inspired by this finding, we sparsify the graph by pruning superfluous edges following a degree-sensitive probability.
Formally, we denote the user-item graph as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes and $\mathcal{E}$ is the set of edges. The number of users and items in the user-item graph is $U$ and $I$, respectively, with $U + I = |\mathcal{V}|$, where $|\cdot|$ denotes the cardinality of a set. We construct a symmetric adjacency matrix $A$, where each entry $A_{ui}$ is set to 1 if user $u$ has interacted with item $i$ and to 0 otherwise. Given a specific edge $e_k \in \mathcal{E}$ $(0 \le k < |\mathcal{E}|)$ which connects nodes $u$ and $i$, we calculate its probability as $p_k = 1/\sqrt{\omega_u \omega_i}$, where $\omega_u$ and $\omega_i$ are the degrees of nodes $u$ and $i$ in graph $\mathcal{G}$, respectively. Usually, we prune a certain proportion $\rho$ of the edges of the graph; that is, the number of edges to be pruned is $\lfloor \rho |\mathcal{E}| \rfloor$, where $\lfloor\cdot\rfloor$ is the floor function. As a result, the number of retained edges is $n = \lceil |\mathcal{E}|(1-\rho) \rceil$. Thus, we sample edges from the multinomial distribution with index $n$ and parameter vector $\mathbf{p} = (p_0, \ldots, p_{|\mathcal{E}|-1})$, normalized to sum to one. In this way, edges connecting high-degree nodes have a low probability of being sampled from the graph; that is, these edges are more likely to be pruned in $\mathcal{G}$. We then construct a symmetric adjacency matrix $A_\rho$ based on the sampled edges following Eq. (4). In line with the prior latent item-item graph, we also perform the re-normalization trick on $A_\rho$, resulting in $\hat{A}_\rho$. Same as DropEdge, FREEDOM prunes the user-item graph and normalizes the sampled adjacency matrix iteratively in each training epoch. However, we resort to the original normalized adjacency matrix $\hat{A} = D^{-1/2} A D^{-1/2}$ in model inference.
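A minimal sketch of the pruning step, assuming an edge-list representation; note that for simplicity this draws the $n$ retained edges without replacement, whereas the paper samples from a multinomial distribution (which may repeat edges):

```python
import numpy as np

def degree_sensitive_prune(edges, num_nodes, rho, rng=None):
    """Keep n = ceil(|E| * (1 - rho)) edges; an edge (u, i) is retained
    with probability proportional to 1/sqrt(deg(u) * deg(i)), so edges
    touching high-degree nodes are pruned more often."""
    rng = np.random.default_rng() if rng is None else rng
    deg = np.zeros(num_nodes)
    for u, i in edges:
        deg[u] += 1
        deg[i] += 1
    # un-normalized retention probabilities, then normalize to sum to one
    p = np.array([1.0 / np.sqrt(deg[u] * deg[i]) for u, i in edges])
    p /= p.sum()
    n_keep = int(np.ceil(len(edges) * (1.0 - rho)))
    kept_idx = rng.choice(len(edges), size=n_keep, replace=False, p=p)
    return [edges[j] for j in kept_idx]
```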

Integration of Two Graphs for Learning
We perform graph convolutions on both graphs; that is, we employ a light-weight GCN [10] for information propagation and aggregation on $S$ and $\hat{A}_\rho$. Specifically, the graph convolution over the item-item graph is defined as:
$$\tilde{h}^{(l)}_i = \sum_{j \in \mathcal{N}(i)} S_{ij}\, \tilde{h}^{(l-1)}_j,$$
where $\mathcal{N}(i)$ is the set of neighbor items of $i$, $\tilde{h}^{(l)}_i \in \mathbb{R}^d$ is the $l$-th layer representation of item $i$, $\tilde{h}^{(0)}_i$ denotes its corresponding ID embedding vector, and $d$ is the dimension of an item or user ID embedding. We stack multiple convolutional layers on the item-item graph $S$ and take the last-layer representation as the representation $\tilde{h}_i \in \mathbb{R}^d$ of item $i$ from the multimodal view. Analogously, in the user-item graph, we perform convolutional operations on $\hat{A}_\rho$ and obtain the embedding of a user $h_u \in \mathbb{R}^d$ or an item $h_i \in \mathbb{R}^d$ with a readout function on all the hidden representations produced at each layer:
$$h_u = \text{READOUT}\big(h^{(0)}_u, \ldots, h^{(L)}_u\big), \qquad h_i = \text{READOUT}\big(h^{(0)}_i, \ldots, h^{(L)}_i\big),$$
where the READOUT function can be any differentiable function, and $h^{(0)}_u$ and $h^{(0)}_i = \tilde{h}^{(0)}_i$ denote the ID embeddings of user $u$ and item $i$, respectively. We use the default mean function of LightGCN [10] for embedding readout.
Finally, we use the user representation output by the user-item graph as its final representation. For an item, we sum up the representations obtained from the two graphs as its final representation.
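Putting the two graphs together, a dense NumPy sketch of one forward pass might look as follows (the actual model uses sparse PyTorch operations; the layer counts and mean readout follow the LightGCN default described above, and the function name is ours):

```python
import numpy as np

def freedom_forward(S, A_norm, user_emb, item_emb, n_ii_layers=1, n_ui_layers=2):
    """One forward pass: propagate item IDs on the frozen item-item
    graph S, propagate user/item IDs on the normalized user-item
    graph A_norm (LightGCN-style), mean-readout, then sum item views."""
    # item-item view: stack n_ii_layers convolutions, keep the last layer
    h = item_emb
    for _ in range(n_ii_layers):
        h = S @ h
    item_mm = h
    # user-item view: A_norm is the (n_u + n_i) x (n_u + n_i) normalized adjacency
    n_u = user_emb.shape[0]
    e = np.vstack([user_emb, item_emb])
    layers = [e]
    for _ in range(n_ui_layers):
        e = A_norm @ e
        layers.append(e)
    readout = np.mean(layers, axis=0)  # mean over layer-0 .. layer-L
    users, items = readout[:n_u], readout[n_u:]
    # final user = CF view; final item = CF view + multimodal view
    return users, items + item_mm
```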
To fully explore the raw features, we project the multimodal features of item $i$ in each modality via MLPs:
$$\dot{h}_{i,m} = x^m_i W_m + b_m,$$
where $W_m \in \mathbb{R}^{d_m \times d}$ and $b_m \in \mathbb{R}^d$ denote the linear transformation matrix and bias in the MLP. In this way, each uni-modal representation $\dot{h}_{i,m}$ shares the same latent space with its ID embedding.
For model optimization, we adopt the pairwise Bayesian personalized ranking (BPR) loss [18], which encourages the prediction of a positive user-item pair to be scored higher than its negative counterpart:
$$\mathcal{L} = \sum_{(u,i,j) \in \mathcal{D}} \Big( -\ln \sigma\big(h_u^\top h_i - h_u^\top h_j\big) - \lambda \sum_{m \in \mathcal{M}} \ln \sigma\big(h_u^\top \dot{h}_{i,m} - h_u^\top \dot{h}_{j,m}\big) \Big),$$
where $\mathcal{D}$ is the set of training instances, and each triple $(u, i, j)$ satisfies $A_{ui} = 1$ and $A_{uj} = 0$. $\sigma(\cdot)$ is the sigmoid function and $\lambda$ is a hyperparameter of FREEDOM that weighs the reconstruction losses between user-item ID embeddings and projected multimodal features.
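The combined objective can be sketched as below; the exact placement of the $\lambda$-weighted multimodal terms is our reading of Eq. (10), so treat this as illustrative rather than the released implementation:

```python
import numpy as np

def bpr_loss(u, h_pos, h_neg, mm_pos=None, mm_neg=None, lam=1e-3):
    """Pairwise BPR loss on ID embeddings; optionally adds a
    lambda-weighted BPR term on projected multimodal item features."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    # score difference between the positive and negative item
    diff = np.sum(u * (h_pos - h_neg), axis=-1)
    loss = -np.log(sigmoid(diff) + 1e-12)
    if mm_pos is not None:
        mm_diff = np.sum(u * (mm_pos - mm_neg), axis=-1)
        loss += lam * -np.log(sigmoid(mm_diff) + 1e-12)
    return loss.mean()
```

A well-separated positive/negative pair drives the loss toward zero, while a mis-ordered pair is penalized heavily.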

Top-𝐾 Recommendation
To generate item recommendations for a user, we first predict the interaction scores between the user and candidate items. Then, we rank the candidate items by predicted interaction score in descending order and choose the $K$ top-ranked items as recommendations for the user. The interaction score is calculated as $s(u, i) = h_u^\top h_i$. A high score suggests that the user prefers the item. Note that we only use user and item ID embeddings for prediction, because we empirically find that adding the projected item multimodal features in prediction does not improve performance. However, the item multimodal representations can partially benefit the learning of user representations in FREEDOM via Eq. (10).
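As a sketch, generating top-$K$ recommendations from the final ID embeddings reduces to a masked dot-product ranking (the function name and masking convention are ours):

```python
import numpy as np

def recommend_topk(user_vec, item_mat, interacted, k=10):
    """Score s(u, i) = u . h_i over all items, mask already-seen items,
    and return the indices of the K highest-scoring items."""
    scores = item_mat @ user_vec
    scores[list(interacted)] = -np.inf  # exclude training interactions
    return np.argsort(-scores)[:k]
```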

SPECTRAL ANALYSIS
In this section, we examine the benefits of freezing the item-item graph in FREEDOM through the lens of spectral analysis.We show that it possesses a tighter upper bound on the graph spectrum.In addition, we empirically calculate the largest eigenvalues of the normalized graph Laplacian on experimental datasets to validate our analysis.

Theoretical Analysis
Lemma 4.1. The eigenvalues of the frozen matrix in FREEDOM possess a tighter upper bound.
Proof (sketch). Let $\lambda$ be an eigenvalue of the non-negative matrix $S$. The function in Eq. (12) is monotonically increasing, and maximizing Eq. (12) requires the maximized $s'_{ij} \ge \bar{d}$ (the average value of the node degree). Hence, we can derive $\max_{i,j} s_{ij} \le \max_{i,j} s'_{ij}$. $\square$ In consideration of the page limit, we have moved the more detailed proof to Appendix A.

Empirical Study
We compute the largest eigenvalues of the item-item matrix in both FREEDOM and LATTICE, as shown in Table 2. The results validate our preceding analysis. A tighter upper bound on the eigenvalues of the item-item graph ensures that the frozen item-item graph in FREEDOM can act as a low-pass filter, eliminating the effect of negative coefficients at large frequencies. We further investigate the impact of the frozen item-item graph in FREEDOM and the learnable graph in LATTICE on validation accuracy. The plots in Fig. 2 reveal that the frozen graph in FREEDOM achieves higher accuracy and more stable performance than the learnable graph in LATTICE.
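The empirical check is straightforward to reproduce: symmetrically normalize an item-item adjacency matrix and take its largest eigenvalue. A dense NumPy sketch (the paper's graphs would call for sparse eigensolvers instead):

```python
import numpy as np

def largest_eigenvalue(adj):
    """Largest eigenvalue of the symmetrically normalized adjacency
    D^{-1/2} A D^{-1/2}; symmetrized first, since a kNN adjacency
    need not be symmetric."""
    deg = adj.sum(axis=1)
    d = np.where(deg > 0, deg ** -0.5, 0.0)
    s = d[:, None] * adj * d[None, :]
    return float(np.max(np.linalg.eigvalsh((s + s.T) / 2)))
```

For any undirected 0/1 graph the result is at most 1, which is the classical bound on the spectrum of a normalized adjacency matrix.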

EXPERIMENTS
To evaluate the effectiveness and efficiency of our proposed FREEDOM, we conduct experiments on three real-world datasets to answer the following research questions.
• RQ1: How does FREEDOM perform compared with the state-of-the-art methods for recommendation? As our model improves LATTICE [28] by freezing and denoising the graph structures, how large is its improvement over LATTICE?
• RQ2: How efficient is our proposed FREEDOM in terms of computational complexity and memory cost?
• RQ3: How do different components in FREEDOM influence its recommendation accuracy?
• RQ4: How sensitive is our model under the perturbation of hyperparameters?

Experimental Datasets
Following [28], we conduct experiments on three categories of the Amazon review dataset [8]: (a) Baby, (b) Sports and Outdoors, and (c) Clothing, Shoes and Jewelry. For simplicity, we denote them as Baby, Sports and Clothing, respectively. The Amazon review dataset provides both visual and textual information about the items and varies in the number of items under different categories. The raw data of each dataset are pre-processed with a 5-core setting on both items and users, and the filtered results are presented in Table 3. We directly use the 4,096-dimensional visual features extracted by pre-trained Convolutional Neural Networks [8]. For the textual modality, we extract 384-dimensional textual embeddings by utilizing sentence-transformers [17] on the concatenation of the title, descriptions, categories, and brand of each item.

Baseline Methods
To demonstrate the effectiveness of FREEDOM, we compare it with state-of-the-art recommendation methods in two categories. The first category comprises two general CF models that recommend personalized items to users solely based on user-item interactions:
• BPR [18] optimizes the latent representations of users and items under the framework of matrix factorization (MF) with a BPR loss.
• LightGCN [10] simplifies the vanilla GCN by removing its non-linear activation and feature transformation layers for recommendation.
The second category includes six multimodal recommendation models that leverage both user interactions and multimodal information about items for recommendation:
• VBPR [9] incorporates visual features for user preference learning under the MF framework and BPR loss. Following [22, 28], we concatenate the multimodal features of an item as its visual feature for user preference learning.
• MMGCN [25] fuses the representations generated by GCNs in each modality of items for recommendation.
• GRCN [24] refines the user-item bipartite graph by removing false-positive edges for multimodal recommendation. Based on the refined graph, it then learns user and item representations by performing information propagation and aggregation via GCNs.
• DualGNN [22] augments the representations of users in GCNs with an additional user-user correlation graph that is extracted from the user-item graph.
• LATTICE [28] learns the latent semantic item-item structures from the multimodal features for recommendation.As FREEDOM improves LATTICE, we highlight its improvement over LATTICE in Table 4.
• SLMRec [21] incorporates self-supervised learning into multimedia recommendation.It proposes three data augmentations to uncover the multimodal patterns in data for contrastive learning.

Evaluation Protocols
For a fair comparison, we adopt the same evaluation settings as [21, 22, 28]. Specifically, we use two widely-used evaluation protocols for top-$K$ recommendation: Recall@$K$ and NDCG@$K$, which we refer to as R@$K$ and N@$K$ for brevity. We report the average metrics over all users in the test set under both $K = 10$ and $K = 20$. For each user in the evaluated dataset, we randomly split 80% of the historical interactions for training, 10% for validation and the remaining 10% for testing. During training, we adopt a negative sampling strategy to pair each observed user-item interaction in the training set with one negative item that the user has not interacted with before. We use the all-ranking protocol to compute the evaluation metrics for recommendation accuracy comparison.
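For reference, the two protocols can be implemented in a few lines; this is the standard binary-relevance formulation, which we believe matches the Recall@K/NDCG@K definitions commonly used in this line of work:

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of a user's held-out items that appear in the top-K list."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranked list over the ideal DCG."""
    rel = set(relevant)
    dcg = sum(1.0 / np.log2(i + 2) for i, it in enumerate(ranked[:k]) if it in rel)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg
```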

Implementation and Hyperparameter Settings
Following existing work [10, 28], we fix the embedding size of both users and items to 64 for all models, initialize the embedding parameters with the Xavier method [6], and use Adam [12] as the optimizer. For a fair comparison, we carefully tune the parameters of each model following their published papers. All models are implemented in PyTorch [16] and evaluated on a Tesla V100 GPU card with 32 GB memory. To reduce the hyperparameter search space of FREEDOM, we fix the number of GCN layers in the user-item bipartite graph and the item-item graph at 2 and 1, respectively. We empirically fix the hyperparameter $\lambda$ at 1e-3 and the visual feature ratio $\alpha_v$ at 0.1. We then perform a grid search on the other hyperparameters of FREEDOM across all datasets to find its optimal settings. Specifically, the ratio $\rho$ of the degree-sensitive edge pruning is searched in {0.8, 0.9}. For convergence, the early-stopping patience and the total number of epochs are fixed at 20 and 1000, respectively. Following [28], we use R@20 on the validation data as the training-stopping indicator. To ensure a fair comparison, all baseline models as well as our proposed model are integrated into the unified multimodal recommendation framework MMRec [32].

Performance Comparison
Effectiveness (RQ1). Table 4 reports the comparison of recommendation accuracy in terms of Recall and NDCG, from which we have the following observations: 1). Although LATTICE obtains the second-best results on the Baby and Clothing datasets, FREEDOM improves LATTICE by an average of 19.07% across all datasets and outperforms the other baselines as well. The improvements are attributable to the graph-denoising and graph-freezing components of FREEDOM. Denoising the user-item graph reduces the impact of noisy signals from false-positive interactions. The frozen item-item graph in FREEDOM ensures that linked items are relevant to each other, because it is built on the raw multimodal features. On the contrary, the latent item-item graph of LATTICE is dynamically learned by projecting the raw multimodal features into a low-dimensional space; the affinity of two items in that graph depends not only on the raw multimodal features but also on the projectors (i.e., MLPs). 2). Generally, both general MF (i.e., BPR) and graph-based (i.e., LightGCN) models can benefit from multimodal information. For example, VBPR, which leverages multimodal features, outperforms BPR by up to 27.63% on average. Most graph-based multimodal models except MMGCN show better recommendation results than the MF models (i.e., BPR and VBPR) and general CF models. However, we observe that the performance of MMGCN is inferior to that of BPR on the Sports dataset. One potential reason is that the textual and visual representations extracted from the texts and images of products in the Sports dataset are less informative than those of the Baby and Clothing datasets. Another reason may be the fusion mechanism used in MMGCN, which sums up the features from all modalities, making it difficult to discern the prominent contribution of individual modalities. Our proposed FREEDOM, which utilizes LightGCN as its backbone network, gains a significant margin of 42.79% over the original LightGCN on average. 3). Among graph-based multimodal models, compared with MMGCN, GRCN utilizes simplified graph convolutional operations and refines the user-item bipartite graph by assigning weights based on the affinities between user preferences and item contents. It shows better performance on dense datasets, but is inferior on the Clothing dataset. The reason is that assigning low weights to less similar user-item edges impedes the information propagation of GCNs in a sparse dataset. DualGNN and SLMRec utilize auxiliary information to augment the representation learning of users and items. They have comparable performance with each other on the evaluated datasets except Sports, again potentially because the textual and visual representations extracted for Sports are less informative. The evidence is that in Table 4, the improvements achieved by GRCN and DualGNN over LightGCN on Sports are not as significant as those on Baby and Clothing. SLMRec augments the representation from each modality of an item with self-supervised learning and gains better results than GRCN and DualGNN on Sports. Analogously, both LATTICE and FREEDOM augment the representations of items via a latent item-item graph and show competitive results.
Table 4: Overall performance achieved by different recommendation methods in terms of Recall and NDCG. We mark the global best results on each dataset under each metric in boldface and underline the second best. The improvement percentage (improv.) is calculated as the ratio of performance increment from LATTICE to FREEDOM under each dataset and metric. To verify the stability of our method, we conduct experiments across 5 different seeds and find that the improvements are statistically significant at the level of $p < 0.01$ with a paired t-test.

Overall, the experiment results in Table 4 validate the effectiveness of augmenting item representations with multimodal information. Furthermore, freezing the latent item-item graph structure, as in FREEDOM, helps achieve even better performance.
Efficiency (RQ2). We report the memory and training time consumed by FREEDOM and the baselines in Table 5. From the table, we observe: 1). Multimodal models usually utilize more memory than the general CF models, as they need to process modality-aware features. Graph-based models perform convolutional operations on graphs and cost more training time. DualGNN consumes even more time in training because the convolutional operations are performed not only on the user-item graph but also on the user-user relation graph. MMGCN

Ablation Study (RQ3)
In this section, we decouple the proposed FREEDOM and evaluate the contribution of each component to recommendation accuracy. We design the following variants of FREEDOM based on its architecture:
• FREEDOM-D denoises the user-item graph in FREEDOM without freezing the item-item graph.
• FREEDOM-F improves LATTICE merely by the freezing component of FREEDOM, which freezes the item-item graph as introduced in Section 3.
• FREEDOM-R replaces the degree-sensitive edge pruning in FREEDOM with random edge dropout [19].
• FREEDOM-0 ablates the contribution of the multimodal losses in Eq. (10) by setting $\lambda$ to 0.
We report the comparison results in terms of R@20 and N@20 in Fig. 3. The results show that the freezing component contributes significantly to the overall performance of FREEDOM; denoising the user-item graph can further improve the performance.
Freezing the item-item graph (i.e., FREEDOM-F) gains consistent improvements over LATTICE on the three datasets, but the same does not hold for the denoising component alone. We speculate that without the freezing component, the denoising component may harm graph connectivity on sparse graphs and result in performance degradation. FREEDOM-R with random edge dropout shows a slight improvement over FREEDOM-F but is still worse than FREEDOM, demonstrating the effectiveness of degree-sensitive edge pruning. The performance of FREEDOM-0 is comparable to FREEDOM, indicating the effective design of FREEDOM in freezing and denoising graph structures for recommendation. Nevertheless, FREEDOM still benefits slightly from the inclusion of the multimodal losses.

Hyperparameter Sensitivity Study (RQ4)
Multimodal Features. FREEDOM uses the raw textual and visual features extracted from pre-trained models to construct the latent item-item graph. We first study how information from different modalities affects the performance of FREEDOM by adjusting the ratio of visual features from 0.0 to 1.0 with a step of 0.1 when constructing the graph. A ratio of 0.0 on visual features means that the construction of the item-item graph depends only on textual features; conversely, the item-item graph is based only on visual features if the ratio is 1.0. The recommendation results on R@20 and N@20 for the three evaluated datasets are shown in Fig. 4. From the results, we may infer that the textual features are more informative than the visual features in constructing an effective item-item graph.
The Dropout Ratio and Loss Weight. We vary the edge pruning ratio $\rho$ in the denoising component of FREEDOM from 0.6 to 0.9 with a step of 0.1, and vary the loss weight $\lambda$ in {1e-4, 1e-3, 1e-2, 1e-1}. Fig. 5a and 5b show the performance achieved by FREEDOM under different combinations of edge pruning ratios and loss weights on Baby and Clothing, respectively. The results suggest using a high edge pruning ratio for FREEDOM on large graphs. Compared with the edge pruning ratio, the performance of FREEDOM is less sensitive to the setting of the loss trade-off $\lambda$. However, placing a high weight on the multimodal loss might limit the expressive power of FREEDOM and result in performance degradation.

CONCLUSION
In this paper, we experimentally reveal that the graph structure learning in a state-of-the-art multimodal recommendation model (i.e., LATTICE [28]) plays a trivial role in its performance; it is the item-item graph constructed from raw multimodal features that contributes to the recommendation accuracy. Based on this finding, we propose a model that freezes the item-item graph and denoises the user-item graph simultaneously for multimodal recommendation. Through both theoretical and empirical analysis, we demonstrate that freezing the item-item graph in FREEDOM yields various benefits. In denoising, we devise a degree-sensitive edge pruning method to sample the user-item graph, which shows better performance than random edge dropout [19] for recommendation. Finally, we conduct extensive experiments to demonstrate the effectiveness and efficiency of FREEDOM in multimodal recommendation.

A PROOF OF LEMMA 4.1
Let $a'_{ij}$ represent an element in the adjacency matrix of LATTICE and let $a_{ij}$ represent the corresponding element in FREEDOM. We can derive that $a_{ij} = 1 \ge a'_{ij}$. As defined in the paper, $D' = \mathrm{diag}(d'_{11}, \cdots, d'_{NN})$ is a diagonal matrix where each entry on the diagonal is equal to the row-sum of the adjacency matrix, $d'_{ii} = \sum_j a'_{ij}$. The normalized adjacency matrix $S'$ in LATTICE can be denoted as $S' = (D')^{-1/2} A' (D')^{-1/2}$. The function in Eq. (13) is monotonically increasing; to maximize Eq. (13) in the paper, $s'_{ij}$ must also be maximized. From the above equations, we can deduce that the maximum value for either $S$ or $S'$ lies on its diagonal. Therefore, we can derive that $\max_{i,j} s_{ij} = 1/d_{ii} = 1/\sum_j a_{ij} \le \max_{i,j} s'_{ij} = 1/d'_{ii} = 1/\sum_j a'_{ij}$.

Figure 1: Comparison of our proposed (a) FREEDOM and (b) LATTICE [28]. FREEDOM freezes the item-item graph and denoises the user-item graph simultaneously for multimodal recommendation.

Figure 4: Performance of FREEDOM changes with the ratio of visual features in constructing the item-item graph.

Figure 5: Performance of FREEDOM with regard to different loss weights $\lambda$ and edge pruning ratios $\rho$ on the Baby and Clothing datasets.

Table 2: Comparison of the largest eigenvalues of FREEDOM and LATTICE on different datasets (lower is better).
Each element in $S$ of FREEDOM and $S'$ of LATTICE is positive, with the form $a_{ij}/\sqrt{d_{ii} d_{jj}}$ and $a'_{ij}/\sqrt{d'_{ii} d'_{jj}}$, respectively.

Table 3: Statistics of the experimental datasets.
propagates convolutional operators on each modality, resulting in increased training time. 2). Comparing FREEDOM with LATTICE, the proposed FREEDOM reduces the memory cost and training time of LATTICE on Clothing by 6× and 4×, respectively. FREEDOM removes the construction of the latent item-item graph in each training epoch: it pre-builds the graph before training and freezes it during training. With the removal of item-item graph construction from training, FREEDOM is preferable on large graphs.

Table 5: Comparison of FREEDOM against state-of-the-art baselines on model efficiency.
Analogously, we can derive the normalized adjacency matrix $S$ in FREEDOM. We denote the elements in the normalized adjacency matrices $S'$ and $S$ as $s'_{ij}$ and $s_{ij}$, respectively. As stated in the paper, the function $f(s'_{ij}) = s'_{ij}/\big(s'_{ij} + \sum_{k \ne j} s'_{ik}\big)$ is monotonically increasing.