On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis

Multimodal-aware recommender systems (MRSs) exploit multimodal content (e.g., product images or descriptions) as items' side information to improve recommendation accuracy. While most such methods rely on factorization models (e.g., MFBPR) as their base architecture, it has been shown that MFBPR may be affected by popularity bias, meaning that it inherently tends to boost the recommendation of popular (i.e., short-head) items to the detriment of niche (i.e., long-tail) items from the catalog. Motivated by this observation, in this work, we provide one of the first analyses of how multimodality in recommendation could further amplify popularity bias. Concretely, we evaluate the performance of four state-of-the-art MRS algorithms (i.e., VBPR, MMGCN, GRCN, LATTICE) on three datasets from Amazon by assessing, along with recommendation accuracy metrics, performance measures accounting for the diversity of recommended items and the portion of retrieved niche items. To better investigate this aspect, we study the separate influence of each modality (i.e., visual and textual) on popularity bias across different evaluation dimensions. Results, which demonstrate how a single modality may augment the negative effect of popularity bias, shed light on the importance of providing a more rigorous analysis of the performance of such models.


INTRODUCTION
The massive availability of digital data (e.g., images, texts, audio tracks) on the Internet has recently favored the rise of a novel family of recommender systems (RSs), known as multimodal-aware recommender systems (MRSs). With the integration of multimodal features (extracted through pre-trained deep learning models [26,30,51]) as items' side information, MRSs can generate more accurate recommendations than traditional collaborative filtering [17,61,66] (CF) algorithms by providing a countermeasure to common issues such as the sparsity of the user-item matrix and the cold-start scenario [27,49,56], or the inexplicability of users' preferences in the implicit feedback setting [15,22,23,38,39]. The vast majority of MRSs are based upon the well-known matrix factorization with Bayesian personalized ranking (MFBPR) recommendation model. On the one hand, matrix factorization [34] (MF) is a latent-factor approach that maps the users and items of the recommendation system to embeddings in the latent space and is trained to reconstruct the user-item interaction matrix via the dot product of the respective factors. On the other hand, Bayesian personalized ranking [52] (BPR) is an optimization schema that derives from the assumption that, for each user, the predicted score of positive (i.e., interacted) items should be higher than that of negative (i.e., non-interacted) items. Given its simple implementation and efficacy, MFBPR has long constituted the backbone of recommendation algorithms in CF [28,29,44], not only for multimodal recommendation.
Despite the growing interest in popularity bias [5,21] and potential solutions to address it, to date, very limited effort has been put into investigating how multimodal side information in MRSs could amplify the negative effects of popularity bias. To the best of our knowledge, three recent works discuss the concept of bias in multimodal-aware recommendation. First, Liu et al. [40] take into account the bias towards a single modality in multimodal recommendation, and propose a solution based upon causal inference and counterfactual reasoning; however, their definition of bias is conceptually different from that of popularity bias. Then, Kowald and Lacic [35] consider popularity bias in the case of multimedia recommendation datasets (e.g., MovieLens); however, they do not support their findings by testing recommender systems that leverage multimodal features as items' side information. Last, Malitesta et al. [43] investigate how novelty and diversity metrics are influenced in multimodal recommendation, but without a finer-grained analysis of the impact of each single modality.
Driven by the observations above, and differently from the related literature, we propose one of the first analyses of how multimodal-aware recommender systems may amplify popularity bias in the produced recommendation lists. To this aim, we select four established and recent multimodal-aware recommender systems from the literature (i.e., VBPR [27], MMGCN [63], GRCN [62], and LATTICE [66]) and train them on three categories of the Amazon recommendation dataset [46] (i.e., Office, Toys, and Clothing). Then, we evaluate the performance of the models by assessing metrics accounting for recommendation accuracy and popularity bias (the latter is measured through the diversity of recommendation lists and the percentage of retrieved items from the long tail). Finally, to tailor our investigation, we focus on the separate impact of each multimodal side information (i.e., visual or textual) on popularity bias. To conduct this further study, we train the selected recommender systems when integrating either the visual or the textual modality as items' side information, and study the performance on single metrics and across pairs of metrics.
We seek to answer: RQ1. How do multimodal-aware recommendation models behave in terms of accuracy, diversity, and popularity bias? RQ2. What is the influence of each modality (i.e., visual, textual, multimodal) on such performance measures? Results widely show that the integration of a single modality (with respect to the multimodal setting) is capable of amplifying the negative effects of popularity bias, paving the way to additional, more formal investigations on multimodal recommendation. We release the code at: https://github.com/sisinflab/MultiMod-Popularity-Bias.

RELATED WORK
This section outlines the related literature about multimodal learning and popularity bias in recommendation. First, we provide an overview of the most popular and recent advances in multimodal-aware recommendation, from which we select four representative approaches to analyze. Then, we summarize the concept of popularity bias, underlining how our work provides one of the first comprehensive investigations on popularity bias in multimodal recommendation at the granularity of modalities. Multimodal-aware recommendation. In various domains such as fashion [16,17,25], music [20,49,55], food [37,47,58], and micro-video [13,18,63] recommendation, the multimodal content associated with items (e.g., product images and descriptions, or audio tracks) has been demonstrated to greatly enhance the representational power of recommender systems.
With the recent outbreak of graph neural networks in recommendation [28,45,50], several techniques have started integrating multimodality into the user-item bipartite graphs and knowledge graphs [11,28,53,57,60], refining the multimodal representations of users and items through different approaches implementing the message-passing schema. While some early attempts involve simply injecting multimodal item features into the graph-based pipeline [65], more advanced techniques learn separate graph representations for each modality and disentangle users' preferences at the modality level [33,54,62]. Recent approaches focus on uncovering multimodal structural differences among items in the catalog [41,42,66], in some cases by leveraging self-supervised [61,69] and contrastive [64] learning.
In this work, we select four popular and recent approaches in multimodal recommendation, namely, VBPR [27], MMGCN [63], GRCN [62], and LATTICE [66], and test their performance to assess the impact of (multi)modalities on popularity bias. Popularity bias in recommendation. In recommendation, popularity bias refers to the system's tendency to favor popular items (i.e., short-head) at the expense of less popular ones (i.e., long-tail) [2,6,10,12,31]. For instance, Jannach et al. [31] conduct a comprehensive algorithmic comparison across multiple datasets; their findings indicate that existing recommendation methods tend to concentrate mainly on a small fraction of the available item spectrum. More recently, Abdollahpouri et al. [3] delve into this issue using the well-known MovieLens 1M dataset and reveal that over 80% of all ratings are attributed to popular items; their main focus lies in finding ways to strike a balance between ranking accuracy and the coverage of long-tail items.
In multimodal recommendation, only a few recent works discuss popularity bias, but with specific definitions [40], neglecting the impact of multimodal features [35], or focusing on other evaluation metrics [43]. Conversely, our analysis assesses how prone multimodal-aware recommender systems are to push items belonging to the short head, and how the different modalities affect the tendency to amplify popularity bias.

BACKGROUND
This section provides useful background notions for our proposed experimental analysis. To begin with, we introduce the preliminaries about the personalized recommendation scenario. Then, we focus on factorization-based approaches for recommendation (such as MFBPR) and present their building formulations. Finally, we extend the formalism to multimodal-aware recommendation, by considering the four selected approaches (i.e., VBPR, MMGCN, GRCN, and LATTICE) and their rationales.

Preliminaries
Let $\mathcal{U}$ and $\mathcal{I}$ be the sets of users and items in the recommendation system, respectively, with cardinalities $|\mathcal{U}|$ and $|\mathcal{I}|$. Then, let $\mathbf{X} \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{I}|}$ be the user-item interaction matrix, where $x_{ui} = 1$ if user $u$ interacted with item $i$, and $0$ otherwise. On this basis, we also introduce $\mathcal{R} = \{(u, i) \mid x_{ui} = 1\}$ as the set of recorded user-item interactions ($|\mathcal{R}|$ is its cardinality).

Factorization-based approaches
Currently, the majority of state-of-the-art recommender systems in collaborative filtering follow the matrix factorization [34] (MF) rationale. Despite the different building solutions they propose, the core idea is to map users' and items' IDs to embeddings in the latent space. Specifically, we indicate with $\mathbf{e}_u \in \mathbb{R}^d$ and $\mathbf{e}_i \in \mathbb{R}^d$ the embeddings for user $u$ and item $i$, respectively, with $d \ll |\mathcal{U}|, |\mathcal{I}|$. Then, given a user-item pair $(u, i)$, the predicted interaction score is:

$$\hat{x}_{ui} = \mathbf{e}_u^\top \mathbf{e}_i.$$

To learn such embeddings, MF-based approaches are usually coupled with Bayesian personalized ranking [52] (BPR). This optimization method assumes that the predicted interaction score for users and their positive (i.e., interacted) items should be higher than the predicted interaction score for users and their negative (i.e., non-interacted) items. Concretely, let $\mathcal{T} = \{(u, i, j) \mid x_{ui} = 1 \wedge x_{uj} = 0\}$ be the set of triples, where each triple consists of a user, a positive item, and a negative item. Bayesian personalized ranking seeks to optimize the following objective function:

$$\max_{\Theta} \sum_{(u, i, j) \in \mathcal{T}} \ln \sigma(\hat{x}_{ui} - \hat{x}_{uj}) - \lambda \|\Theta\|_2^2,$$

where $\Theta$ is the vector containing all the model's parameters (e.g., in the case of MF, $\mathbf{e}_u$ and $\mathbf{e}_i$), $\lambda$ is a regularization weight, and $\sigma(\cdot)$ is the sigmoid function.
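As a concrete illustration, the MF scoring and the BPR objective above can be sketched in a few lines of plain Python (without regularization); all names and toy values are illustrative and not tied to any specific implementation:

```python
# Minimal sketch of MFBPR: dot-product scoring plus the BPR objective
# (negative mean log-sigmoid over (user, positive, negative) triples).
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bpr_loss(user_emb, item_emb, triples):
    """Average of -ln(sigmoid(x_ui - x_uj)) over (u, i, j) triples,
    where i is a positive (interacted) item and j a negative one."""
    loss = 0.0
    for u, i, j in triples:
        x_ui = dot(user_emb[u], item_emb[i])  # predicted score for positive
        x_uj = dot(user_emb[u], item_emb[j])  # predicted score for negative
        loss += -math.log(1.0 / (1.0 + math.exp(-(x_ui - x_uj))))
    return loss / len(triples)

random.seed(0)
d = 8  # latent dimension, d << |U|, |I|
user_emb = {u: [random.gauss(0, 0.1) for _ in range(d)] for u in range(3)}
item_emb = {i: [random.gauss(0, 0.1) for _ in range(d)] for i in range(5)}
triples = [(0, 1, 2), (1, 0, 4), (2, 3, 2)]  # (user, positive, negative)
print(round(bpr_loss(user_emb, item_emb, triples), 4))
```

Minimizing this loss (equivalently, maximizing the BPR objective) pushes positive scores above negative ones; with near-zero random embeddings the loss starts close to ln 2.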

Factorization-based approaches leveraging multimodal side information
We present the formulations of four state-of-the-art multimodal-aware recommender systems (MRSs): VBPR [27], MMGCN [63], GRCN [62], and LATTICE [66]. Before diving into their approaches, we introduce some additional formalism. Besides $\mathbf{e}_u$ and $\mathbf{e}_i$, hereafter referred to as collaborative user and item embeddings, we also introduce $\mathbf{f}_u$ and $\mathbf{f}_i$ as the multimodal embeddings for user $u$ and item $i$. Moreover, we indicate with $\mathcal{M}$ the set of available modalities (e.g., visual, textual, audio), and we use $m$ as an embedding superscript to denote that the embedding refers to modality $m \in \mathcal{M}$ (e.g., $\mathbf{f}_i^m$ stands for the $m$-th multimodal embedding of item $i$). VBPR. Visual Bayesian personalized ranking [27] (dubbed VBPR) adopts visual features extracted from product images as items' side information in MFBPR. The authors introduce, along with the user and item collaborative embeddings, additional visual user and item embeddings, where the latter are obtained as the activation of the penultimate layer of a pre-trained convolutional neural network. Then, the collaborative and visual embeddings are used to compute a collaborative- and a visual-aware prediction of the interaction score, which are eventually summed to obtain the final prediction. In this work, we follow [66] and adapt VBPR to multimodality by concatenating the visual and textual item features to generate a unique multimodal representation of the item:

$$\mathbf{f}_i = g(\mathbf{f}_i^v \,\|\, \mathbf{f}_i^t), \qquad \hat{x}_{ui} = \mathbf{e}_u^\top \mathbf{e}_i + \mathbf{f}_u^\top \mathbf{f}_i,$$

where $\|$ denotes concatenation and $g$ is a projection function such that the latent dimensions of the multimodal user and item embeddings match.
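The multimodal adaptation of VBPR described above can be sketched as follows; this is a toy illustration under our own naming, where a fixed random matrix stands in for the learned projection $g$:

```python
# Sketch of a VBPR-style score: collaborative dot product plus a multimodal
# term computed on the projected concatenation of visual and textual features.
# All names, sizes, and the random projection weights are illustrative.
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(features, weight):
    """Linear projection g(.) mapping concatenated features to dimension d."""
    return [dot(row, features) for row in weight]

def vbpr_score(e_u, e_i, f_u, f_i_visual, f_i_textual, weight):
    f_i = project(f_i_visual + f_i_textual, weight)  # concatenation, then g(.)
    return dot(e_u, e_i) + dot(f_u, f_i)             # collaborative + multimodal

random.seed(1)
d, dv, dt = 4, 6, 5          # latent and (toy) visual/textual feature sizes
weight = [[random.gauss(0, 0.1) for _ in range(dv + dt)] for _ in range(d)]
e_u = [0.1] * d; e_i = [0.2] * d; f_u = [0.1] * d
f_i_v = [random.random() for _ in range(dv)]
f_i_t = [random.random() for _ in range(dt)]
print(round(vbpr_score(e_u, e_i, f_u, f_i_v, f_i_t, weight), 4))
```

In the actual model, the projection and all embeddings are learned jointly with the BPR objective; here they are fixed only to make the scoring path concrete.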
MMGCN. One of the first approaches leveraging the representational power of graph convolutional networks (GCNs) with multimodal content is the multimodal graph convolution network for recommendation [63] (dubbed MMGCN). By designing one GCN for each modality, the model learns the different preferences users have towards each representation of the items. Finally, to fuse all multimodal representations into one for both user and item embeddings, the authors adopt element-wise addition, and the predicted interaction score is calculated via the dot product:

$$\mathbf{e}_u = \rho\left(\{g^m(u)\}_{m \in \mathcal{M}}\right), \qquad \hat{x}_{ui} = \mathbf{e}_u^\top \mathbf{e}_i,$$

where $\rho$ is a combination function (here, element-wise addition) and $g^m$ is the GCN-based function for modality $m$. We report only the user-side formulation for the sake of space. GRCN. Similarly to MMGCN, the graph-refined convolutional network for multimedia recommendation [62] (dubbed GRCN) utilizes a GCN architecture to update the user and item embeddings. Specifically, the adjacency matrix entries are refined by pruning the noisy user-item interactions according to the preference of users towards each item's modality. Collaborative and multimodal versions of the user and item embeddings are eventually combined through concatenation to estimate the interaction score via their dot product:

$$\hat{x}_{ui} = (\mathbf{e}_u \,\|\, \mathbf{f}_u)^\top (\mathbf{e}_i \,\|\, \mathbf{f}_i).$$

Again, we report only the user-wise formulation for lack of space. LATTICE. The latent structure mining method for multimodal recommendation [66] (dubbed LATTICE) performs graph structure learning on multiple modality-aware item-item graphs (one for each modality). The obtained adjacency matrices are aggregated into a single latent item-item graph, over which the item embeddings are propagated:

$$\tilde{\mathbf{e}}_i = g(\mathbf{e}_i),$$

where $g$ is a LightGCN [28] architecture performing the graph structure learning stated above.
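To make the last step concrete, a minimal sketch of the modality-aware item-item graph construction at the core of LATTICE might look as follows, assuming cosine similarity and top-$k$ sparsification (toy features and names; the actual model learns the graph weights and aggregates one graph per modality):

```python
# Illustrative item-item kNN graph for one modality: cosine similarities
# between item features, keeping only each item's top-k neighbors.
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def knn_graph(features, k):
    """Return {item: [(neighbor, similarity), ...]} keeping top-k per item."""
    graph = {}
    for i, fi in enumerate(features):
        sims = [(j, cosine(fi, fj)) for j, fj in enumerate(features) if j != i]
        sims.sort(key=lambda t: t[1], reverse=True)
        graph[i] = sims[:k]
    return graph

feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy item features
g = knn_graph(feats, k=1)
print(g[0][0][0])  # → 1 (item 1 is the nearest neighbor of item 0)
```

Sparsifying the dense similarity matrix this way keeps only the most structurally related items, which is what allows propagation over the graph to refine the item embeddings.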

PROPOSED ANALYSIS
In this section, we present the details of our analysis. Initially, we report on the datasets used, describing the methodologies employed for extracting multimodal features. Subsequently, we introduce and formally define the evaluation metrics employed, encompassing accuracy, diversity, and popularity bias. Finally, we provide a thorough summary of the reproducibility information for our study, detailing the methods used for dataset splitting and filtering as well as the strategy for hyperparameter search.

Datasets
The multimodal recommender systems are tested on three popular [17,33,66,69] datasets from the Amazon catalog [46]: (a) Office Products (Office), (b) Toys & Games (Toys), and (c) Clothing, Shoes & Jewelry (Clothing). All three datasets provide both images and descriptions for each available item. Specifically, we utilize the pre-extracted 4,096-dimensional visual features [24], which are made publicly available¹. For the textual modality, we follow the existing literature [66] and aggregate each item's title, description, categories, and brand, generating textual embeddings through sentence transformers [51]. The resulting features are 1,024-dimensional embeddings. Additional dataset information can be found in Table 1.

Evaluation metrics
In the proposed study, we refer to various metrics that may bring out additional insights which have not yet been investigated in multimodal recommendation. Indeed, we do not solely rely on accuracy metrics (i.e., Recall and nDCG) but also on diversity (i.e., item coverage) and popularity bias (i.e., APLT) metrics. The metrics listed hereinafter are calculated on top-$k$ recommendation lists.
Recall. The Recall assesses the system's capacity to retrieve relevant items in the recommendation list, highlighting the need for thorough coverage of the list of user interactions [7]:

$$\text{Recall@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{|\text{Rel}_u@k|}{|\text{Rel}_u|},$$

where $\text{Rel}_u$ indicates the set of relevant items for user $u$, while $\text{Rel}_u@k$ is the set of relevant recommended items in the top-$k$ list.

¹ https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html
Normalized discounted cumulative gain. The normalized discounted cumulative gain (nDCG) considers both the relevance and the ranking position of recommended products, taking into account the varied degrees of relevance:

$$\text{nDCG@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{\text{DCG}_u@k}{\text{IDCG}_u@k},$$

where $\text{DCG}_u@k = \sum_{p=1}^{k} \frac{2^{rel_{u,p}} - 1}{\log_2(p + 1)}$ quantifies the cumulative gain of relevance scores through the recommended list, with $rel_{u,p} \in \text{Rel}_u$, and $\text{IDCG}_u@k$ represents the cumulative gain of relevance scores for a perfect (ideal) recommender system. Item coverage. The item coverage (abbreviated "iCov" in the following) measures the item-side coverage of the recommendation lists. A higher item coverage suggests that a larger fraction of the item space is being explored and recommended to users, implying a more comprehensive coverage of user preferences and potentially a richer recommendation experience. In particular, we have:

$$\text{iCov@}k = \frac{\left|\bigcup_{u \in \mathcal{U}} \hat{I}_u@k\right|}{|\mathcal{I}|},$$

where $\hat{I}_u@k$ is the list of top-$k$ recommended items for user $u$.
Average percentage of long-tail items. The average percentage of long-tail items (APLT) assesses the presence of popularity bias in recommendation systems [2]. Popularity bias refers to the tendency of recommendation algorithms to prioritize popular or mainstream items over less well-known or niche items; this bias can limit users' exposure to diverse and personalized recommendations. The metric measures the percentage of items belonging to the medium/long-tail distribution in the recommendation lists, averaged over all users:

$$\text{APLT@}k = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{|\{i \in \hat{I}_u@k \mid i \in {\sim}\Phi\}|}{k},$$

where $\Phi$ is the set of items belonging to the short-head distribution, while ${\sim}\Phi$ is the set of items from the medium/long-tail distribution. Note that we integrate the evaluation of APLT along with iCov (introduced above) because the latter may help provide a complete interpretation of the former: following their definitions and formulations, the two metrics are conceptually related.
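The four metrics above can be sketched on toy recommendation lists as follows (binary relevance; all names and data are illustrative):

```python
# Compact sketch of Recall@k, nDCG@k, iCov@k, and APLT@k on top-k lists.
# short_head below plays the role of Φ; its complement is the medium/long tail.
import math

def recall_at_k(recs, relevant, k):
    return sum(1 for i in recs[:k] if i in relevant) / len(relevant)

def ndcg_at_k(recs, relevant, k):
    dcg = sum(1.0 / math.log2(p + 2)          # binary relevance: 2^1 - 1 = 1
              for p, i in enumerate(recs[:k]) if i in relevant)
    ideal = sum(1.0 / math.log2(p + 2) for p in range(min(k, len(relevant))))
    return dcg / ideal

def icov_at_k(all_recs, catalog, k):
    covered = set(i for recs in all_recs for i in recs[:k])
    return len(covered) / len(catalog)

def aplt_at_k(all_recs, short_head, k):
    return sum(sum(1 for i in recs[:k] if i not in short_head) / k
               for recs in all_recs) / len(all_recs)

recs_u1, recs_u2 = [0, 3, 1], [0, 2, 4]       # top-3 lists for two users
assert recall_at_k(recs_u1, {1, 3}, 3) == 1.0
print(round(icov_at_k([recs_u1, recs_u2], range(6), 3), 2))       # → 0.83
print(round(aplt_at_k([recs_u1, recs_u2], {0, 1}, 3), 2))         # → 0.5
```

The last two lines show the iCov/APLT coupling discussed above: both are computed from the same top-$k$ lists, one item-side over the catalog and one averaged user-side against the short-head set.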
Metrics value interpretation. An ideal recommender system should increase all the metrics listed above, according to the principle "higher is better", to boost accuracy and diversity while reducing the popularity bias of the produced recommendations. Nevertheless, with the current work, we try to unveil whether and why multimodal-aware recommender systems are affected by popularity bias. Thus, in the following, we will focus on those settings in which accuracy is high while diversity and APLT are low (i.e., settings exhibiting strong popularity bias, according to the metrics' definitions).

Reproducibility
We investigate the models' behavior in three different settings: (i) visual modality, in which we employ only visual features, (ii) textual modality, in which we employ only textual features, and (iii) multimodal, where both modalities are considered and combined.
In the first step, we evaluate the models in the multimodal setting, which is the original setting of each tested approach. Then, we focus on quantifying the influence of each single modality on the multimodal scenario in terms of accuracy, diversity, and popularity bias. Furthermore, to ensure the reproducibility of our work, in the following, we provide comprehensive details regarding the preprocessing and splitting of the datasets, as well as the tuning and evaluation of the models.
The datasets are filtered using the $k$-core strategy, where we set $k$ to 5. Subsequently, we employ an 80%/20% train-test hold-out strategy to split each dataset. During the hyper-parameter tuning phase, we further divide the test set by removing 50% of its instances for validation, specifically evaluating the results using the Recall@20 metric (as in the original works). In terms of model training, we set the maximum number of epochs to 200 and select the model weights from the epoch that yields the best performance on the validation set.
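A toy sketch of this preprocessing pipeline ($k$-core filtering followed by the hold-out split) could read as follows; function names, the seed, and the data are illustrative, not the paper's actual code:

```python
# Iterative k-core filtering (every remaining user and item keeps at least
# k interactions) followed by a seeded 80/20 hold-out split.
import random
from collections import Counter

def k_core(interactions, k=5):
    """Iteratively drop users/items with fewer than k interactions."""
    while True:
        users = Counter(u for u, _ in interactions)
        items = Counter(i for _, i in interactions)
        kept = [(u, i) for u, i in interactions
                if users[u] >= k and items[i] >= k]
        if len(kept) == len(interactions):  # fixed point reached
            return kept
        interactions = kept

def holdout_split(interactions, test_ratio=0.2, seed=42):
    shuffled = interactions[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# 5 users x 5 items fully interacting, plus one user with a single interaction
data = [(u, i) for u in range(5) for i in range(5)] + [(9, 0)]
filtered = k_core(data, k=5)
train, test = holdout_split(filtered)
print(len(filtered), len(train), len(test))  # → 25 20 5
```

Note the iteration in `k_core`: removing a cold user can push an item below the threshold, so the filter must run to a fixed point rather than in a single pass.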
The code is implemented in Elliot [4]. Note that the explored hyper-parameter values are not entirely aligned with the ones in the original papers and codes. Indeed, we want to tune the selected baselines on an extensive, shared set of hyper-parameter values across all models for the sake of fair comparison.

RESULTS AND DISCUSSION
In this section, we answer the following research questions (RQs): RQ1. How do the selected multimodal-aware recommendation models behave in terms of accuracy, diversity, and popularity bias? Section 5.1 investigates the recommendation performance in terms of accuracy (i.e., Recall, nDCG), diversity (i.e., iCov), and popularity bias (i.e., APLT). Note that, for the sake of completeness, we also report the performance of a recommender system generating recommendations in a random manner (i.e., Random) or based upon the most popular items in the catalog (i.e., MostPop); additionally, we train and evaluate MFBPR, which is the base model of the other multimodal baselines. We regard the performance of Random, MostPop, and MFBPR as a reference for the multimodal-aware recommender systems we want to analyze. RQ2. What is the influence of each modality setting (i.e., visual, textual, multimodal) on such performance measures? Section 5.2 takes a step further by analyzing how each modality (i.e., visual, textual, and multimodal) influences accuracy, diversity, and popularity bias; the evaluation is conducted both on single metrics and across pairs of metrics.

Recommendation accuracy, diversity, and popularity bias (RQ1)
The results for the accuracy, diversity, and popularity bias metrics are reported in Table 2. The measured values refer to top@10, top@20, and top@50 recommendation lists. In the following, we discuss the obtained results considering the three metric families separately.
Accuracy. Overall, LATTICE is the top-performing model, in alignment with the recent literature [66]. Indeed, its ability to learn more refined item embeddings based upon the multimodal item-item similarities may positively impact the accuracy performance. Conversely, VBPR's outstanding performance with respect to the other multimodal approaches comes as quite a surprise, considering that more complex and recent models leveraging graph neural networks (such as MMGCN and GRCN) do not outperform it.
Considering the performance at the dataset level, the most significant variation in metrics between LATTICE and VBPR is observed on Toys and Clothing, while the difference is reduced on Office. Notably, Toys and Clothing store three and four times more interactions than Office, respectively, but they are much sparser. This emphasizes LATTICE's ability to recommend more accurate items despite the higher dataset sparsity. Assessing the other models' performance, MMGCN works exceptionally well on Toys but shows the lowest performance as the number of interactions and the sparsity increase. GRCN, in contrast, excels with highly sparse data, exhibiting an opposite trend to MMGCN.
From a metric-wise analysis, LATTICE outperforms VBPR in correctly predicting relevant items (high Recall) that are also more likely to appear at the top of the recommendation lists (nDCG). However, the same trend is not as evident on the Recall, partly due to its normalization with respect to the $k$ recommended items, which can lead to a smaller difference between LATTICE and VBPR as $k$ increases. Diversity. As far as recommendation diversity (i.e., iCov) is concerned, the worst-performing model is MMGCN, since its iCov is, in any case, far below that of the other models. For instance, when taking into account Office, MMGCN's iCov is only slightly better than MostPop's (whose item diversity is, by construction, the lowest), demonstrating a restricted ability to engage diverse items in the recommendation lists. Unexpectedly, the second-worst model is LATTICE, even if its performance is still closer to the other approaches' than MMGCN's. Indeed, we observe that while MMGCN suffers from poor accuracy due to the lack of item diversity, LATTICE can deal with both accuracy and diversity.
As an opposite (but noteworthy) trend, we underline that VBPR and GRCN are generally capable of recommending a wider portion of items than MMGCN and LATTICE, independently of the selected top-$k$. Overall, their iCov values are quite comparable to those of Random, which should provide (by definition) the highest item coverage from the catalog. We intend to further investigate (and justify) this aspect by assessing the effects of popularity bias. Popularity bias. In terms of popularity bias (i.e., APLT), the worst and second-worst models are once again MMGCN and LATTICE (the former on Office and Clothing, the latter on Toys). As already discussed in Section 4.2, it makes sense to conceptually bind iCov and APLT. When assessing MMGCN's performance on Office, it becomes clear that the model recommends only a few items (see again the iCov) while achieving good results in terms of accuracy; this suggests that the user-item interactions in Office may be biased towards popular items, a phenomenon amplified by the dataset's small size. The same does not hold on Clothing, where MMGCN, usually prone to popularity bias, also performs poorly in terms of accuracy. Conversely, LATTICE can recommend popular items, thus pushing its accuracy performance without amplifying the popularity bias phenomenon as much as MMGCN does. Indeed, even if LATTICE's iCov is the second-worst across all the datasets, the metric is always close to the best models in terms of diversity. Finally, VBPR and GRCN confirm their ability (already observed on the diversity measure) to also tackle popularity bias in all experimental settings. Particularly, while VBPR's performance is only slightly improved with respect to MFBPR in terms of iCov and APLT (the two approaches are almost similar), GRCN's results are quite remarkable. It might be the case that its graph edge pruning technique (driven by multimodal signals) reduces the influence of noisy user-item interactions (i.e., redundant edges that might involve popular items), thus helping to diversify the recommendations by also considering several long-tail items. Summary. In a standard multimodal setting, LATTICE stands out for its accuracy performance and ability to handle dataset sparsity, but at the cost of amplifying popularity bias; MMGCN struggles with diversity, exhibits strong popularity bias, and sacrifices accuracy in certain scenarios; VBPR and GRCN, in different manners, better manage all the metrics by finding the right compromise among them.

Table 2: Results in terms of recommendation accuracy (Recall, nDCG), diversity (iCov), and popularity bias (APLT). For accuracy metrics, ↑ means better performance, while ↓ means less diversity and more popularity bias. We remind that, while the iCov and APLT metrics would generally adhere to the principle of "higher is better" (↑) for an ideal recommender system, in this work we consider the opposite, as we want to emphasize which models perform worst in terms of diversity and popularity bias.

Modalities influence on recommendation performance (RQ2)
While the previous section answered how multimodal recommender systems perform in terms of accuracy, diversity, and popularity bias when leveraging all the modalities, in the following, we discuss the influence of each single modality on the performance. We consider two evaluation dimensions along which the modalities' influence is assessed: (i) on accuracy, diversity, and popularity bias separately, and (ii) on pairs of metrics, to investigate their joint variations.
Modalities influence on the single metric. Figure 2 displays the influence of each modality, calculated as the percentage variation with respect to the multimodal baseline, on the top@20 recommendation lists. We select the Recall (Figure 2a), iCov (Figure 2b), and APLT (Figure 2c) for accuracy, diversity, and popularity bias, respectively. As regards the accuracy performance (Figure 2a), we notice that the trend is not consistent across all the datasets and models. Particularly, when considering Office, we observe that only VBPR and LATTICE fully exploit multimodality (indeed, their performance decreases when the modalities are injected separately); at an opposite level, for MMGCN, the visual modality slightly improves over the multimodal setting, while the textual modality even worsens it; then, GRCN achieves better performance on both the visual and textual modalities, suggesting that this approach may not take advantage of the multimodal configuration. On the Toys dataset, the textual-only setting generally improves the performance, bringing important information to the model's learning process. The model benefiting from the single modality the most is MMGCN, which shows an improvement of at least 20% on both visual and textual. For the remaining models, the trend is quite stable, with the textual and visual modalities improving and reducing the performance, respectively. Finally, we observe that Clothing is the only dataset showing consistent trends. Indeed, the visual modality reduces the Recall while the textual one increases it (with the only exception of VBPR, whose percentage variation is negligible). Differently from the accuracy analysis, we recognize a quasi-stable trend in the performance variation measured for the diversity metric (Figure 2b). Considering the Office dataset, each modality's contribution is generally irrelevant except for MMGCN, for which the visual modality slightly improves the coverage across the whole recommendation list, while the textual one worsens the performance by a large margin. Assessing the trend on Toys, both modalities decrease the coverage performance of the models when injected separately into the recommendation pipeline; remarkably, MMGCN is once again the model most affected by the presence of a single modality, but this time the coverage performance widely deteriorates because of both the visual and textual modalities. Finally, on Clothing, both modalities lower the models' item coverage, with specific reference to the visual modality.
As the last part of our analysis, we take into account each modality's contribution to the popularity bias dimension (Figure 2c). Starting from Office, we notice that both modalities are prone to reinforce popularity bias if injected singularly, with the only exception of LATTICE, whose textual modality limits the popularity bias (the APLT increases); this is interesting as we remind that LATTICE is the second-worst model in terms of popularity bias, but using only the textual modality reduces its accuracy performance and the influence of popular items in the recommendation list. When it comes to the Toys dataset, every single modality reinforces the popularity bias of MMGCN and GRCN; for VBPR, the visual and textual modalities amplify and reduce the bias, respectively, while for LATTICE both the visual and textual modalities limit the popularity bias. Finally, on Clothing, both modalities increase the popularity bias of the models (except for the textual one on VBPR and LATTICE).
Modalities cross-influence on metric pairs. To conclude, we discuss the cross-influence of each modality setting (i.e., visual, textual, and multimodal) on pairs of metrics. In this respect, we display (Figure 3) the joint trend of (a) accuracy and popularity bias (i.e., Recall vs. APLT), (b) accuracy and diversity (i.e., Recall vs. iCov), and (c) diversity and popularity bias (i.e., iCov vs. APLT). We only report the results on Clothing for top@20 recommendations. In detail, VBPR and MMGCN are the models least affected by each specific modality, since the performance measures assessed on visual and textual are generally aligned with the multimodal reference. Regarding LATTICE, we notice that the textual modality has a major influence on accuracy with respect to popularity bias and diversity. Indeed, the textual modality improves the Recall without having a relevant effect in terms of iCov and APLT; conversely, the visual modality reduces the accuracy while jointly worsening the diversity and the popularity bias. Finally, when considering GRCN, we observe that the multimodal setting reduces the popularity bias without affecting the accuracy and diversity. Summary. In a single-modality setting, the textual modality improves the accuracy, while both modalities negatively affect the diversity and reinforce the popularity bias. When evaluating the modalities' influence across metric pairs, the textual modality has a significant influence on accuracy but minimal effects on diversity and popularity bias; conversely, the visual modality reduces accuracy and jointly worsens the popularity bias and diversity.

CONCLUSION AND FUTURE WORK
Motivated by the assumption that factorization models in recommendation (such as MFBPR) are affected by popularity bias, in this work, we provided one of the first systematic analyses of how multimodal-aware recommender systems (largely built upon MFBPR) further amplify the recommendation of popular items. After having selected four state-of-the-art multimodal recommender systems, namely, VBPR, MMGCN, GRCN, and LATTICE, we proposed an exhaustive experimental study involving three datasets from the Amazon catalog, four metrics spanning three evaluation dimensions (i.e., accuracy, diversity, and popularity bias), and three modality settings (i.e., multimodal, visual-only, and textual-only). Results demonstrated that, in a standard multimodal setting, VBPR and GRCN can strike a better compromise among all evaluated metrics than MMGCN and LATTICE; furthermore, the separate injection of the visual and textual modalities can improve the accuracy but negatively impact the diversity and popularity bias. In conclusion, a complementary investigation regarding the modalities' influence on metric pairs outlined that the textual modality has a considerable impact on accuracy but little effect on diversity and popularity bias, whereas the visual modality reduces accuracy while exacerbating popularity bias and limiting the diversity. Such findings pave the way to a more complete study of the performance of these models and other, more recent approaches in multimodal recommendation.

Figure 1 :
Figure 1: Short-head and long-tail items from the Office dataset in the Amazon catalog.

Figure 2 :
Figure 2: Percentage variation on the (a) Recall, (b) iCov, and (c) APLT when training the multimodal recommender systems with either visual or textual modalities. The 0% line stands for the reference performance provided by the multimodal version of the model. All results refer to the top@20 recommendation lists.

Figure 3 :
Figure 3: Performance analysis on Clothing when comparing (a) Recall vs. APLT, (b) Recall vs. iCov, and (c) iCov vs. APLT for different modality settings involving the multimodal, visual, and textual modalities. Metrics are on top@20.

Table 1 :
Statistics of the tested datasets.