MMRec: Simplifying Multimodal Recommendation

This paper presents MMRec, an open-source toolbox for multimodal recommendation. MMRec simplifies and canonicalizes the process of implementing and comparing multimodal recommendation models. Its objective is to provide a unified and configurable arena that minimizes the effort of implementing and testing multimodal recommendation models. MMRec supports models ranging from traditional matrix factorization to modern graph-based algorithms, and is capable of fusing information from multiple modalities simultaneously. Our documentation, examples, and source code are available at \url{https://github.com/enoche/MMRec}.


INTRODUCTION
Multimodal recommendation models are a rising trend in the research community for the following reasons: • The prevalence of multimodal information (e.g., images, texts, and videos); • Leveraging multimodal information in recommendation can alleviate the sparsity of interaction data [6, 7, 8, 11]; • The maturity of recent research on multimodal learning in the NLP and CV domains [1, 3].
Consequently, the multimodal recommendation paradigm has gradually become an indispensable cornerstone of digital media platforms. This evolution empowers these platforms to deliver tailored recommendations to users by simultaneously scrutinizing historical user interactions and the multifaceted modalities of items [6]. Compared with conventional recommender systems [2, 5, 9, 10] that solely leverage user-item interactions for recommendation, multimodal models involve preprocessing information from multiple modalities (e.g., k-core user/item filtering, data splitting, vectorizing multimodal information, aligning item IDs with their multimodal information), fusing multimodal information, etc. On the one hand, these tedious processes impede the progress of multimodal recommendation research. On the other hand, the wide variety of preprocessing methods makes it difficult to reproduce a model's performance and compare it fairly with others.
To address these issues, we present MMRec, an open-source toolbox that simplifies research on multimodal recommendation. MMRec provides a full-stack toolbox covering data preprocessing, multimodal recommendation models, multimodal information fusion, and performance evaluation, minimizing the cost of implementing and comparing novel models or baselines. The objective of MMRec is to establish a benchmarking system that ensures fair comparison of multimodal models in an efficient and non-laborious manner. The toolbox is highly configurable and user-friendly.
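To make the preprocessing burden concrete, the following is a minimal sketch of the k-core filtering step mentioned above, which iteratively drops users and items with fewer than k interactions until the interaction table stabilizes. The pandas usage and the column names user_id/item_id are assumptions for illustration; this is not MMRec's actual implementation.

```python
import pandas as pd

def k_core_filter(interactions: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Iteratively drop users/items with fewer than k interactions.

    `interactions` is assumed to have 'user_id' and 'item_id' columns;
    illustrative sketch only, not MMRec's internal code.
    """
    while True:
        user_counts = interactions["user_id"].value_counts()
        item_counts = interactions["item_id"].value_counts()
        keep = (
            interactions["user_id"].map(user_counts).ge(k)
            & interactions["item_id"].map(item_counts).ge(k)
        )
        if keep.all():
            return interactions
        interactions = interactions[keep]
```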

ARCHITECTURE AND KEY FEATURES
Figure 1 presents the modules of MMRec. The inputs to MMRec are the raw multimodal data and user-item interaction files.
Data Encapsulation. MMRec first preprocesses the raw data and encapsulates user interactions and multimodal information into PyTorch DataLoaders. The format of the raw data is consistent with the Amazon Review Data 1. MMRec performs k-core filtering to retain the users and items with at least k interactions, and aligns the multimodal information with the retained items. It then splits the interactions into training/validation/test sets. The raw multimodal features are vectorized into numeric representations using pre-trained multimodal models, such as Transformers.
Trainer. MMRec provides various optimizers to train the models. Both unimodal and multimodal models are supported in MMRec. In MMRec, information from each modality can be easily fused for multimodal recommendation. In its current version, MMRec supports four popular modalities: text, image, audio, and video. MMRec unifies the training interface for all models. Customized models are merely required to implement two functions, sketched in the example below: • calculate_loss: the main part of the model, which defines how the loss is computed from the model's forward pass; • full_sort_predict: predicts the ranking of items for users.
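As a hedged illustration of this two-function interface, the sketch below shows how a minimal matrix-factorization model with a BPR loss might expose calculate_loss and full_sort_predict. The plain nn.Module base class, the BPR loss, and the batch layout are assumptions made for the sketch and are not guaranteed to match MMRec's exact API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMFRecommender(nn.Module):
    """Illustrative matrix-factorization model exposing the two hooks that
    MMRec requires from customized models: calculate_loss and
    full_sort_predict. Everything else here is an assumption for the sketch."""

    def __init__(self, n_users: int, n_items: int, dim: int = 64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def calculate_loss(self, users, pos_items, neg_items):
        # BPR loss over (user, positive item, negative item) triples.
        u = self.user_emb(users)
        pos_scores = (u * self.item_emb(pos_items)).sum(-1)
        neg_scores = (u * self.item_emb(neg_items)).sum(-1)
        return -F.logsigmoid(pos_scores - neg_scores).mean()

    def full_sort_predict(self, users):
        # Score every item for the given users; higher scores rank earlier.
        return self.user_emb(users) @ self.item_emb.weight.T
```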
Evaluation. This module features a wide set of commonly used metrics for recommender systems, such as Recall, NDCG, and MAP.
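To make these metrics concrete, here is a minimal, illustrative computation of Recall@K and NDCG@K for a single user from a ranked recommendation list. It follows the standard definitions of the metrics and is not MMRec's own evaluator.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    # Fraction of the user's relevant items that appear in the top-k list.
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items) if relevant_items else 0.0

def ndcg_at_k(ranked_items, relevant_items, k):
    # Discounted cumulative gain of the top-k list, normalized by the ideal DCG.
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: items 3 and 7 are relevant; the model ranks items [3, 5, 7, 9].
print(recall_at_k([3, 5, 7, 9], {3, 7}, k=3))  # 1.0 (both relevant items hit)
print(ndcg_at_k([3, 5, 7, 9], {3, 7}, k=3))
```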
It is worth noting that all modules can be customized and configured by modifying the configuration files. All changes are reflected and loaded through the config module. The config module also supports grid search over model hyperparameters. The results from all hyperparameter combinations are summarized and reported to end-users after training. Reproducibility of models is ensured by resetting the random seed and the raw data in the DataLoader for each hyperparameter combination. A sketch of such a grid-search loop is given below.
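The following is a hedged sketch of this kind of configuration-driven grid search with per-combination seed resetting. The config dictionary, the train_and_evaluate helper, and the metric key "recall" are hypothetical placeholders that only illustrate the workflow described above, not MMRec's actual interface.

```python
import itertools
import random

import numpy as np
import torch

# Hypothetical configuration: scalar entries are fixed, list entries are searched over.
config = {
    "seed": 2023,
    "learning_rate": [1e-3, 1e-4],
    "embedding_size": [64, 128],
}

def train_and_evaluate(params):
    # Placeholder for rebuilding the DataLoader, training a model, and
    # returning validation metrics; here it returns a dummy recall value.
    return {"recall": random.random()}

grid_keys = [k for k, v in config.items() if isinstance(v, list)]
results = []
for values in itertools.product(*(config[k] for k in grid_keys)):
    params = {**config, **dict(zip(grid_keys, values))}
    # Reset seeds for every combination so each run is reproducible.
    random.seed(params["seed"])
    np.random.seed(params["seed"])
    torch.manual_seed(params["seed"])
    results.append((dict(zip(grid_keys, values)), train_and_evaluate(params)))

# Report every combination to the user, best validation recall first.
for combo, metrics in sorted(results, key=lambda r: r[1]["recall"], reverse=True):
    print(combo, metrics)
```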

COMPARISON TO RELATED WORKS
To the best of our knowledge, Cornac [4] is the only other open-source library that supports multimodal information in recommendation. Although 40+ algorithms are implemented in Cornac, most of them are general recommender systems that do not utilize multimodal information. Furthermore, Cornac is limited in its fusion of modalities at the current stage: it cannot integrate multimodal information (i.e., text, image, audio, video) from more than one modality. The current version of MMRec supports 10+ multimodal recommendation models and a broad range of general collaborative filtering models. The detailed comparison between Cornac and MMRec is presented in Table 1.


Figure 1: The architecture of MMRec. MMRec consists of four modules, ranging from raw data preprocessing to model performance evaluation. To ensure model reproducibility and performance consistency, MMRec consumes raw data as input.

Table 1: Comparison of multimodal recommendation frameworks.