Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation

Multi-modal recommendation systems, which integrate diverse types of information, have gained widespread attention in recent years. However, compared to traditional collaborative filtering-based multi-modal recommendation systems, research on multi-modal sequential recommendation is still in its nascent stages. Unlike traditional sequential recommendation models that solely rely on item identifier (ID) information and focus on network structure design, multi-modal recommendation models need to emphasize item representation learning and the fusion of heterogeneous data sources. This paper investigates the impact of item representation learning on downstream recommendation tasks and examines the disparities in information fusion at different stages. Empirical experiments are conducted to demonstrate the need for a framework suitable for the collaborative learning and fusion of diverse information. Based on this, we propose a new model-agnostic framework for multi-modal sequential recommendation tasks, called Online Distillation-enhanced Multi-modal Transformer (ODMT), to enhance feature interaction and mutual learning among multi-source input (ID, text, and image), while avoiding conflicts among different features during training, thereby improving recommendation accuracy. To be specific, we first introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features. Secondly, we employ an online distillation training strategy in the prediction optimization stage so that the multi-source branches learn from each other, improving prediction robustness. Experimental results on a stream media recommendation dataset and three e-commerce recommendation datasets demonstrate the effectiveness of the proposed two modules, yielding approximately a 10% improvement in performance compared to baseline models.

Compared to general recommendation systems, multi-modal recommendation requires not only model architecture design, but also consideration of how to effectively apply multi-modal features in downstream tasks, especially in a way that is compatible with current recommendation systems. Most current recommendation systems rely on collaborative filtering [10,11,31,48,49], which predicts user-item interactions by modeling users and items separately and then computing the similarity between the user and a candidate item to generate a prediction score. In general multi-modal recommendation systems, multi-modal features are used to enhance the connection between user-item pairs [32,34,36] or as side information that complements the ID features [9,52]. In contrast, sequential recommendation systems rely more on item representation learning than on collaborative filtering, and model user representations according to the items clicked by the user and their temporal order [20,29,38,39,57]. This approach places a higher demand on item representation learning, especially for multi-modal recommendation systems, where the raw item information is more diverse.
Our study focuses on item representation learning in multi-modal sequential recommendation systems and explores the performance of single-modal features (text or image) used as individual input and as input combined with ID features in downstream recommendation tasks. We also investigate different fusion strategies for combining ID and multi-modal information. Our exploration experiments reveal that, compared to other network structures, Transformers provide better semantic transformation and representation learning for single-modal features, leading to more accurate predictions in recommendation tasks. However, the advantage of strong representation brought by Transformers weakens when using multi-source data (ID, text, and image) as input. Further analysis reveals that this is because ID features are easier to optimize and produce lower training loss in recommendation prediction tasks, whereas multi-modal features provide valuable prior information on item similarity, making it easier for recommendation systems to retrieve items of interest to users. Therefore, when ID and modal information are combined, the improvement in evaluation metrics may not perfectly align with the direction of loss reduction.
To address the challenges in multi-modal sequential recommendation, we propose a new model-agnostic framework called Online Distillation-enhanced Multi-modal Transformer (ODMT), equipped with two novel modules. Firstly, we introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features. Secondly, we apply an online distillation training strategy in the prediction optimization stage to obtain more robust predictions without compromising the loss optimization of the multi-modal features. Overall, our contributions can be summarized as follows:
• To relieve the incompatibility issue between multi-modal features and existing sequential recommendation models, we introduce the ODMT framework, which comprises an ID-aware Multi-modal Transformer module for item representation.
• To obtain robust predictions from multi-source input, we propose an online distillation training strategy in the prediction optimization stage, which marks the first instance of applying online distillation to a multi-modal recommendation task.
• Comprehensive experiments on four diverse multi-modal recommendation datasets and three popular backbones for sequential recommendation validate the effectiveness and transferability of the proposed method, which achieves about a 10% performance improvement compared with other baseline models.

PRELIMINARIES
This section thoroughly investigates the effects of the Item Representation Learning (IRL) and Information Fusion (IF) modules on downstream recommendation networks. We provide empirical evidence through experiments, which highlight two key findings: 1) Transformers are effective in transforming multi-modal information from general semantics to recommendation-specific semantics; 2) simple fusion strategies can cause a discrepancy between the direction of loss optimization and the direction of metric improvement during training, thus diminishing the contribution of multi-modal features to the recommendation model.

Brief Concepts
The IRL module is responsible for generating the final item embeddings by converting raw input data into distributed representations. Input data can be categorized as the item ID or other item modalities (e.g., image and text). The item embedding table is a crucial component in sequential recommendation models [12,20]: each item has a unique embedding representation that corresponds to its index. For multi-modal data, BERT [6] and ViT [7] are utilized to extract textual and visual features from the raw data. The extracted multi-modal features are then fed into a Feature Semantic Transformation (FST) module to convert the modal information into a semantic space suitable for recommendation. Our FST module candidates include DNNs, MoE Adaptor (MoE), and Transformers+DNNs (TRM+DNNs), all of which have been widely used in previous research [15,24,30,46].
In multi-modal sequential recommendation models, the IF module can be categorized into Early Fusion and Late Fusion [14,26,28,52]. Early Fusion embeds all item information into a single feature representation, which is then input into the model, while Late Fusion operates on the prediction results or prediction scores of each feature. In this paper, we consider three types of input information to be fused: ID, text, and image. In Early Fusion, we obtain fused item embeddings by averaging the three features. In contrast, in Late Fusion, we obtain a user's general preferences by averaging the three logits, where different logits correspond to different user preferences. Intermediate Fusion is not discussed in this section, as it is a model-dependent fusion method [56].
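The two fusion strategies can be sketched as below. This is a minimal illustration of the averaging scheme described above; the function and tensor names are chosen for clarity and are not taken from the paper.

```python
import torch

def early_fusion(id_emb, text_emb, image_emb):
    # Early Fusion: average the three item embeddings into one fused
    # representation, which is then fed into the sequential model.
    return (id_emb + text_emb + image_emb) / 3.0

def late_fusion(id_logits, text_logits, image_logits):
    # Late Fusion: average the per-feature prediction logits to obtain
    # the user's general preference over candidate items.
    return (id_logits + text_logits + image_logits) / 3.0
```

Early Fusion produces one item embedding per item before sequence modeling, whereas Late Fusion keeps three separate prediction branches and merges only their scores.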

Empirical Explorations
To ensure a fair comparison in our experiments, we control all variables other than the FST and IF modules, including the random seed, pre-trained encoders, hyper-parameters (e.g., learning rate, embedding size, hidden size, and dropout ratio), and experimental code; all experiments are conducted in a unified framework. We adopt the representative SASRec [20], which uses self-attention mechanisms for sequence modeling, as the backbone sequential model. Figure 1 shows the experimental results on the Stream and Arts (Amazon) datasets. General visual and textual features extracted by pre-trained models (e.g., BERT and ViT) are not necessarily suitable for recommendation tasks; therefore, an FST module is needed to transform the modality features into recommendation semantics. Figure 1 indicates that Transformers perform semantic transformation more effectively than DNNs and MoE when given single-modal input. This finding highlights the potential of Transformers in learning powerful representations for recommendation systems.
Previous studies on multi-modal sequential recommendation models [15,26,28,52] have typically used simple FST modules, such as DNNs and MoE, with Early Fusion or Late Fusion methods. In these cases, fusing multiple information sources has generally yielded favorable outcomes. When Transformers are employed as the FST module, however, we notice a diminishing advantage of multi-source input (ID, text, and image) over single-modal input as the number of Transformer layers increases. Remarkably, on the Arts dataset the single-modal input even surpasses the multi-source fusion input. These results suggest that while Transformers possess strong representation learning capabilities, they may not be able to fully realize their potential in multi-source input scenarios.
To provide more insight into the impact of the FST and IF modules, we fix the input to plain text and the FST module to DNNs, respectively. Figures 2a and 2b illustrate the consistency between training loss and testing Recall: models with lower training loss correspond to higher Recall scores. This observation suggests that single-modal input enables the FST module to better learn item representations and effectively reduce the training loss, leading to improved downstream recommendation performance. On the other hand, Figures 2c and 2d demonstrate the inconsistency between training loss and testing Recall when different types of information are used as input. This inconsistency may arise from overfitting or from a mismatch between the objective function and the evaluation metric. For recommendation models, optimizing the ID features can be viewed as an unconstrained optimization problem, resulting in lower training loss. Conversely, optimizing modality features is subject to constraints due to the prior information that item content exhibits similarities. Therefore, even if the modality-based models do not achieve significantly lower training loss, they can achieve better performance, particularly when using Transformers as the FST module.
Based on our findings, we conclude that the FST module plays a crucial role in extracting informative item representations. Integrating multi-modal information may not yield the optimal improvement in recommendation metrics, as there exists a misalignment between the direction of metric improvement and that of loss reduction. This misalignment creates a dilemma in learning both ID and modality features simultaneously for recommendation systems. The dilemma becomes more challenging as the modality representation capacity increases, which can ultimately compromise recommendation performance; in some cases, the performance is even worse than that of single-modal models.

METHOD
In the previous section, we discussed the significance of Transformers in acquiring multi-modal representations and pointed out the shortcomings of current methods when fusing multi-source information. To further improve the representation ability of Transformers in recommendation scenarios and to enable collaborative learning from different modalities without conflicts during training, we propose two modules based on the Late Fusion framework: 1) ID-aware Multi-modal Transformer. We incorporate the ID features with modality features and perform fine-grained feature interactions within a single multi-modal Transformer. 2) Online Distillation. We use an online distillation framework to compute the recommendation classification loss for each input, leveraging the strong representation capacity of Transformers. This ensures that each sub-network captures distinct user preferences by optimizing its corresponding loss. In addition, we introduce a distillation loss that facilitates on-the-fly mutual learning [8,53] among the student networks. Figure 3 shows the overall framework of ODMT with these two modules.

Notations
We define the set of users as $\mathcal{U} = \{u_1, u_2, \dots, u_{|\mathcal{U}|}\}$ and the set of items as $\mathcal{A} = \{a_1, a_2, \dots, a_{|\mathcal{A}|}\}$. For each item $a_i$, we record its image, text, and ID as $v_i$, $t_i$, and $d_i$, respectively. The $u$-th user interaction sequence $S_u^m$ is defined in chronological order as $S_u^m = \{a_1^m, a_2^m, \dots, a_n^m\}$, where $a_j^m$ represents the $j$-th interacted item and $m \in \{v, t, id\}$ represents the type of item information, i.e., image, text, and ID.

Item Representation Learning
Feature Extractor. Given an item with different input types ($v_i$, $t_i$, and $d_i$), we first use fixed visual and textual feature extractors (ViT and BERT) to obtain the corresponding fine-grained patch-level and token-level features, and we then obtain the corresponding ID features from a learnable embedding table. The feature extraction process is summarized as $E_v = \mathrm{ViT}(v_i) \in \mathbb{R}^{(n_v + 1) \times d_v}$, $E_t = \mathrm{BERT}(t_i) \in \mathbb{R}^{(n_t + 1) \times d_t}$, and $E_{id} = \mathrm{EmbTable}(d_i) \in \mathbb{R}^{d_{id}}$, where $d_v$, $d_t$, and $d_{id}$ are the visual feature dimension, textual feature dimension, and ID feature dimension, respectively; $n_v$ is the number of image patches and $n_t$ is the number of word tokens; the additional embedding in each modality corresponds to the special token "[CLS]".
Afterward, we use a simple feature transformation matrix to project each input feature into the same dimension $d$ as $\tilde{E}_m = E_m W_m + b_m$, where $W_m \in \mathbb{R}^{d_m \times d}$, $b_m \in \mathbb{R}^{d}$, and $m \in \{v, t, id\}$.
ID-aware Multi-modal Transformer (IMT). In this part, we describe the proposed ID-aware Multi-modal Transformer (IMT) module, which consists of multiple standard Transformer layers. Different from traditional multi-modal Transformers designed for visual and textual features, our IMT module integrates the ID features that are unique to recommendation systems into the Transformer. Our goal is to obtain a unified framework that transforms item embeddings from the original generic feature space to one that is suitable for recommendation (especially for modality features).
To achieve this, we first concatenate the visual patch features, ID features, and textual token features as $\tilde{E} = [\tilde{E}_v; \tilde{E}_{id}; \tilde{E}_t] \in \mathbb{R}^{d \times (n_v + n_t + 3)}$, where $[\cdot]$ denotes the concatenation operation. Since there are no paddings for the visual and ID features, we set the mask value for all visual and ID features to 0 and obtain the attention mask $M \in \mathbb{R}^{(n_v + n_t + 3) \times (n_v + n_t + 3)}$. However, as discussed in the previous section, the ID features can influence the optimization direction of the model. To prevent the misleading influence of ID embeddings on modality embeddings, we adjust the original attention mask matrix so that the column corresponding to the ID embedding is masked for all modality positions. This ensures that the ID embeddings can attend to the modality embeddings, while the modality embeddings cannot attend to the ID embeddings. Similar to the traditional Transformer modeling process, once we have the input feature $\tilde{E}$ and the revised attention mask matrix $M$, we can feed them directly into the standard Transformer layers as $\hat{E} = \mathrm{TRM}(\tilde{E}, M; \Theta_{\mathrm{IMT}})$, where $\Theta_{\mathrm{IMT}}$ denotes all the learnable parameters in the IMT module and $\hat{E} = [\hat{E}_v; \hat{E}_{id}; \hat{E}_t] \in \mathbb{R}^{d \times (n_v + n_t + 3)}$ denotes the encoded item representation. We then use the "[CLS]" embedding to represent the global feature for image and text. We do not need to include additional positional embeddings in the input, as the features extracted from the pre-trained models already contain positional information.
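The attention-mask adjustment above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the sequence layout and function name are assumptions, and PyTorch's boolean attention-mask convention (True = attention blocked) is used.

```python
import torch

def build_imt_attention_mask(num_patches, num_tokens):
    # Assumed sequence layout for illustration:
    # [img_cls, patch_1..patch_P, id, txt_cls, token_1..token_T]
    n_vis = num_patches + 1          # visual CLS + patches
    n_txt = num_tokens + 1           # textual CLS + tokens
    total = n_vis + 1 + n_txt        # plus the single ID embedding
    id_pos = n_vis                   # position of the ID embedding

    # False = attention allowed, True = attention blocked.
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Modality embeddings must not attend to the ID embedding:
    # block the ID column for every query position...
    mask[:, id_pos] = True
    # ...then re-open the ID row so the ID embedding can still
    # attend to all positions (including itself).
    mask[id_pos, :] = False
    return mask
```

Passing this mask to a standard Transformer layer keeps the ID branch informed by the modalities while shielding the modality embeddings from the ID features.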
To obtain more powerful feature representations for the recommendation domain [24], we empirically employ a separate two-layer DNN with a LeakyReLU activation [40] for each output as $e_m = W_2^m\,\mathrm{LeakyReLU}(W_1^m \hat{E}_m^{cls} + b_1^m) + b_2^m$, where $W_l^m \in \mathbb{R}^{d \times d}$, $b_l^m \in \mathbb{R}^{d}$, $m \in \{v, t\}$, and $l \in \{1, 2\}$. $e_m$ represents the final embedding of the item; in detail, $e_m^i$ denotes the embedding of the $i$-th item whose input type is $m$, where $m \in \{v, t, id\}$.
ID Embedding Initialization. It is noteworthy that the initial embedding table for the ID features is typically randomly generated, which differs significantly from the text and image features extracted by large pre-trained models such as BERT [6] and ViT [7]. In the IMT module, a self-attention mechanism is utilized to compute the similarity between queries and keys, and the discrepancy between the ID features and the modality features can negatively impact the optimization of the IMT module. To address this, when $d_t = d_v$, we initialize the ID embedding table by averaging $E_t$ and $E_v$, thus establishing $d_{id} = d_t = d_v$. When $d_t \neq d_v$, the ID embedding table is initialized using the text features, the image features, or the concatenated text and image features, chosen arbitrarily.
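For the case where the text and image feature dimensions match, the initialization can be sketched as below; the tensor names are hypothetical and the averaging follows the description above.

```python
import torch

def init_id_embedding_table(text_feats, image_feats):
    # text_feats, image_feats: (num_items, dim) pooled per-item features
    # from the pre-trained encoders, assumed to share the same dimension.
    # The ID embedding table is initialized as their average so that
    # query-key similarity scales match across feature types inside IMT.
    assert text_feats.shape == image_feats.shape
    return (text_feats + image_feats) / 2.0
```

When the dimensions differ, one of the two modalities (or their concatenation) would be used instead, per the text.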

User Sequence Modeling
In sequential recommendation tasks, user sequence features are generated from the interacted items. The widely used SASRec [20] model employs a multi-head attention mechanism for user sequence modeling. We adopt SASRec as the backbone network to learn user sequence representations from the three input types (image, text, and ID). The user behavior sequence with final item embeddings is represented as $S_u^m = \{e_m^1, e_m^2, \dots, e_m^n\}$. Using this sequence, we obtain the user sequence feature $h_u^m$ as $h_u^m = \mathrm{SASRec}_m(S_u^m; \Theta_{\mathrm{SASRec}_m})$, where $h_u^m \in \mathbb{R}^d$ is the user preference feature, $\Theta_{\mathrm{SASRec}_m}$ denotes all the learnable parameters in $\mathrm{SASRec}_m$, and $m$ represents the input type.
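A minimal SASRec-style encoder can be sketched as below. This is only an illustration of the idea (causal self-attention over item embeddings, with the last position taken as the user preference vector), not the paper's exact configuration; the class name and hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn

class TinySeqEncoder(nn.Module):
    # Illustrative SASRec-style user sequence encoder: a single
    # Transformer layer with a causal mask, so each position only
    # attends to earlier interactions.
    def __init__(self, dim, n_heads=2):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)

    def forward(self, item_embs):  # item_embs: (batch, seq_len, dim)
        seq_len = item_embs.size(1)
        # Additive causal mask: -inf above the diagonal blocks attention
        # to future positions.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
        out = self.layer(item_embs, src_mask=causal)
        # The hidden state at the last position summarizes the sequence
        # and serves as the user preference feature.
        return out[:, -1, :]
```

One such encoder would be instantiated per input type (image, text, ID), matching the three branches described above.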

Debiased Inbatch Loss
Following prior research [15,46,52], we adopt next-item prediction as the recommendation task, with a negative log-softmax loss as the guiding loss function, which helps bring the user preference and the target item closer in the feature space. Given the user interaction sequence $a_1 \rightarrow a_n$, the positive sample to be predicted is $a_{n+1}$. For computational efficiency, we use all items from the user interaction sequences in the mini-batch as the candidate item set. However, this approach leads to a skewed item distribution in the candidate item set, known as the Matthew effect: the majority of interactions concentrate on highly popular items, causing popular items to become over-represented and leading to under-optimized performance for less popular items. To mitigate this effect, we debias the similarity computation between users and items based on popularity [4,43].
It is also noteworthy to consider false negatives in in-batch sampling. When items that the user has already interacted with are used as negative samples, the gradient descent direction of the model may be misled. To address this issue, items in the candidate item set that overlap with the user's interaction sequence are excluded when predicting the user's next click.
In the batch training process, a set of $N$ training instances is considered, where each instance corresponds to a user sequence and a target next item (positive sample). These instances are encoded as embedding representations $\{(h_1, e_1), (h_2, e_2), \dots, (h_N, e_N)\}$, where $h_i$ and $e_i$ represent the user sequence embedding and target item embedding of the $i$-th pair, respectively. We then define the candidate item set that excludes items overlapping with the $i$-th user sequence as $\mathcal{B}_{h_i}$. Finally, we adopt the cross-entropy loss as the objective function: $\mathcal{L}_{rec} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(h_i^\top e_i - \log f(e_i)\right)}{\sum_{e_j \in \mathcal{B}_{h_i}} \exp\left(h_i^\top e_j - \log f(e_j)\right)}$, where $f(e_j)$ represents the frequency of item $e_j$ in the training set.
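The debiased in-batch loss can be sketched as below. This is a sketch under assumptions: the log-popularity correction and the diagonal positive layout follow the description above, and all names are hypothetical rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def debiased_inbatch_loss(user_embs, item_embs, item_freq, false_neg_mask):
    # user_embs: (B, d) user preference vectors
    # item_embs: (B, d) in-batch candidates; row i is the positive for user i
    # item_freq: (B,) training-set frequency of each candidate item
    # false_neg_mask: (B, B) True where candidate j appears in user i's
    #                 history but is not the target (a false negative)
    logits = user_embs @ item_embs.t()          # (B, B) similarity scores
    logits = logits - torch.log(item_freq)      # popularity debiasing
    # Exclude false negatives from the softmax denominator.
    logits = logits.masked_fill(false_neg_mask, float('-inf'))
    targets = torch.arange(user_embs.size(0))   # diagonal = positives
    return F.cross_entropy(logits, targets)
```

The diagonal of `false_neg_mask` must stay False, since the target item itself is never excluded.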

Online Distillation
Similar to Late Fusion, we model the different types of user sequences to calculate the similarity between user sequences and the target items as well as candidate items, obtaining logits that represent the user's interest distribution. However, different from Late Fusion, we treat each part as a student network branch and directly calculate the classification loss for each corresponding logit, rather than averaging the multi-source logits into a single classification loss. We believe that this independent loss calculation alleviates conflicts between multiple features during training. In detail, we denote $z_m$ and $y$ as the logits and the ground truth, where $m \in \{v, t, id\}$. The Late Fusion method obtains the final classification loss as $\mathcal{L}_{LF} = \mathrm{cross\_entropy}(z_e, y)$, where $z_e$ is the ensemble logit $z_e = (z_v + z_t + z_{id}) / 3$. As for collaborative learning, we calculate the classification loss as $\mathcal{L}_{cls}^m = \mathrm{cross\_entropy}(z_m, y)$, $m \in \{v, t, id\}$. In the knowledge distillation part, we calculate the distillation loss as $\mathcal{L}_{kd}^m = \sum_{m' \neq m} \mathrm{KL}\big(\sigma(z_{m'} / T), \sigma(z_m / T)\big)$, where $T$ is the temperature parameter, $\sigma$ is the softmax operation, and $\mathrm{KL}(p, q)$ denotes the KL divergence between the softened output $p$ from the teacher network and $q$ from the student network. Because the predictions of each student network are not accurate enough at the beginning of training, we decrease the weight of the distillation loss during the early training stage. Therefore, we adopt a time-dependent unsupervised ramp-up function $\beta(\tau)$ [22]: when the training epoch is 0, $\beta(\tau)$ is 0; it then increases exponentially as training progresses, and takes the value 1 when the training epoch reaches $\tau_{max}$. The final total loss is $\mathcal{L} = \sum_{m} \mathcal{L}_{cls}^m + \beta(\tau) \sum_{m} \mathcal{L}_{kd}^m$.
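The distillation loss and ramp-up weighting can be sketched as below. This is a hedged illustration: the temperature-softened KL form and the sigmoid-shaped exponential ramp-up follow the description and [22], but the exact constants are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened distributions; the
    # teacher side is detached so gradients flow only into the student,
    # and the loss is rescaled by T^2 as is conventional in distillation.
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * T * T

def rampup_weight(epoch, ramp_epochs):
    # Time-dependent ramp-up: 0 at epoch 0, exponentially approaching 1,
    # and exactly 1 from `ramp_epochs` onward.
    if epoch == 0:
        return 0.0
    if epoch >= ramp_epochs:
        return 1.0
    t = epoch / ramp_epochs
    return math.exp(-5.0 * (1.0 - t) ** 2)
```

Each branch's total loss would add its classification loss and `rampup_weight(epoch, ramp_epochs)` times the distillation terms from its peer branches.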

EXPERIMENTS

Datasets
We evaluate the performance of each method on four datasets, Stream, Arts, Office, and H&M, obtained from three different platforms. The Stream dataset is a stream media dataset from a video content platform that we crawled ourselves. Arts and Office are e-commerce datasets from the Amazon platform, which are publicly available and commonly used [14,50,58]. The Arts dataset corresponds to the "Arts, Crafts and Sewing" category of the Amazon review datasets, while Office represents the "Office Products" category. H&M is another e-commerce dataset from the H&M platform, released as a public competition dataset on Kaggle. The diversity of datasets from different platforms helps demonstrate the robustness of our proposed methods.
For the Stream dataset, we utilize the video cover and the concatenation of video tags and video titles to represent visual and textual information, respectively. For the other three datasets (Arts, Office, and H&M), we use the product cover to represent visual information. As for textual information, the Arts and Office datasets use the concatenation of "title", "brand", "category", and "description", while the H&M dataset uses the concatenation of "prod_name", "product_type_name", "product_group_name", "graphical_appearance_name", and "colour_group_name".
For all datasets, we utilize user sequences with more than 5 interactions and items that have completely matched textual and visual content. Additionally, we keep only the most recent 15 interaction records for each user. Table 1 presents the relevant statistics of each dataset.

Evaluation Metric
To evaluate the performance of each model, we follow [15,28] and adopt the commonly used metrics Recall@k and NDCG@k (Normalized Discounted Cumulative Gain@k). We report the average results over all users for both metrics, where a higher value indicates better performance. Following [15,28], we use the last interaction for testing, the second-to-last for validation, and the rest for training. We conduct hyper-parameter optimization on the validation set and choose the combination of parameters that yields the highest Recall@10 as the optimal configuration.
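With a single held-out target item per user, the two metrics reduce to simple per-user forms, sketched below (function names are illustrative).

```python
import math

def recall_at_k(ranked_items, target, k):
    # 1 if the target item appears in the top-k ranked list, else 0.
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    # With one relevant item, NDCG@k reduces to 1/log2(rank + 2) when
    # the target is ranked within the top k (0-based rank), else 0.
    if target in ranked_items[:k]:
        rank = ranked_items.index(target)
        return 1.0 / math.log2(rank + 2)
    return 0.0
```

The reported scores are these per-user values averaged over all users.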

Implementation Details
To make a fair comparison, we reproduce all of the baselines within our pipeline framework. Our default loss function for all models is the debiased in-batch cross-entropy loss. The visual features are extracted with the "openai/clip-vit-base-patch32" [27] pre-trained model. The textual features for Chinese text are extracted with the "hfl/chinese-roberta-wwm-ext" [5] pre-trained model, and the textual features for English text are extracted with the "bert-base-uncased" [6] pre-trained model. We conduct a grid search over hyper-parameters such as hidden size and learning rate. For the general sequential models, FDSA [53], and UniSRec [15], the search ranges for hidden sizes and learning rates are {128, 256, 512, 768} and {1e-3, 1e-4, 1e-5}, respectively. Empirically, SASRec+EF and SASRec+LF both have two Transformer layers each for the text and image modalities, for a total of four Transformer layers. In our model, we adopt two Transformer layers in the IMT. In our experiments, we set the batch size to 128, i.e., each training batch contains 128 sequences from different users. For hyper-parameters unique to each baseline, such as the number of GRU layers in GRU4Rec and the dilated convolution layers in NextItNet, we follow the settings in RecBole [54].

Comparison with SOTA Methods
Based on the input types, we divide SOTA methods into two categories: 1) general sequential recommendation models that take only item ID information as input (e.g., GRU4Rec [12], SASRec [20], NextItNet [45]), and 2) multi-modal sequential recommendation models that take both item ID information and modality information (visual and textual) as input (e.g., FDSA [52], UniSRec [15], SASRec+EF, SASRec+LF):
(a) GRU4Rec is a session-based recommendation algorithm that uses recurrent neural networks (GRUs) to model user behavior;
(b) SASRec is a self-attention-based sequential recommendation algorithm that uses a multi-head self-attention mechanism to capture user preferences;
(c) NextItNet is a neural network-based sequential recommendation algorithm that uses dilated convolutions to capture long-term dependencies between items;
(d) FDSA is a feature-driven and self-attention-based sequential recommendation algorithm that uses feature-driven attention mechanisms to capture user preferences;
(e) UniSRec is a universal sequence representation learning algorithm for recommendation that harnesses the descriptive text associated with an item to learn transferable representations across different domains and platforms;
(f) SASRec+EF (Our Extension) is an extension of SASRec that takes ID, text, and image as input and uses Transformer layers as the FST module with Early Fusion;
(g) SASRec+LF (Our Extension) is an extension of SASRec that takes ID, text, and image as input and uses Transformer layers as the FST module with Late Fusion.
For the vanilla UniSRec [15] and FDSA [53] models, only textual information is utilized. To make a fair comparison, we reproduce these models by incorporating image information, leveraging the inherent extensibility of the UniSRec and FDSA models.
Table 2 demonstrates the superiority of the sequential models with modality features over the general ID-based sequential models across all datasets and evaluation metrics, which underscores the benefit of incorporating modality information to enhance recommendation accuracy. Notably, SASRec+EF and SASRec+LF outperform UniSRec and FDSA in terms of Recall@10 and NDCG@10 scores, indicating that utilizing Transformers as the FST module leads to more effective item representation modeling and improved recommendation accuracy. Furthermore, the Late Fusion strategy proves to yield better overall performance. Built on the SASRec+LF framework, our proposed ODMT model achieves multi-source information representation learning in a unified manner, leveraging the strengths of contemporary multi-modal Transformer models and online distillation methods. Experimental results demonstrate that ODMT achieves better performance across all four datasets and all four metrics, surpassing not only SASRec+LF but also all other baseline models.

Ablation Study
In this study, we conduct an analysis to evaluate the impact of each module on the final performance of our proposed ODMT model.
To compare the performance of our model with other variants, we prepare 8 different models, including: (1) Text Initialization, which initializes the ID embedding table using only textual features; (2) Image Initialization, which initializes the ID embedding table using only visual features. As shown in Table 3, each new component contributes to the final performance. In the ID initialization part, utilizing the averaged features of both text and image modalities to initialize the ID embedding table yields the best performance, with text features following closely. However, random initialization of the ID embedding table has a detrimental impact on prediction results, particularly affecting the optimization of the IMT module. "w/o ID" shows an obvious performance reduction compared to the full ODMT, which highlights the effective accommodation of both ID features and multi-modal features in our framework.

Figure 4 indicates that existing models predict popular items well but struggle with long-tail items. Moreover, Figure 4 illustrates the consistent improvement of our proposed method over the SOTA models across item groups of varying popularity. Notably, in item group 0, which consists of items not present in the training set, the ID-based SASRec fails to make accurate predictions. In contrast, our proposed model effectively mitigates the cold-start problem, showing superior performance compared to other multi-modal sequential recommendation models.

Different Sequential Models as Backbone. Since our method is model-agnostic, we conduct experiments based on various sequential models to test its robustness. Figure 5 shows that ODMT consistently surpasses the other methods. This highlights the generality of ODMT, which can be seamlessly integrated as a plug-and-play module into any general sequential model.

RELATED WORK

Multi-modal Recommendation
The development of computer vision [17], natural language processing [23], and multi-modal learning [16,18,19,47] has provided better representations for heterogeneous data structures. Recently, multi-modal representations have been widely used in the field of recommendation systems, where collaborative filtering paradigms still dominate, with multi-modal features typically incorporated as side information in the model framework [33,35,36,42,58]. In the field of sequential recommendation, several studies have found that multi-modal features can also yield significant improvements [14,15,28]. Notably, [46] even achieved results comparable to traditional ID features using only multi-modal features, which underscores the significance of modal features in sequential recommendation and their substitutability for traditional ID features. We attribute this to the characteristics of sequential recommendation models, which heavily rely on item representations for modeling user preferences, unlike collaborative filtering, which requires additional learning of user representations. Through extensive experimental exploration of multi-modal sequential recommendation, we observe the issue of conflicts between multi-modal and ID features during training. Based on these observations, we design two modules that leverage the characteristics of ID and multi-modal features to make them compatible and mutually beneficial.

Knowledge Distillation
Knowledge distillation [13,44] aims to guide the learning of a student model by using a pre-trained teacher model, allowing the student model to achieve better predictive performance with a smaller model size. In recent years, online distillation has gained attention due to its end-to-end training strategy, which eliminates the need for a pre-trained teacher model. Unlike the traditional "teacher-student" paradigm of knowledge distillation, online distillation allows for mutual learning between all sub-networks [1,53] or the use of ensemble methods to obtain a teacher output that combines multiple prediction results [8,21,59], which in turn guides the learning of all sub-networks. Online distillation typically requires that each sub-network can independently complete downstream prediction tasks (usually classification tasks) and that the prediction results of different sub-networks have diverse characteristics [3]. In sequential recommendation, we find that using ID features or multi-modal features (such as image or text) can independently complete recommendation predictions. Additionally, due to the heterogeneity of the input, the outputs of the different features are also diverse. Based on these findings, we propose a multi-modal online distillation framework for sequential recommendation models.

CONCLUSION
This paper investigates the impact of item representation learning on downstream recommendation tasks and explores disparities in information fusion at different stages. Empirical experiments show that both have a significant influence on recommendation performance. To enhance recommendation accuracy, we propose two novel modules: an ID-aware Multi-modal Transformer for feature interaction and an online distillation training strategy that captures multi-faceted user interest distributions and improves prediction robustness. Experimental results on four datasets demonstrate the effectiveness of both modules.

ACKNOWLEDGEMENT
This work is supported by the Advanced Research and Technology Innovation Centre (ARTIC), the National University of Singapore, under Grant (project number: A-8000969-00-00). This research is also supported by the National Natural Science Foundation of China (9227010114) and the University Synergy Innovation Program of Anhui Province (GXXT-2022-040). Furthermore, we extend our gratitude to the Lab for Representation Learning at Westlake University (fajieyuan@westlake.edu.cn) for providing the dataset.

A APPENDIX A.1 Dataset Details
Figure 6 shows samples from the four datasets, drawn from three platforms: Stream, Amazon, and H&M. In contrast to the majority of existing research, which predominantly conducts experiments on Amazon datasets [14,15,24,28], our study employs a diverse range of experimental datasets to validate the robustness of our proposed method.
The Stream dataset is assembled by gathering publicly accessible user comments on videos from the stream media platform between February 2020 and September 2022. Specifically, we first collect short videos from the homepage to ensure channel diversity. We then include more videos by crawling related videos from the associated page of each first-stage video. Finally, we establish the user set by acquiring publisher IDs from the comment sections of all videos. Note that there are no privacy issues, since all videos and user comments are publicly available.

A.2 Effect of Different Transformer Layers
We investigate the number of layers in the multi-modal Transformer of the IMT module, where the number of attention heads corresponds to the number of Transformer layers. For comparison, we also evaluate the SASRec+LF method under the same conditions. Figure 7 indicates that 2 Transformer layers are sufficient to achieve good performance, which is our default setting. ODMT consistently outperforms SASRec+LF across different numbers of Transformer layers.

A.3 Effect of Parameters in Knowledge Distillation
The online distillation part of our framework involves two important parameters: 1) the temperature parameter (T), which serves as a scaling factor controlling the softness or hardness of the predicted probability distributions, and 2) the weight of the distillation loss. Table 4 illustrates the impact of these two parameters on the experimental results on the Stream and Arts datasets.
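The effect of the temperature T can be seen directly from a temperature-scaled softmax: dividing the logits by a larger T before normalizing flattens the distribution, while a smaller T sharpens it. A minimal sketch (illustrative values, not the paper's logits):

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax with temperature T; larger T -> softer (more uniform) output."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [3.0, 1.0, 0.2]          # hypothetical item logits
sharp = softmax_T(scores, T=0.5)  # low T: probability mass concentrates
soft = softmax_T(scores, T=4.0)   # high T: distribution flattens out
```

A higher T exposes more of the relative ranking among non-top items ("dark knowledge"), which is what the distillation loss transfers between sub-networks.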
Table 4: Performance comparison of different hyperparameter settings in the online distillation module. "None" denotes that the weight of the distillation loss is 0, serving as a control for comparison with the other settings. "R@10" is short for Recall@10 and "N@10" for NDCG@10. The best performance is highlighted in bold.

Figure 1 :
Figure 1: Comparison of Recall@10 for different FST modules based on single-modal and multi-source input, across the Arts and Stream datasets. "x" in TRMx+DNNs represents the number of Transformer layers; for x = 1, there is 1 Transformer layer for text and 1 Transformer layer for image.

Figure 2 :
Figure 2: Training curves of FST modules with text input, evaluating (a) training loss and (b) testing Recall@10, and training curves of different single input with DNNs as FST module, evaluating (c) training loss and (d) testing Recall@10 on the Arts dataset.

Figure 3 :
Figure 3: (a) Overall framework of our proposed ODMT model, illustrating the forward computation flow based on modifications to the Late Fusion approach. (b) Our proposed IMT module. (c) Our proposed Online Distillation module. Different colors represent the information flow of different features (ID, image, and text).
(3) w/o Initialization, which abandons initialization for the ID embedding table with text and image features; (4) w/o ID mask, which removes the limitation of modality features being unable to attend to ID features in the IMT module; (5) w/o IMT (1), which replaces the two layers of IMT with two standard Transformer layers for text input and two standard Transformer layers for image input, with the depth of Transformers in both cases remaining the same; (6) w/o IMT (2), which replaces the two layers of IMT with one standard Transformer layer for text input and one standard Transformer layer for image input, while keeping the total number of Transformers the same; (7) w/o Online Distillation, which uses a traditional Late Fusion loss calculation method and removes the distillation loss; (8) w/o ID, which removes the ID component in ODMT and replaces it with the corresponding ID-free version.

Figure 4 :
Figure 4: Performance comparison of sequential models with respect to item popularity, where the x-axis represents groups divided based on quantile statistics. Group 0 refers to items that have not appeared in the training set, while the item count is kept consistent across the other groups.

Figure 5 :
Figure 5: Performance comparison of sequential models with different backbones.

4.6.1 Performance Comparison w.r.t. Item Popularity.
The item popularity distribution follows the Matthew effect, with a majority of users showing interest in only a small portion of items.
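The quantile-based grouping used in Figure 4 can be sketched as follows. This is a plausible reconstruction, not the paper's exact procedure: it counts item occurrences in the training set, assigns unseen test items to group 0, and splits the remaining items into equally sized buckets ordered by popularity.

```python
from collections import Counter

def popularity_groups(train_items, test_items, n_groups=4):
    """Map each test item to a popularity bucket.

    Group 0 = items never seen in training; groups 1..n_groups are
    equal-sized buckets from least to most popular. Illustrative helper.
    """
    counts = Counter(train_items)
    seen = sorted(counts, key=lambda i: counts[i])  # least -> most popular
    size = max(1, len(seen) // n_groups)
    bucket = {item: 1 + min(idx // size, n_groups - 1)
              for idx, item in enumerate(seen)}
    return {item: bucket.get(item, 0) for item in set(test_items)}
```

Grouping by quantiles rather than raw counts keeps the number of items per bucket comparable, so per-group metrics are not dominated by the long tail.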

Figure 6 :
Figure 6: Selected sample cases from four different datasets.

Figure 7 :
Figure 7: Performance comparison of different numbers of Transformer layers.

Table 1 :
Statistics of all datasets used in our experiments. "#Inter." represents the total number of user-item interactions, and "Avg." denotes the average user sequence length.

Table 2 :
Overall performance of our model and the baselines on four multi-modal recommendation datasets. The best performances are noted in bold, and the second-best are underlined.