MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning

The first Multimodal Emotion Recognition Challenge (MER 2023) was successfully held at ACM Multimedia. The challenge focuses on system robustness and consists of three distinct tracks: (1) MER-MULTI, where participants are required to recognize both discrete and dimensional emotions; (2) MER-NOISE, in which noise is added to test videos for modality robustness evaluation; (3) MER-SEMI, which provides a large number of unlabeled samples for semi-supervised learning. In this paper, we introduce the motivation behind this challenge, describe the benchmark dataset, and provide statistics about the participants. To continue using this dataset after MER 2023, please sign a new End User License Agreement and send it to our official email address merchallenge.contact@gmail.com. We believe this high-quality dataset can become a new benchmark in multimodal emotion recognition, especially for the Chinese research community.


INTRODUCTION
Multimodal emotion recognition has become an important research topic due to its wide-ranging applications in human-computer interaction. Over the past few decades, researchers have proposed various approaches [1][2][3]. However, owing to their limited robustness in complex environments, existing techniques do not fully meet practical demands. To this end, we launch the Multimodal Emotion Recognition Challenge (MER 2023), which aims to improve system robustness from three aspects: multi-label learning, modality robustness, and semi-supervised learning.
Annotating with both discrete and dimensional emotions is common in current datasets [4,5]. Existing works mainly utilize multi-task learning to predict all labels simultaneously [6,7]. However, these works ignore the correlation between discrete and dimensional emotions. For example, valence is a dimensional attribute that reflects the degree of pleasure. For negative emotions (such as anger and sadness), the valence score should be less than 0; for positive emotions (such as happiness), it should be greater than 0. To fully exploit this multi-label correlation, we launch the MER-MULTI sub-challenge, which encourages participants to design appropriate loss functions [8] or model structures [9] to boost recognition performance.
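One way to exploit the discrete-valence correlation described above is a correlation-aware loss term. The sketch below is purely illustrative (the polarity table, penalty form, and weighting factor `lam` are our own assumptions, not part of the challenge): it adds a hinge-style penalty whenever the predicted valence sign disagrees with the polarity of the discrete label.

```python
import numpy as np

# Hypothetical label polarities: +1 for positive emotions, -1 for negative, 0 for neutral.
POLARITY = {"happy": 1.0, "angry": -1.0, "sad": -1.0, "neutral": 0.0}

def sign_consistency_penalty(pred_valence, discrete_label):
    """Penalize a predicted valence whose sign disagrees with the polarity
    of the discrete emotion label (zero penalty when signs agree)."""
    polarity = POLARITY[discrete_label]
    # Hinge-style penalty: 0 when the sign matches, grows linearly otherwise.
    return max(0.0, -polarity * pred_valence)

def multi_label_loss(ce_loss, mse_loss, pred_valence, discrete_label, lam=0.1):
    """Combine classification loss, valence regression loss, and the
    correlation-aware penalty, weighted by `lam` (an assumed hyperparameter)."""
    return ce_loss + mse_loss + lam * sign_consistency_penalty(pred_valence, discrete_label)
```

For a "happy" sample, a predicted valence of -0.5 incurs a penalty of 0.5, whereas any positive prediction incurs none; neutral labels are never penalized.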
Many factors can perturb input modalities, which increases the difficulty of emotion recognition. Recently, researchers have proposed various strategies to deal with this problem [10][11][12]. However, due to the lack of benchmark datasets, existing works mainly rely on their own simulated missing conditions to evaluate modality robustness. To this end, we launch the MER-NOISE sub-challenge, which provides a benchmark test set focusing on more realistic modality perturbations, such as background noise and blurry videos. In this sub-challenge, we encourage participants to use data augmentation [13] or other more advanced techniques [14,15].
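A minimal form of the augmentation suggested above is to corrupt training audio with additive noise at a controlled signal-to-noise ratio. The following sketch (our own illustration, not the challenge's official corruption pipeline) uses white Gaussian noise scaled to a target SNR in dB:

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Corrupt a 1-D audio signal with white Gaussian noise at a target
    signal-to-noise ratio (in dB). Illustrative augmentation sketch."""
    if rng is None:
        rng = np.random.default_rng(0)
    signal_power = np.mean(signal ** 2)
    # SNR (dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

Training on copies of each utterance at several SNR levels (e.g. 20, 10, and 5 dB) is a common way to make the acoustic branch less sensitive to background noise.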
Meanwhile, it is difficult to collect large amounts of emotion-labeled samples due to the high annotation cost. Training with limited data harms the generalization ability of recognition systems. To address this issue, researchers have exploited various pre-trained models for video emotion recognition [16,17]. However, task similarity affects the performance of transfer learning [18], and existing video-level pre-trained models mainly focus on action recognition rather than emotional expressions [19]. In this paper, we extract human-centered video clips containing emotional expressions from movies and TV series. We then launch the MER-SEMI sub-challenge, encouraging participants to use semi-supervised learning [19,20] to achieve better performance.
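A simple semi-supervised baseline in this setting is pseudo-labeling: train a classifier on the labeled subset, predict on the unlabeled pool, and keep only high-confidence predictions as extra training data. The selection step can be sketched as follows (the threshold value is an assumption for illustration):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Given predicted class probabilities for unlabeled samples (shape N x C),
    keep only confident predictions as pseudo-labels.
    Returns (indices of kept samples, their pseudo-labels)."""
    confidence = probs.max(axis=1)
    keep = np.where(confidence >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)
```

The retained pairs are then merged with the labeled set and the model is retrained; the cycle can be repeated, optionally raising the threshold as the model improves.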

CHALLENGE DATASET
MER 2023 employs an extended version of CHEAVD for performance evaluation. Due to the small size of CHEAVD, we implement a fully automatic strategy to collect large amounts of unlabeled video clips; due to the low annotation consistency of CHEAVD, we adopt a stricter data selection approach and split the dataset into reliable and unreliable parts. We further divide the reliable samples into three subsets: Train&Val, MER-MULTI, and MER-NOISE. We treat the unreliable samples as unlabeled data and merge them with automatically collected samples to form MER-SEMI. Statistics of each subset are shown in Table 1. Figure 1 summarizes the distribution of discrete emotions. Despite some imbalance, our dataset is still relatively balanced compared to other mainstream benchmarks such as MELD [41] and CMU-MOSEI [5]. Figure 2 further reveals the relationship between discrete emotions and valence. Valence serves as an indicator of pleasure, with values ranging from negative (displeasure) to positive (pleasure). From this figure, we observe that the valence distribution of each discrete label is quite reasonable. Negative emotions (such as anger, sadness, and worry) predominantly exhibit valences below 0; conversely, positive emotions (such as happiness) primarily exhibit valences above 0. The valence of neutral samples centers around 0. Notably, surprise is a fairly complex emotion with multiple meanings, such as sadly surprised, angrily surprised, or happily surprised; hence, its valence ranges from negative to positive. These findings attest to the high quality of our labels and demonstrate the necessity of incorporating both discrete and dimensional annotations, as they help distinguish subtle differences in emotional states.
To download the dataset, participants should sign an End User License Agreement (EULA), which restricts the dataset to academic research and prohibits editing samples or uploading them to the Internet. For each track (MER-MULTI, MER-NOISE, and MER-SEMI), participants can submit up to 20 times per day, with a maximum of 200 submissions in total. At the end of the challenge, each team is required to submit a paper describing their approach. The program committee will conduct a double-blind review of each paper's scientific quality, novelty, and technical quality. To continue using this dataset after the challenge, please sign a new EULA and send it to our official email address. We will provide the test set labels to facilitate further usage. We believe this dataset can serve as a new benchmark in robust multimodal emotion recognition, especially for the Chinese research community.

PARTICIPANTS AND OUTCOME
This year's challenge attracts registrations from 76 teams across various academic institutions. Due to the inherent class imbalance of discrete emotions (see Figure 1), we choose the weighted average F-score as the evaluation metric for discrete emotions, consistent with previous works [42,43]. For dimensional emotions, we select the widely used mean squared error (MSE) as the evaluation metric. To further evaluate overall performance, we define a combined metric that incorporates both discrete and dimensional predictions:

metric_combined = metric_d - metric_v,

where metric_d and metric_v represent the metrics for discrete emotions and valences, respectively. In MER-MULTI and MER-NOISE, participants are required to provide predictions for both discrete and dimensional emotions; therefore, we use the combined metric for performance evaluation. In MER-SEMI, we only evaluate discrete predictions on the labeled subset; therefore, we use the weighted average F-score as the evaluation metric.
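The two ingredients of the combined metric can be computed with standard definitions. The sketch below (our own implementation for illustration, not the official evaluation script) computes the weighted average F-score by hand and subtracts the valence MSE:

```python
import numpy as np

def weighted_f1(y_true, y_pred, num_classes):
    """Weighted-average F-score: per-class F1 weighted by class support."""
    total, score = len(y_true), 0.0
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        support = tp + fn
        if support == 0:
            continue  # class absent from the ground truth
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support / total) * f1
    return score

def combined_metric(y_true, y_pred, val_true, val_pred, num_classes):
    """Combined score: discrete F-score minus valence MSE (higher is better)."""
    mse = float(np.mean((np.asarray(val_true) - np.asarray(val_pred)) ** 2))
    return weighted_f1(np.asarray(y_true), np.asarray(y_pred), num_classes) - mse
```

A perfect system scores 1.0 (F-score of 1 and MSE of 0); larger valence errors pull the combined score down.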
For each sub-challenge, we perform an initial attempt to explore a range of multimodal features and establish a competitive baseline system. To ensure reproducibility, we primarily utilize open-source pre-trained models for feature extraction and a simple yet effective multi-layer perceptron for emotion recognition. In MER-MULTI and MER-NOISE, our baseline system achieves 0.56 and 0.41 on the combined metric, respectively. In MER-SEMI, we only evaluate discrete emotions; our baseline system reaches 86.40% on the weighted average F-score.
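The baseline classifier described above can be sketched as a single-hidden-layer perceptron over fused features. The forward pass below is a minimal illustration (the layer sizes and ReLU/softmax choices are our assumptions, not the official baseline configuration):

```python
import numpy as np

def mlp_forward(features, w1, b1, w2, b2):
    """One hidden layer with ReLU, softmax output over emotion classes.
    `features` holds concatenated audio/visual/text embeddings (N x D)."""
    hidden = np.maximum(features @ w1 + b1, 0.0)       # hidden layer, ReLU
    logits = hidden @ w2 + b2
    # Numerically stable softmax: subtract the row-wise max before exp.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)        # class probabilities
```

Despite its simplicity, this kind of late-fusion MLP over strong pre-trained features is a common and surprisingly competitive baseline for multimodal emotion recognition.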
Table 2 ∼ Table 4 show the leaderboards for the three sub-challenges. Encouragingly, most teams exceed our baseline performance. The team named "sense-dl-lab" emerges as the winner of all three sub-challenges. Their system outperforms our baseline by 0.1405 on MER-MULTI and 0.2746 on MER-NOISE. On MER-SEMI, their system reaches 89.11% on the evaluation metric, outperforming our baseline by 2.36%.

CONCLUSIONS
This paper summarizes MER 2023, a multimodal emotion recognition challenge focused on system robustness. MER 2023 consists of three sub-challenges: (1) MER-MULTI requires participants to predict both discrete and dimensional emotions; this multi-label annotation helps distinguish subtle differences in emotional states. (2) MER-NOISE simulates data corruption in real-world environments for modality robustness evaluation. (3) MER-SEMI requires participants to train more powerful classifiers using large amounts of unlabeled data. In the future, we plan to increase both labeled and unlabeled samples in our corpus. Additionally, we hope to organize a series of challenges and related workshops that bring together researchers from all over the world to discuss recent progress and future directions in multimodal emotion recognition.

Figure 2: Empirical PDF of valence for different discrete emotions (Train&Val).

Table 1: Statistical information for the challenge dataset (duration: hh:mm:ss).