MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition

Dynamic facial expression recognition (DFER) is essential to the development of intelligent and empathetic machines. Prior efforts in this field mainly fall into the supervised learning paradigm, which is severely restricted by the limited labeled data in existing datasets. Inspired by the recent unprecedented success of masked autoencoders (e.g., VideoMAE), this paper proposes MAE-DFER, a novel self-supervised method which leverages large-scale self-supervised pre-training on abundant unlabeled data to largely advance the development of DFER. Since the vanilla Vision Transformer (ViT) employed in VideoMAE requires substantial computation during fine-tuning, MAE-DFER develops an efficient local-global interaction Transformer (LGI-Former) as the encoder. Moreover, in addition to the standalone appearance content reconstruction in VideoMAE, MAE-DFER also introduces explicit temporal facial motion modeling to encourage LGI-Former to excavate both static appearance and dynamic motion information. Extensive experiments on six datasets show that MAE-DFER consistently outperforms state-of-the-art supervised methods by significant margins (e.g., +6.30% UAR on DFEW and +8.34% UAR on MAFW), verifying that it can learn powerful dynamic facial representations via large-scale self-supervised pre-training. Besides, it has comparable or even better performance than VideoMAE, while largely reducing the computational cost (about 38% FLOPs). We believe MAE-DFER has paved a new way for the advancement of DFER and can inspire more relevant research in this field and even other related tasks. Codes and models are publicly available at https://github.com/sunlicai/MAE-DFER.


INTRODUCTION
Facial expressions, as an important aspect of nonverbal communication, play a significant role in interpersonal interactions [10]. In the past two decades, automatic facial expression recognition (FER) has drawn widespread attention due to its crucial role in developing intelligent and empathetic machines that can interact with humans in a natural and intuitive way [11,44,45]. FER also has a wide spectrum of practical applications in areas such as healthcare [3], education [66], and entertainment [53]. According to the input data type, FER can be divided into two categories, i.e., static FER (SFER) and dynamic FER (DFER) [31]. SFER takes static facial images as input, while DFER aims to recognize expressions in dynamic image sequences or videos. Since SFER overlooks the temporal information that is critical to the interpretation of facial expressions, this paper mainly focuses on DFER.
DFER is dominated by the supervised learning paradigm. Researchers have developed various deep neural networks for this task, including 2D/3D convolutional neural networks (CNNs) [15,25,27], recurrent neural networks (RNNs) [14,52,65], and more advanced Transformer-based architectures [29,35,37,61,69]. Although supervised methods have achieved remarkable success, the limited training samples in existing DFER datasets (typically around 10K, much smaller than those in other research areas such as general image/video classification and face recognition; see details in Table 1) severely restrict their further advancement (e.g., training large video Transformers). A straightforward idea to address this issue is to increase the dataset scale. However, collecting and annotating large-scale high-quality DFER datasets is time-consuming and labor-intensive, mainly due to the sparsity of dynamic facial expressions in videos and the ambiguity and subjectivity in facial expression perception [25,31,64]. Considering that there are massive unlabeled facial videos on the Internet, a natural question arises: can we exploit them to fully unleash the power of deep neural networks for better DFER?
The recent progress of self-supervised learning in many deep learning fields [1,12,22] indicates that the answer is positive. Notably, masked autoencoder (MAE) [22] in computer vision develops an asymmetric encoder-decoder architecture for masked image modeling. It successfully pre-trains the vanilla Vision Transformer (ViT) [13] in an end-to-end manner and outperforms supervised baselines on many vision tasks. Subsequently, VideoMAE [54] extends MAE to the video domain and also achieves impressive results on many general video datasets. Motivated by this line of research, we present MAE-DFER (Fig. 1), a novel self-supervised method based on VideoMAE which leverages large-scale self-supervised pre-training on abundant unlabeled facial video data to promote the advancement of DFER. Although VideoMAE has achieved remarkable success in self-supervised video pre-training, we notice that it still has two main drawbacks: 1) The vanilla ViT encoder employed in VideoMAE requires substantial computation during fine-tuning due to the quadratic scaling cost of global space-time self-attention, which is unaffordable in many resource-constrained scenarios. 2) It only reconstructs video appearance contents during pre-training, and thus might be insufficient to model temporal facial motion information, which is also crucial to DFER.
To tackle the above issues in VideoMAE, our MAE-DFER presents two core designs accordingly. For the first issue, MAE-DFER develops an efficient local-global interaction Transformer (LGI-Former) as the encoder. Different from the global space-time self-attention in ViT, LGI-Former first constrains self-attention to local spatiotemporal regions and then utilizes a small set of learnable representative tokens to enable efficient local-global information exchange. Concretely, it decomposes the global space-time self-attention into three stages: local intra-region self-attention, global inter-region self-attention, and local-global interaction. In this way, LGI-Former can efficiently propagate global information to local regions while avoiding the expensive computation of global space-time attention. For the second issue, MAE-DFER introduces joint masked appearance and motion modeling to encourage the model to capture both static facial appearance and dynamic motion information. Specifically, in addition to the original appearance content reconstruction branch, it simply utilizes the frame difference signal as another reconstruction target for explicit temporal facial motion modeling. To verify the effectiveness of MAE-DFER, we perform large-scale self-supervised pre-training on the VoxCeleb2 dataset [9], which has more than 1M unlabeled facial video clips collected from YouTube. Then we fine-tune the pre-trained model on six DFER datasets, including three relatively large in-the-wild datasets (DFEW [25], FERV39k [64], and MAFW [32]) and three small lab-controlled datasets (CREMA-D [5], RAVDESS [36], and eNTERFACE05 [38]). The results show that MAE-DFER significantly outperforms the state-of-the-art supervised methods, indicating that it is capable of learning strong and useful dynamic facial representations for DFER. Moreover, compared with VideoMAE, MAE-DFER reduces FLOPs by ∼38% while achieving comparable or even better performance. The main contributions of this paper are summarized as follows:

• We present a novel self-supervised method, MAE-DFER, as an early attempt to leverage large-scale self-supervised pre-training on abundant unlabeled facial video data to advance the development of DFER.

• MAE-DFER improves VideoMAE by developing an efficient LGI-Former as the encoder and introducing joint masked appearance and motion modeling. With these two core designs, MAE-DFER largely reduces the computational cost while achieving comparable or even better performance.

• Extensive experiments on six DFER datasets show that our MAE-DFER consistently outperforms the previous best supervised methods by significant margins (+5∼8% UAR on three in-the-wild datasets and +7∼12% WAR on three lab-controlled datasets), which demonstrates that it can learn powerful dynamic facial representations for DFER via large-scale self-supervised pre-training.

RELATED WORK

Dynamic Facial Expression Recognition
Early studies on DFER primarily focused on designing various local descriptors, and only a few very small lab-controlled datasets were available for evaluation. With the emergence of deep learning and the proliferation of relatively larger datasets, the research paradigm has undergone a transformative shift towards training deep neural networks in an end-to-end fashion. In general, there are three trends. The first trend directly utilizes 3D CNNs (such as C3D [55], 3D ResNet [21], R(2+1)D [56], and P3D [46]) to extract joint spatiotemporal features from raw facial videos [15,25,27,32,60,64].
The second trend uses the combination of a 2D CNN (e.g., VGG [48] and ResNet [23]) and an RNN (e.g., LSTM [24] and GRU [8]) [14,25,26,32,52,64,65]. Recently, with the rise of the Transformer [59], several studies have exploited its global dependency modeling ability to augment CNNs/RNNs for better performance, forming the third trend [29,30,32,35,37,69]. For instance, Former-DFER [69] employs a Transformer-enhanced ResNet-18 for spatial feature extraction and another Transformer for temporal information aggregation. STT [37] improves Former-DFER by introducing factorized spatial and temporal attention for joint spatiotemporal feature learning. IAL [29] further introduces a global convolution-attention block and an intensity-aware loss to deal with expressions of different intensities. However, all the above methods fall into the supervised learning paradigm and are thus restricted by the limited training samples in existing DFER datasets. Unlike them, this paper proposes a self-supervised method that can learn powerful representations from massive unlabeled facial video data and achieve significant improvements over them.

Masked Autoencoders
Masked autoencoders (MAEs), as representatives of generative self-supervised learning, have recently achieved unprecedented success in many deep learning fields [67]. They are mainly inspired by the progress of masked language modeling (e.g., BERT [12] and GPT [47]) in natural language processing and typically adopt a mask-then-predict strategy to pre-train the vanilla ViT. Notably, iGPT [7] follows GPT to auto-regressively predict pixels and makes the first successful attempt. BEiT [2] follows BERT and adopts a two-stage training pipeline, i.e., first utilizing an off-the-shelf tokenizer to generate discrete visual tokens and then performing masked-then-predict training. MAE [22] improves on BEiT by designing an asymmetric encoder-decoder architecture that enables efficient end-to-end pre-training. After that, many studies adopt the architecture of MAE to perform self-supervised pre-training on various tasks. For instance, VideoMAE [54] and its concurrent work MAE-ST [17] extend MAE to the video domain and achieve impressive results on many video benchmarks. Our proposed MAE-DFER is inspired by VideoMAE and develops two core designs to facilitate effective and efficient representation learning for DFER.

METHOD

Revisiting VideoMAE
VideoMAE [54] is a simple extension of MAE [22] to the video domain. It basically follows the asymmetric encoder-decoder architecture of MAE for self-supervised video pre-training. The main difference is that a much higher masking ratio (i.e., 90% vs. 75%) and a tube masking strategy (instead of random masking) are adopted, considering the large temporal redundancy and high temporal correlation in videos [54]. Specifically, VideoMAE mainly consists of four modules: cube embedding, tube masking, a high-capacity encoder Φ_e (i.e., the vanilla ViT), and a lightweight decoder Φ_d. Given a raw video V ∈ R^{T×H×W×3}, VideoMAE first utilizes cube embedding with a cube size of 2 × 16 × 16 to transform V into a sequence of tokens X ∈ R^{N×C}, where N = (T/2) · (H/16) · (W/16) and C is the channel size. Then the tube masking module generates a mask M ∈ {0, 1}^N with a masking ratio of ρ = 90%, and the high-capacity encoder Φ_e only takes the unmasked tokens X ⊙ M ∈ R^{N_v×C} (N_v = (1 − ρ)N) as input and simply processes them with global space-time self-attention. Subsequently, the lightweight decoder Φ_d combines the encoded visible tokens with learnable mask tokens (N − N_v of them) to reconstruct the raw video data. Finally, the mean squared error between the original and reconstructed video at the masked positions is calculated to optimize the whole model. The above process can be generally formulated as

V̂ = Φ_d(Φ_e(X ⊙ M)),   L = (1/|Ψ(M)|) Σ_{p∈Ψ(M)} ‖V(p) − V̂(p)‖²,

where Ψ is a function used to obtain the masked positions in the pixel space. In downstream tasks, the lightweight decoder Φ_d is discarded and only the high-capacity ViT encoder Φ_e is fine-tuned.
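As a concrete illustration of the tube-masking step described above, the sketch below (a minimal NumPy illustration of our own, not the authors' implementation) samples one spatial mask and repeats it along the temporal axis:

```python
import numpy as np

def tube_mask(t_tokens, s_tokens, ratio=0.9, seed=0):
    """Tube masking: sample one spatial mask and repeat it over every
    temporal slice, so masked positions form 'tubes' along time."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(s_tokens * (1 - ratio)))
    keep = np.zeros(s_tokens, dtype=bool)
    keep[rng.choice(s_tokens, size=n_keep, replace=False)] = True
    return np.tile(keep, t_tokens)  # True marks a visible token

# A 16x160x160 clip with 2x16x16 cubes yields 8 temporal x (10*10) spatial tokens.
mask = tube_mask(t_tokens=8, s_tokens=100, ratio=0.9)
visible = int(mask.sum())  # the encoder sees only ~10% of the 800 tokens
```

Because the same spatial positions are masked in every temporal slice, the model cannot trivially copy content from adjacent frames, which is why tube masking suits highly redundant video data.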

MAE-DFER: Overview
Although VideoMAE has achieved great success in self-supervised video pre-training, it still faces two major challenges. First, it only focuses on reconstructing raw appearance contents in the video and thus lacks explicit temporal motion modeling, which might be insufficient for capturing the temporal facial motion information needed by DFER. Second, although it enjoys high efficiency during pre-training thanks to the asymmetric encoder-decoder architecture (i.e., dropping a large proportion of masked tokens to save computation), the computational cost of global space-time self-attention in the vanilla ViT is still extremely expensive during downstream fine-tuning, since input tokens cannot be dropped at that stage. To tackle these issues, as shown in Fig. 1, we propose MAE-DFER, a new self-supervised framework for DFER. For the first issue, MAE-DFER introduces joint masked appearance and motion modeling to encourage the model to excavate both static appearance and dynamic motion information (Section 3.3). For the second issue, it employs a novel local-global interaction Transformer (LGI-Former) as the encoder to largely reduce the computational cost relative to ViT during downstream fine-tuning (Section 3.4).

MAE-DFER: Joint Masked Appearance and Motion Modeling
Temporal motion information matters for DFER (e.g., the gradual appearance or disappearance of a smile may convey totally different emotions). To explicitly incorporate this information into self-supervised pre-training, our MAE-DFER adds a temporal motion reconstruction branch in parallel with the original appearance reconstruction branch in VideoMAE to achieve joint facial appearance and motion structure learning. Specifically, we simply calculate the frame difference signal as the temporal motion target, given that its computation is very cheap and it has shown effectiveness in video action recognition [49,62,63]. To keep the computational cost during pre-training similar to that of VideoMAE, we share the decoder backbone between the appearance and motion branches and only use two different linear heads to predict their targets. Besides, the decoder only outputs appearance predictions at the odd frames and motion predictions at the remaining even frames. Finally, the total loss is the weighted sum of the mean squared errors of the two branches:

L = λ L_app + (1 − λ) L_motion,

where L_app and L_motion denote the appearance and motion reconstruction losses, and λ is a hyperparameter balancing the contribution of the two branches, which we empirically set to 0.5.
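A minimal sketch of the joint objective (our own NumPy illustration; in the actual method the loss is computed only at masked positions, with appearance predicted at odd frames and motion at even frames):

```python
import numpy as np

def motion_target(video):
    """Frame difference signal: D[t] = V[t+1] - V[t]."""
    return video[1:] - video[:-1]

def joint_loss(pred_app, pred_mot, video, lam=0.5):
    """Weighted sum of appearance and motion MSE; lam balances the two
    branches (the paper empirically uses 0.5)."""
    l_app = np.mean((pred_app - video) ** 2)
    l_mot = np.mean((pred_mot - motion_target(video)) ** 2)
    return lam * l_app + (1.0 - lam) * l_mot

video = np.random.randn(16, 160, 160, 3)
# Perfect predictions of both targets drive the loss to zero.
loss = joint_loss(video, motion_target(video), video)
```

The frame difference target costs only one subtraction per frame pair, which is why it adds essentially no overhead compared with heavier motion representations such as optical flow.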

MAE-DFER: Efficient LGI-Former
The architecture of LGI-Former is illustrated in Fig. 2. Unlike the global space-time self-attention adopted in the vanilla ViT, LGI-Former constrains self-attention to local spatiotemporal regions to save computation. However, simply stacking multiple local self-attention layers does not permit inter-region information exchange.
Inspired by [16] and [51], the core idea of LGI-Former is to introduce a small set of representative tokens to the local regions. On the one hand, these tokens take charge of summarizing the critical information in each local region. On the other hand, they allow for long-range dependency modeling between different regions and enable efficient local-global information exchange. Thanks to the representative tokens, the expensive global space-time self-attention can be decomposed into three stages with much cheaper computation: 1) local intra-region self-attention, 2) global inter-region self-attention, and 3) local-global interaction. In the following, for simplicity, we only describe these three stages during fine-tuning. The process during pre-training is similar, as MAE-DFER follows VideoMAE in adopting the tube masking strategy and applies the same masking ratio to each local region, ensuring that all regions have an equal number of visible tokens.
Local Intra-Region Self-Attention. For convenience, we first reshape the input token sequence X ∈ R^{N×C} (after cube embedding) into a 3D token map and divide it into non-overlapping local spatiotemporal regions with an equal size of t × h × w, as shown in Fig. 2. In each region, apart from the original tokens, we also add a learnable representative token. The local intra-region self-attention then operates on their concatenation to simultaneously promote fine-grained local feature learning and enable local information aggregation into the representative token. Assume that the original local tokens and the associated representative token in the i-th region are X_i ∈ R^{n×C} and S_i ∈ R^{1×C} respectively (n = thw, i ∈ {1, 2, ..., M}, and M = N/n is the number of regions and thus of representative tokens). The local intra-region self-attention is formulated as

X̂_i = [S_i; X_i],   X̂_i ← X̂_i + MHSA(LN(X̂_i)),

where X̂_i ∈ R^{(n+1)×C}, MHSA is the multi-head self-attention in the vanilla ViT, and LN stands for layer normalization. In particular, MHSA is calculated as

MHSA(X) = Concat(head_1, ..., head_h) W^O,   head_j = Attention(XW_j^Q, XW_j^K, XW_j^V),

where W_j^* ∈ R^{C×d} (* ∈ {Q, K, V}), W^O ∈ R^{C×C}, h is the number of attention heads, and d = C/h is the feature dimension of each head. Global Inter-Region Self-Attention. After local intra-region self-attention, the representative token has extracted the crucial information of each local region and can stand in for the original tokens when exchanging information between regions. Since the number of representative tokens is typically small (e.g., 8), the computational cost of inter-region communication is negligible. Thus, we first aggregate all representative tokens, S = Concat(S_1, ..., S_M) ∈ R^{M×C}, and then simply apply global inter-region self-attention on them to propagate information between different regions.
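The region partition described above is just a reshape. The sketch below (our own NumPy illustration; the 8 × 10 × 10 token map and 2 × 5 × 10 region size follow the defaults reported later in the paper) shows how the flat token sequence is grouped into regions:

```python
import numpy as np

def partition_regions(tokens, T, H, W, t, h, w):
    """Reshape a flat (T*H*W, C) token sequence into non-overlapping
    t x h x w spatiotemporal regions -> (num_regions, t*h*w, C)."""
    C = tokens.shape[-1]
    x = tokens.reshape(T // t, t, H // h, h, W // w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # bring the region-block axes first
    return x.reshape(-1, t * h * w, C)

tokens = np.zeros((8 * 10 * 10, 512))                 # default 8x10x10 token map
regions = partition_regions(tokens, 8, 10, 10, 2, 5, 10)
# -> 8 regions of 100 tokens each; a learnable representative token
#    would then be prepended to every region before self-attention.
```

Because the partition is a pure reshape/transpose, it adds no parameters or FLOPs of its own.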
Local-Global Interaction. After information propagation via global inter-region self-attention, the representative token in each local region has been consolidated with useful information from the other regions and thus has a global view of the whole input. To let the original tokens in each local region access this global information, we further employ cross-attention between the local tokens and their representative token:

X_i ← X_i + MHCA(LN(X_i), LN(S_i)),   X_i ← X_i + FFN(LN(X_i)),

where MHCA is multi-head cross-attention and FFN denotes a feed-forward network. MHCA has a similar implementation to MHSA except that its query and key/value come from different inputs, i.e.,

MHCA(X, Y) = Concat(head_1, ..., head_h) W^O,   head_j = Attention(XW_j^Q, YW_j^K, YW_j^V).

Complexity Analysis. Suppose the flattened input is X ∈ R^{N×C}, where N = Mn is the total number of input tokens, M is the number of local regions, and n is the number of original tokens in each region. Since self-attention scales quadratically with sequence length, the complexity of local intra-region self-attention is O(M(n+1)²C) ≈ O(Mn²C) = O(N²C/M). Similarly, the complexity of global inter-region self-attention is O(M²C) = O(N²C/n²). Moreover, local-global interaction has a complexity of O(MnC) = O(NC). Putting them together, the complexity of an LGI-Former block is O(N²C/M + N²C/n² + NC), while a standard Transformer block in the vanilla ViT has a complexity of O(N²C). In practice, M ≪ N and n ≪ N, thus the computational cost of LGI-Former is largely reduced compared with the vanilla ViT.
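To make the analysis concrete, the toy counter below compares query-key pair counts per block under the default configuration (a back-of-the-envelope sketch of our own; the channel factor C and all constants are dropped, and the local-global stage is modeled as each local token attending to its region's representative token, consistent with the O(NC) term above):

```python
def vit_pairs(N):
    """Global space-time self-attention: every token attends to every token."""
    return N * N

def lgi_pairs(M, n):
    """Query-key pairs for the three LGI-Former stages."""
    intra = M * (n + 1) ** 2   # local intra-region self-attention
    inter = M * M              # global inter-region self-attention
    local_global = M * n       # local tokens attending to their region's
                               # representative token (cross-attention)
    return intra + inter + local_global

N, M, n = 800, 8, 100          # 8x10x10 tokens, 8 regions of size 2x5x10
ratio = vit_pairs(N) / lgi_pairs(M, n)   # roughly 7.8x fewer attention pairs
```

Note that attention is only part of a block's total cost (the FFN and projections are linear in N), which is why the end-to-end FLOP reduction reported in the paper is ∼38% rather than the ∼8x seen in attention pairs alone.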

Datasets
Pre-training Dataset. We perform self-supervised pre-training on VoxCeleb2 [9]. It has over 1 million video clips of more than 6,000 celebrities, extracted from around 150,000 interview videos on YouTube. It is divided into a development set and a test set. We only use the development set for pre-training, which contains 1,092,009 video clips from 145,569 videos.

Implementation Details
MAE-DFER. For the high-capacity encoder, we adopt LGI-Former with 16 blocks and a hidden size of 512. The total number of parameters is 84.9M, similar to that of the ViT base model (86.2M). The local region size is set to 2 × 5 × 10 by default. For the lightweight decoder, we follow VideoMAE and adopt four standard Transformer blocks with a hidden size of 384. Given that the speaker's face generally does not fill the entire frame, we only use a 160 × 160 patch located in the upper center of each video frame to remove irrelevant background information. During pre-training, we extract 16 frames from each video clip using a temporal stride of 4. This results in 8 × 10 × 10 input tokens after cube embedding with a cube size of 2 × 16 × 16. Regarding hyperparameters, we mainly follow VideoMAE. Specifically, we use an AdamW optimizer with β1 = 0.9 and β2 = 0.95, an overall batch size of 128, a base learning rate of 3e−4, and a weight decay of 0.05. We linearly scale the base learning rate according to the overall batch size, using the formula lr = base learning rate × batch size / 256, and adopt a cosine decay learning rate scheduler. By default, we pre-train the model for 50 epochs, with 5 warmup epochs. Using 4 Nvidia Tesla V100 GPUs, pre-training takes about 3-4 days.
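For instance, the linear scaling rule gives the actual pre-training learning rate as:

```python
def scaled_lr(base_lr, batch_size):
    """Linear scaling rule: lr = base_lr * batch_size / 256."""
    return base_lr * batch_size / 256

lr = scaled_lr(3e-4, 128)   # pre-training: batch 128 -> effective lr 1.5e-4
```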
Fine-tuning. As in pre-training, the input clip size is 16 × 160 × 160 and the temporal stride is 4 for most datasets (except 1 for FERV39k). To optimize the model, we use an AdamW optimizer with β1 = 0.9 and β2 = 0.999, a base learning rate of 1e−3, and an overall batch size of 96. The other hyperparameters remain the same as in pre-training; more details can be found in [54]. We fine-tune the pre-trained model for 100 epochs, with 5 warmup epochs. During inference, we uniformly sample two clips along the temporal axis of each video and average their scores as the final prediction.

Ablation Studies
In this part, we conduct ablation experiments on DFEW and FERV39k to demonstrate the effects of several key factors in MAE-DFER. For simplicity, on DFEW we only report results of fold 1 (fd1).
Pre-training Epochs. As shown in Table 2, we observe that longer pre-training is generally beneficial and performance saturates at around 50 epochs. Besides, we find that the performance of training from scratch (i.e., #Epochs = 0) is very poor (nearly random guessing). This is largely attributed to the limited training samples in current DFER datasets, since large vision Transformers are data-hungry and training them typically requires more than million-level labeled data [13,54]. This result also demonstrates the significance and superiority of large-scale self-supervised pre-training over traditional supervised learning.
Comparison of Different Model Architectures. We then investigate the effect of the three key modules in LGI-Former by evaluating the performance of the following variants: 1) only local intra-region self-attention (i.e., no global inter-region self-attention and no local-global interaction), 2) no local-global interaction, 3) no global inter-region self-attention, and 4) using global space-time self-attention instead (i.e., ViT). The results are presented in Table 3. We have the following observations: 1) The first variant has the worst performance, which is expected since local intra-region self-attention alone does not allow local tokens to access global information. 2) Either global inter-region self-attention or local-global interaction contributes to better performance, demonstrating the effectiveness of these two modules in local-global information propagation. The latter is generally more effective than the former, but at the cost of more computation. It should also be noted that global inter-region self-attention introduces only negligible computation (∼0.1 GFLOPs), thanks to the small number (i.e., 8) of representative tokens. 3) When combining global inter-region self-attention with local-global interaction, LGI-Former achieves the best results. Compared with the last variant, which uses global space-time self-attention (i.e., ViT), we only observe a slight performance drop (<0.6%) but a large computation reduction (∼38% FLOPs), demonstrating the efficiency of LGI-Former. Effectiveness of Joint Masked Appearance and Motion Modeling. We study the effect of different loss weights in Equation 2, ranging from 1.0 (i.e., only the original appearance target) to 0.0 (i.e., only the motion target). As shown in Fig.
3, we find that the joint model outperforms models with only one reconstruction target, achieving the best performance with a loss weight around 0.5. For instance, on DFEW fd1, the best joint model surpasses the standalone appearance model by 1.69% UAR and 0.92% WAR, and its motion counterpart by 1.77% UAR and 1.02% WAR. These results indicate that joint masked appearance and motion modeling is indispensable for better spatiotemporal representation learning in DFER. Beyond MAE-DFER, we also apply joint modeling to VideoMAE (shown in Table 10 of Appendix), where it likewise brings further improvement (1.51% UAR and 0.30% WAR on DFEW fd1, and 0.39% UAR and 0.13% WAR on FERV39k). Role of Local Region Size. We evaluate the effect of different local region sizes in LGI-Former and report the results in Table 4. We find that model performance is not very sensitive to the region size. Moreover, the model computation with different region sizes is similar. These results indicate that, regardless of how the input is divided into local regions, LGI-Former can achieve effective and efficient local-global information exchange via the introduced representative tokens and its specialized designs (i.e., the three key modules). Besides, with a region size of 2 × 5 × 10 (using only M = 8 representative tokens), the model achieves the best performance-computation trade-off.

Comparison with State-of-the-art Methods
Results on Large In-the-wild Datasets. We first compare MAE-DFER with previous state-of-the-art supervised methods on DFEW, FERV39k, and MAFW in Table 5, Table 6, and Table 7, respectively. On DFEW, MAE-DFER surpasses the previous best methods (i.e., DPC-Net [65] and M3DFEL [60]) by a significant margin, achieving a noteworthy 6.30% UAR and 5.18% WAR improvement. Besides, we present fine-grained per-class performance in Table 12 of Appendix; MAE-DFER achieves remarkable improvement across most facial expressions. Notably, for the disgust expression, which accounts for only 1.2% of the entire dataset and is very challenging for all baselines, MAE-DFER clearly improves the best accuracy. Fine-grained results on FERV39k and MAFW are given in Table 13 and Table 14 of Appendix. In summary, the promising results on three in-the-wild datasets demonstrate the strong generalization ability of MAE-DFER in practical scenarios.
Comparison with VideoMAE. To verify the effectiveness and efficiency of MAE-DFER, we also show the results of VideoMAE [54] on the three in-the-wild datasets, including both the original model pre-trained on Kinetics-400 [6] for 1600 epochs and a model pre-trained on VoxCeleb2 under the same setting as MAE-DFER. From Tables 5-7, we have the following observations: 1) The original VideoMAE model pre-trained on general videos (i.e., action recognition) is largely inferior to its counterpart pre-trained on facial videos, indicating that a large domain gap between self-supervised pre-training and downstream fine-tuning severely hurts performance. 2) Compared with VideoMAE pre-trained on VoxCeleb2, our MAE-DFER largely reduces the computational cost (∼38% FLOPs) while achieving comparable or even better performance.

Figure 5: Embedding space visualization using t-SNE [58].
Results on Small Lab-controlled Datasets. We show the comparison results on CREMA-D, RAVDESS, and eNTERFACE05 in Table 8. Compared with the in-the-wild datasets, we observe even larger performance improvements on the three lab-controlled datasets. On CREMA-D, our MAE-DFER outperforms the best unimodal methods by over 12% UAR and 10% WAR. More surprisingly, it also shows slightly better performance than the state-of-the-art multimodal method, amply demonstrating the superiority of MAE-DFER. On RAVDESS, MAE-DFER improves over the previous best by more than 12% WAR and achieves performance comparable to the best audio-visual method. Finally, on eNTERFACE05, MAE-DFER surpasses the best-performing Graph-Tran [68] by about 7% WAR.

Visualization Analysis
Reconstruction. We first visualize the reconstruction results of MAE-DFER in Fig. 4. The video is randomly selected from the VoxCeleb2 test set. For better visualization, we use a gray-style background for the frame difference images shown at even frames, and we also show the full reconstructed video by adding the reconstructed frame difference images at even frames to the adjacent recovered odd-frame images. From Fig. 4, we see that even under such high masking ratios (75% or 90%), MAE-DFER can still generate satisfactory reconstructions of both the facial appearance content and the temporal motion information. Notably, despite the change in identity information (as the model does not see this person during pre-training), the dynamic facial expression is well restored by reasoning over the limited visible context (e.g., the opening mouth). This implies that our model is able to learn meaningful dynamic facial representations that capture the global spatiotemporal structure.
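The composition of the full reconstructed video can be sketched as follows (a schematic of our own, assuming each even-frame difference is taken with respect to the preceding odd frame):

```python
import numpy as np

def compose_video(odd_frames, even_diffs):
    """Rebuild the full clip: odd frames come from the appearance head;
    even frames are recovered as preceding_odd_frame + predicted_difference."""
    T = odd_frames.shape[0] + even_diffs.shape[0]
    video = np.empty((T,) + odd_frames.shape[1:])
    video[0::2] = odd_frames               # appearance predictions
    video[1::2] = odd_frames + even_diffs  # motion predictions turned into frames
    return video
```

With perfect predictions, this composition exactly inverts the frame-difference target, so the visualized video is a faithful decoding of what the two heads predict.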
Embedding Space. To further qualitatively show the superiority of MAE-DFER over traditional supervised methods, we visualize the learned embeddings using t-SNE [58] on DFEW fd1. As can be seen in Fig. 5, the embeddings of our method are more compact and separable than those of two state-of-the-art supervised methods (i.e., IAL [29] and Former-DFER [69]), which demonstrates that MAE-DFER can learn more discriminative representations for different dynamic facial expressions through large-scale self-supervised pre-training. Besides, VideoMAE yields an embedding space similar to that of MAE-DFER, but at a much higher computational cost.

CONCLUSION
In this paper, we have presented an effective and efficient self-supervised framework, namely MAE-DFER, which exploits large amounts of unlabeled facial videos to address the dilemma of current supervised methods and promote the development of DFER. We believe MAE-DFER will serve as a strong baseline and foster relevant research in DFER. In the future, we plan to explore the scaling behavior of MAE-DFER (i.e., using more data and larger models). Besides, it is also interesting to apply it to other related tasks (e.g., dynamic micro-expression recognition and facial action unit detection).

Role of Classification Token Type. We finally explore the effect of two different classification token types (i.e., original tokens and representative tokens) for downstream fine-tuning. As shown in Table 11, we find that mean pooling over the representative tokens for final classification slightly outperforms mean pooling over the original tokens. We speculate that this is because the representative tokens are more compact and high-level than the original tokens.

C DETAILED RESULTS
In this section, we first present more fine-grained results (i.e., the accuracy of each class) on DFEW, FERV39k, and MAFW in Table 12, Table 13, and Table 14, respectively. From these three tables, we observe that MAE-DFER significantly outperforms the state-of-the-art supervised methods on most facial expressions, especially on some rare facial expressions (such as disgust, contempt, and disappointment). For instance, on DFEW, our MAE-DFER surpasses the previous best supervised results by about 9% on sad, 13% on disgust, and 8% on fear. On MAFW, it improves over the best-performing supervised methods by over 5% on anger, 7% on disgust, 8% on contempt, 8% on anxiety, 6% on helplessness, and 7% on disappointment. Moreover, compared with VideoMAE pre-trained under the same setting, MAE-DFER achieves comparable or even better fine-grained performance while largely reducing the computational cost during fine-tuning. We also note that the original VideoMAE pre-trained on Kinetics-400 does not perform well on some rare expressions (e.g., disgust on FERV39k), although it achieves the best results on some dominant expressions (e.g., neutral on FERV39k). These results indicate that our MAE-DFER can effectively and efficiently learn more robust and general representations for DFER via large-scale self-supervised pre-training on abundant unlabeled facial videos, thus mitigating the unbalanced learning issue and achieving superior fine-grained performance.

Figure 1 :
Figure 1: An overview of the proposed MAE-DFER.

Figure 2 :
Figure 2: The illustration of LGI-Former. For simplicity, we only present the information flow in one block, which mainly consists of three stages: 1) local intra-region self-attention, 2) global inter-region self-attention, and 3) local-global interaction.

Figure 3 :
Figure 3: Ablation study on the loss weight.

Figure 4 :
Figure 4: Reconstruction results of a VoxCeleb2 test video under masking ratios of 0.75 and 0.9. We only show 8 frames due to space limitations.

Table 1 :
Basic information of the six DFER datasets used in this paper. CV: cross-validation. †: subject-independent setting.

Table 2 :
Ablation study on the pre-training epochs.

Table 4 :
Ablation study on the local region size.

Table 10 :
Ablation study on VideoMAE with additional temporal facial motion modeling.

Table 11 :
Ablation study on the classification token type.