Loss Relaxation Strategy for Noisy Facial Video-based Automatic Depression Recognition

Automatic depression analysis has been widely investigated on face videos that have been carefully collected and annotated in lab conditions. However, videos collected under real-world conditions may suffer from various types of noise due to challenging data acquisition conditions and lack of annotators. Although deep learning (DL) models frequently show excellent depression analysis performances on datasets collected in controlled lab conditions, such noise may degrade their generalization abilities for real-world depression analysis tasks. In this article, we uncovered that noisy facial data and annotations consistently change the distribution of training losses for facial depression DL models; i.e., noisy data–label pairs cause larger loss values compared to clean data–label pairs. Since different loss functions could be applied depending on the employed model and task, we propose a generic loss function relaxation strategy that can jointly reduce the negative impact of various noisy data and annotation problems occurring in both classification and regression loss functions for face video-based depression analysis, where the parameters of the proposed strategy can be automatically adapted during depression model training. The experimental results on 25 different artificially created noisy depression conditions (i.e., five noise types with five different noise levels) show that our loss relaxation strategy can clearly enhance both classification and regression loss functions, enabling the generation of superior face video-based depression analysis models under almost all noisy conditions. Our approach is robust to its main variable settings and can adaptively and automatically obtain its parameters during training.


INTRODUCTION
Major Depressive Disorder (commonly called depression) is the most prevalent psychiatric disorder; it negatively impacts various aspects of human life, including behaviours, feelings, sleeping and so forth [12], and can even lead patients with severe depressive symptoms to commit suicide. Previous psychological and biological studies frequently claimed that facial behaviours are informative for reflecting human depression status [7,43,46,47]. Since facial behaviours can be easily recorded in a non-invasive way via cameras installed in various devices, a large number of automatic depression recognition approaches have been developed to objectively estimate depression severity from facial behaviours [1,9,22,25,37,48,49,53]. However, they were only developed and evaluated on depression datasets collected under controlled lab conditions [18,44,45,55,56], where the face videos are carefully recorded, selected, segmented and annotated.
In contrast, face videos recorded in the real world easily suffer from various types and levels of noise due to complex recording conditions (e.g., varying lighting and device conditions). For example, in some frames the face may not be captured by the camera, or the recorded faces may be incomplete or of low quality. These incomplete or low-quality face frames may in turn lead to incorrect or even failed face detection, so that the detected spatio-temporal facial behaviours become discontinuous; i.e., the obtained facial sequence cannot reflect the real facial behaviours of the target subject. Besides, unlike these carefully annotated in-lab datasets, expert human annotators are not always available for real-world datasets. Therefore, acquired data may also suffer from the noisy label problem; i.e., the depression labels of some subjects/videos are not reliable. As a result, deep learning (DL) models trained with such noisy data and annotations would be problematic [13,14]. In other words, even though recent depression approaches achieved promising performance on clean face videos, they may not be reliable for noisy face-based depression analysis.
To reduce the negative impact of noisy data and annotations on the training of machine learning (ML) models, previous solutions frequently attempted to estimate the structure of the noise; specifically, they either correct [23,27,41,64] or remove [28,32,33,36,50,51] the estimated noisy data or labels. However, estimating the distribution of noise usually requires a high computational cost (Problem 1), and removing noisy samples further decreases the size of the training set (Problem 2), as existing facial video-based depression datasets usually contain very limited numbers of training samples (e.g., fewer than 300 training videos). Although one option is to remove only noisy frames, this would again make the modified video discontinuous. To avoid such problems, other related studies aim to build a noise-invariant model for the target task without estimating the noisy label structure, where robust loss functions and meta-learning strategies have been investigated. These approaches aim to make models trained with noisy and noise-free labels achieve the same performance [3,10,29,34,57,66]. Unfortunately, each of them is only suitable for addressing a certain noise problem occurring in a specific task. Since a wide variety of unknown noisy data and label problems could occur at the same time, these solutions are not applicable to real-world face video-based depression analysis (Problem 3).
In this article, we propose a novel loss function relaxation strategy that can jointly minimize the negative effects of various noisy facial data and annotation problems during the training phase of facial video-based depression analysis models. Our hypothesis is that the distribution of the training loss values obtained from a noisy dataset differs from the loss distribution obtained from a clean dataset (investigated in Section 3). Thus, instead of focusing on estimating the type or distribution of the potential noise, our relaxation strategy categorises loss values into different noise levels in a simple manner and then individually reduces the negative impacts of each level of noise (addressing Problem 3). Importantly, the proposed relaxation strategy can be easily applied to customize both classification and regression loss functions, whose parameters can be automatically and dynamically adapted according to the applied loss function and the training loss values obtained in every training epoch, without requiring complex or additional parameter tuning processes (addressing Problem 1). Since this strategy does not modify or remove any data/label, it has no impact on the size of the dataset (avoiding Problem 2). In summary, the main contributions of the proposed loss relaxation strategy are as follows:
-We present the first work that systematically investigates the negative effects of various types of noisy facial data (e.g., failed face detection, incorrectly detected faces, etc.) and label problems on the training of video-based depression analysis DL models, and uncover that higher-level noise usually leads models to generate predictions with larger differences from the ground truth (i.e., larger loss values) during training.
-We propose a novel loss function relaxation strategy that specifically reduces large loss values based on multiple estimated noise levels, which can jointly minimize the negative impact of various unknown noisy facial data and annotation problems on the training of depression DL models.
-Given the uncertainty of the noisy conditions in real-world facial video-based depression applications, we propose a novel adaptive relaxation parameter computation strategy that allows any customized relaxation loss function to automatically, dynamically and adaptively adjust its parameters during training without any manual intervention (i.e., dynamically and automatically setting a set of thresholds to define multiple noise levels for the obtained loss values).

RELATED WORK
In this section, we first review previously proposed video-based automatic depression recognition methods and then summarise existing solutions for addressing noisy data and annotation problems.

Video-based Automatic Depression Recognition
Recent video-based automatic depression analysis approaches can be roughly categorized into three types: frame-level, thin video slice-level and video-level depression modelling approaches. A standard pipeline for frame-level approaches [21,69] is to first learn depression cues from each static facial display, even though a whole video is provided, and then average all frame-level predictions as the video-level depression prediction. An apparent shortcoming of these approaches is that they ignore spatio-temporal cues from human facial dynamics, which are claimed to be crucial for inferring depression [9,49].
Although Uddin et al. [53] applied Long Short-Term Memory (LSTM) networks to model temporal information, their feature extractor mainly focuses on extracting facial appearance-related depression cues. As a result, many studies [2,8,9,19,63,68] proposed to infer depression from short-term spatio-temporal facial behaviours (i.e., thin video slice-based approaches), where the C3D network has frequently been employed to directly learn depression-related facial behaviours from video slices [2,8,9]. Similarly, the video-level prediction is obtained by averaging or by applying an LSTM [2,19] to combine slice-level predictions. Since depression is a long-term mental health issue, some studies also attempt to extract video-level representations for video-level depression predictions. For example, Niu et al. [37] first propose a spatio-temporal attention model to learn depression cues from short-term facial behaviours and then use eigen-evolution pooling to aggregate all short-term cues into a video-level representation. Moreover, some studies [4,48,49,62] propose to encode all frame-level facial representations of the target video into a video-level spectral representation that contains multi-scale video-level facial behavioural dynamics for depression recognition. These approaches achieved state-of-the-art face-based depression analysis performances on multiple publicly available depression datasets (i.e., AVEC 2013 [56], AVEC 2014 [55] and DAIC-WOZ [18]). However, these studies usually crop and align the face in each frame to deal with potential noisy data problems [6]. While sparse coding is employed in an early study [60], it can only suppress background noise rather than dealing with more complex noisy data and annotation problems. In summary, there is neither a study that specifically considers the impact of the various noises that can potentially appear in facial videos on depression analysis DL models nor an approach that can jointly address multiple noise-related problems.

Solutions for Addressing Noisy Data and Annotation Problems
Most recent solutions focus on addressing the noisy label issue, as previous studies frequently claim that noisy labels are more harmful than noisy data [20,70]. Specifically, these solutions can be categorized into two types: noise model-based solutions and noise model-free solutions. Noise model-based solutions focus on modelling the noisy label structure of the dataset and can be further categorized as (1) noisy channel modelling, which corrects predictions to fit the given label distribution [11,23,39]; (2) label noise cleaning, which identifies and corrects suspicious labels [54,64,65]; (3) noisy label removal and sample choosing, which removes suspicious labels [28,36] or directly selects the data with correct labels for training [5,33]; and (4) sample importance weighting, which assigns a unique weight to each sample based on the estimated noisiness of its label [42,59]. Instead of modelling the noise structure of labels, noise model-free solutions aim to build a noise-invariant model for the target task. Most of them assume that the main problem caused by noise is model overfitting. Therefore, robust losses have frequently been proposed to allow a classifier/regressor trained with noisy labels to achieve the same performance as one trained with noise-free labels [3,17,34]. Besides, meta learning is another popular noise model-free solution, which adjusts models' hyper-parameters to minimize the effect of the noise [15,29]. In contrast to dealing with noisy labels, data denoising methods [27,30,35,41] have been widely investigated to correct noisy data. These methods usually train networks to recover clean samples from noisy samples and have been successfully applied to data containing white noise and hybrid noise. Other solutions propose to remove erroneous samples (e.g., data with incorrect contents or incomplete data) or to select the most noise-robust samples for model training [32,38,50]. Among these studies, a few works are devoted to addressing the noise problem for facial analysis tasks, where the majority also aim to address noisy label problems [16,57,58,61,66]. Meanwhile, a limited number of studies handle noisy facial data; these are either noise model-based solutions [67] that model the noise using a network or noise model-free solutions that design a loss function to train noise-robust models [10].

PROBLEM FORMULATION
The goal of this article is to develop a noise-robust loss function relaxation strategy for real-world facial video-based depression analysis, where various types of noise may affect the recorded facial videos and annotations. Thus, this section explicitly investigates the impact of various types of noisy facial data and annotations on the training process of depression severity estimation models, i.e., the impact of noise on the training loss. This provides the empirical basis for developing our loss relaxation strategy. Specifically, six standard DL models are evaluated on 25 different artificially created noise conditions, defined by noisy labels and three types of noisy data (i.e., non-face images, incorrect face images and incomplete face videos) as well as their mixture, to ensure the generalizability and objectivity of the investigation results. The detailed settings of these noisy conditions and depression models are explained in Section 5.1 and Section 5.2, respectively.
The effects of noisy facial labels: We first analyse the negative effects of the noisy annotation problem on all DL models by showing the average L1 differences between predictions and the corresponding ground truths during training. According to Figure 1, there is a clear difference between the L1 values achieved under clean and various levels of noisy label conditions: the L1 values obtained under noisy label conditions are more likely to be larger than 6 compared to clean conditions. In particular, training samples generally generate higher loss values under higher noise levels throughout all training epochs, even though the loss values are gradually reduced after several training epochs (i.e., more loss values fall into the '0-3' group by the 10th training epoch).
The effects of noisy facial data: We also evaluate the effects of three types of standard facial data noise: (1) non-face images, (2) incorrect face images, and (3) incomplete face videos. As shown in Figure 1, the non-face and incorrect face problems cause smaller loss differences in comparison to noisy labels, while incomplete face video noise has a similarly strong negative impact. Similar to the noisy label conditions, under all noisy data conditions higher noise levels consistently produce larger loss values. In summary, compared to facial video data and labels obtained in the original controlled condition, adding different types of noise to them consistently leads to larger differences (loss values) between predictions and ground truth during depression model training, where higher levels of noise tend to produce larger loss values. Consequently, the trained model would have poor generalization capability. In this article, instead of directly applying the obtained loss values to update the models' weights, we propose to first estimate the probability that a loss value is caused by noise and then reduce the impact of such loss values in the weight update stage accordingly.
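The loss-grouping analysis above can be sketched in a few lines of Python; the bin edges (3 and 6) are assumptions taken from the '0-3' and '>6' groups mentioned in the text, not the exact Figure 1 setup.

```python
def bin_losses(losses, edges=(3.0, 6.0)):
    """Count per-sample L1 losses per group; the bin edges are assumed
    from the '0-3' and '>6' groups mentioned in the analysis."""
    groups = {"0-3": 0, "3-6": 0, ">6": 0}
    for v in losses:
        if v <= edges[0]:
            groups["0-3"] += 1
        elif v <= edges[1]:
            groups["3-6"] += 1
        else:
            groups[">6"] += 1
    return groups
```

Comparing the resulting histograms for clean and noise-injected training runs reproduces the kind of distribution shift described above: noisy data-label pairs push more samples into the '>6' group.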

LOSS FUNCTION RELAXATION STRATEGY
Since different noisy data/label problems have different impacts on training face video-based depression models, it is difficult or even impossible to obtain the structures and distributions of the potential noises. As discussed in Section 3, one consistent effect of such noises is that they cause larger loss values. However, directly removing potential noisy samples/labels whose loss values exceed a certain threshold T would largely reduce the number of training samples and degrade the generalization capacity of the trained model. Instead, we propose a novel loss function relaxation strategy that specifically reduces the loss values caused by each level of noise during model training. Figure 2 shows an example of the impact of our relaxation loss function, while Figure 6 and Algorithm 1 illustrate the details of our loss relaxation strategy.

The main novelties of the proposed relaxation strategy are that (1) it is the first simple and generic strategy that can jointly address the negative impacts of various noises on face video-based depression analysis; (2) in contrast to the previous solutions discussed in Section 2, which can only deal with a specific noise problem for a specific task, our strategy can jointly address various noisy facial data and annotation problems for the face video-based depression analysis task; and (3) it differs from previous noise-robust facial analysis loss functions (e.g., the Huber loss, the Sub-center loss [10], etc.) that have a fixed form, as our relaxation strategy can be flexibly applied to customize various existing classification and regression loss functions in a fast and simple manner.

Suppose a differentiable classification or regression loss function for training depression models is L(P, G), where P denotes the prediction and G denotes the ground truth. Then, the proposed loss relaxation strategy is formulated as

L_R(P, G) = L(P, G) − Σ_{n=1}^{N} f_n(L(P, G), P_T, P_W),    (2)

where P_T = {T_1, T_2, ..., T_N} and P_W = {W_1, W_2, ..., W_N} are two sets of parameters representing thresholds and incremental noisy factors that describe the contributions of the noises to the loss values (explained in Section 4.2); f_1, f_2, ..., f_N are differentiable functions aiming to reduce the impacts of noises at N levels, which can be defined as

f_n(x, P_T, P_W) = W_n (x − T_n − g_n(x − T_n)) if x > T_n, and 0 otherwise,    (3)

where g_n(x) is a continuous and differentiable function conditioned on

lim_{x→0^+} g_n(x) = 0.    (4)

Specifically, we use a set of thresholds T_n (n = 1, 2, ..., N) to categorize each obtained loss value into N parts, corresponding to N noise levels. For the part above the threshold T_n, we define the percentage of the loss value L(P, G) > T_n that is caused by noise as W_n, where Σ_{n=1}^{N} W_n = 1. We call W_n an incremental noisy factor, as it describes the difference of the noise level (percentage) between the loss values above T_{n−1} and the loss values above T_n, which can be computed as

W_n = Prob(L(P, G) ∈ Noise | L(P, G) > T_n) − Prob(L(P, G) ∈ Noise | L(P, G) > T_{n−1}),    (5)

where L(P, G) ∈ Noise denotes that the loss value is caused by noisy data or labels, and thus Prob(L(P, G) ∈ Noise | L(P, G) > T) is the probability that a loss value is caused by noise given that its value is above T.

Loss differentiation analysis: According to Equation (2), Equation (3) and Equation (4), the proposed loss relaxation strategy L_R has three properties: (1) L(P, G) is differentiable; (2) each f_n() is differentiable, as it is continuous at the threshold T_n and g_n() is differentiable; and (3) the term Σ_{n=1}^{N} f_n(L(P, G), P_T, P_W) is differentiable, as both L(P, G) and f_n() are differentiable and continuous functions. Consequently, the proposed relaxation strategy L_R is differentiable and can be directly applied to customize any standard differentiable loss function for training various types of ML models.
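As an illustration of the strategy, the following NumPy sketch relaxes per-sample loss values at N noise levels: for each level n, the fraction W_n of a loss value's excess above the threshold T_n is replaced by a compressed version g_n(excess). This is one plausible reading of the described behaviour (values below every threshold are fully retained, values above are partially reduced), not the published implementation.

```python
import numpy as np

def relaxed_loss(loss, thresholds, weights, gs):
    """Hypothetical sketch of the relaxation strategy: for each noise
    level n, replace the noise-attributed fraction weights[n] of the
    excess above thresholds[n] with the compressed value gs[n](excess)."""
    loss = np.asarray(loss, dtype=float)
    relaxed = loss.copy()
    for t_n, w_n, g_n in zip(thresholds, weights, gs):
        excess = np.maximum(loss - t_n, 0.0)     # part of the loss above T_n
        relaxed -= w_n * (excess - g_n(excess))  # reduce only that part
    return relaxed
```

With a square-root g, one threshold T = 2 and W = 1, a clean sample with loss 1 is untouched, while a suspicious sample with loss 6 is reduced to 2 + √4 = 4; the relaxed loss stays continuous at the threshold because g(0) = 0.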
Computational complexity: Suppose that the time complexity of the loss function L(P, G) is O(f_1), the time complexity of the function f_n() is O(f_2) and the time complexity of the operations involving P_T and P_W is O(1). There is also a loop Σ_{n=1}^{N} in Equation (2), which evaluates the function f_n(L(P, G), P_T, P_W) N times. As a result, the computational complexity of the full relaxation loss function can be formulated as

O(L_R) = O(f_1) + N × (O(f_2) + O(1)) = O(f_1 + N f_2).

Consequently, the overall computational complexity of Equation (2) depends on the time complexities of the loss function L and the function f_n(), as well as the number of noise levels N. In other words, there is a tradeoff between the computational complexity of the proposed relaxation strategy and its effectiveness; e.g., if we set N = 1, the customized relaxation loss function L_R has the same-level computational complexity as the original loss function L(P, G).

Adaptive Relaxation Parameter Computation
As explained above, the proposed relaxation strategy is partially decided by its parameters T_n and W_n, which we propose to compute automatically and adaptively for every training epoch as follows.

Adaptive threshold T_n: Let D(k) denote the distribution of the training loss values obtained at the kth epoch, and let Q_n(D(k)) denote the loss value at the nth pre-defined quantile of D(k). Since D(k) is not available before the kth epoch is completed, the threshold T_n(k) of the kth epoch is estimated from the loss distributions of the preceding epochs as

T_n(k) = Q_n(D(k − 1)) + e(dQ_n(D(k))/dk),

where e(dQ_n(D(k))/dk) represents the estimated first-order difference between Q_n(D(k)) and Q_n(D(k − 1)). When applying a fixed training strategy to train a certain model, we assume that the differences (e.g., dQ_n(k)) of the loss value distributions between consecutive training epochs are continuous and smooth, and thus can be further estimated by higher-order differences. In short, we propose to estimate e(dQ_n(D(k))/dk) via a sum Σ_{m=1}^{M} of higher-order differences of Q_n computed over the preceding epochs, where the term Σ_{m=1}^{M} only depends on the loss distributions of the previous k − 1 epochs and can therefore be obtained before the kth epoch. As a result, the adaptively computed T_n divides each loss value into two parts: the impact of the loss value part above Q_n(D(k)) is reduced by f_n(), while the loss value part below Q_n(D(k)) is fully retained.
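A minimal sketch of the adaptive threshold update, assuming the simplest case M = 1 (a single backward difference of the quantile between the two preceding epochs); the article's estimator may combine more difference terms.

```python
import numpy as np

def adaptive_threshold(losses_prev, losses_prev2, q=0.9):
    """Estimate T(k) = Q(D(k-1)) + estimated quantile drift, where the
    drift dQ/dk is approximated by one backward difference (M = 1)."""
    q_prev = np.quantile(losses_prev, q)    # Q(D(k-1))
    q_prev2 = np.quantile(losses_prev2, q)  # Q(D(k-2))
    return q_prev + (q_prev - q_prev2)      # extrapolate to epoch k
```

Since training losses typically shrink from epoch to epoch, the extrapolated threshold drifts downward with them instead of staying fixed at its initial value.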
Adaptive incremental noisy factor W_n: Since it is almost impossible to obtain the distribution of loss values caused by noise in real-world depression datasets, Equation (5) cannot be directly computed. Instead, we propose to approximate the incremental noisy factors W_n(k) of the kth epoch based on the loss distribution of its previous ((k − 1)th) epoch: as training proceeds and the model fits the clean samples better, it becomes more confident that the loss values above T_n(k − 1) are caused by noise, and such loss values are therefore assigned a larger incremental noisy factor.
By applying the proposed adaptive parameter computation strategies, Equation (2) dynamically adapts to the model's generalization capability during training and can be re-written as

L_R(k)(P, G) = L(P, G) − Σ_{n=1}^{N} f_n(L(P, G), P_T(k), P_W(k)),

where L_R(k) denotes the relaxation loss function of the kth epoch, and P_T(k) = {T_1(k), ..., T_N(k)} and P_W(k) = {W_1(k), ..., W_N(k)} represent the corresponding automatically estimated thresholds and incremental noisy factors.

Example Relaxation Loss Functions for Training Depression Models under Noisy Real-world Conditions
Although setting a large N and specifically designing a set of g_n() functions would theoretically allow a relaxation loss function to remove the negative impacts of almost all noisy data and annotations, such settings may lack generality and require considerable effort.
In this sense, we only set N = 1 in Equation (2) and employ three simple, generic and reproducible functions g() for our evaluation: a linear function, a square-root function and a cube-root function. The resulting three example relaxation loss functions for depression analysis are denoted as RF-1 (L_R1), RF-2 (L_R2) and RF-3 (L_R3), respectively.
These relaxation loss functions can partially reduce the loss above T_1, with the relaxation loss function L_R1 retaining the largest share of the extreme losses and L_R3 removing most of them. Importantly, not only is the computational complexity of these relaxation loss functions kept at the same level as the original loss function, but also only one parameter, the initial threshold T_1(1), needs to be decided at the beginning of model training (i.e., the threshold and incremental noisy factors are automatically adjusted during training).
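The three example functions can be sketched as follows for N = 1. Only the function families (linear, square root, cube root) are specified in the text, so the linear scale factor 0.5 and the exact way g() enters the relaxation are our assumptions for illustration.

```python
import numpy as np

# Assumed g() choices for the three example relaxation losses:
G_FUNCS = {
    "RF-1": lambda x: 0.5 * x,  # linear: retains the most of the extreme losses
    "RF-2": np.sqrt,            # square root
    "RF-3": np.cbrt,            # cube root: removes most of the extreme losses
}

def relax(loss, t1, w1, name):
    """N = 1 relaxation: only the excess above T_1 is reduced."""
    excess = max(loss - t1, 0.0)
    return loss - w1 * (excess - float(G_FUNCS[name](excess)))
```

For a loss of 10 with T_1 = 2 and W_1 = 1, RF-1 yields 6, RF-2 about 4.83 and RF-3 about 4, matching the described ordering from most to least retained.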

Comparison with the Huber Loss
The proposed relaxation strategy has a similar objective to the Huber loss; i.e., both aim to reduce the impact of outliers/noise. In particular, the Huber loss can be formulated as

L_δ(P, G) = ½ (G − P)² if |G − P| ≤ δ, and δ(|G − P| − ½ δ) otherwise,

where δ is a threshold deciding the boundary between normal loss values and abnormal (i.e., extremely large) loss values, playing a similar role to the T_n defined in Equation (2). However, the Huber loss is a fixed function that behaves like the MSE loss when |G − P| ≤ δ and like the L1 loss otherwise; i.e., it is quadratic for normal loss values and linear for abnormal ones. In contrast, the proposed relaxation strategy can be applied to customize various differentiable loss functions L(P, G) rather than a fixed combination of MSE and L1. In particular, while the Huber loss only uses a fixed-form function (similar to the MSE loss) to reduce the negative impact of loss values caused by outliers, the term f_n() in our relaxation strategy enables different relaxation loss functions to be customized depending on the task, such as with linear, square-root or cube-root functions (examples are explained in Section 4.3). Moreover, unlike the Huber loss, which uses only one threshold to distinguish normal from noisy loss values, the proposed relaxation strategy considers noise at multiple levels; i.e., we use multiple thresholds to categorize loss values into different noise levels and specifically reduce their impacts accordingly.
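For reference, the standard Huber loss described above can be written directly:

```python
def huber(p, g, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = abs(g - p)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)  # linear branch, continuous at r = delta
```

The single fixed threshold δ and the fixed quadratic/linear pair are exactly what the relaxation strategy generalizes: an arbitrary base loss, multiple thresholds and customizable compression functions.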

EXPERIMENTS
To evaluate the effectiveness of the proposed relaxation strategy for real-world face-based depression analysis, we propose a novel experimental protocol that artificially simulates different noisy real-world conditions based on widely used video depression datasets recorded in the lab (Section 5.1). In particular, six widely used and reproducible DL models are adopted to evaluate the performance of the proposed relaxation strategy (Section 5.2). Section 5.3 comprehensively compares the results achieved by the original Cross-Entropy (CE) and Mean Square Error (MSE) loss functions with the customized relaxation CE/MSE loss functions under various noisy conditions for facial video-based depression classification and severity estimation (regression). We also conduct a set of ablation studies in Section 5.4.

Datasets
In this article, we conduct experiments on a set of noisy face video depression datasets created from two widely used audio-visual depression datasets: AVEC 2013 [56] and AVEC 2014 [55]. The AVEC 2013 corpus contains 150 audio-visual clips, each recording a participant conducting a set of pre-defined tasks. The lengths of these videos range from 20 minutes to 50 minutes, with an average of 25 minutes. The AVEC 2014 corpus contains specific parts of the AVEC 2013 dataset and also has 150 recorded clips, with a total of 300 videos: each clip contains two audio-visual files recording participants conducting two different tasks, i.e., Northwind and Freeform. The duration of these clips ranges from 6 seconds to 4 minutes 8 seconds. Each clip in both datasets is labelled with a Beck Depression Inventory-II (BDI-II) score (ranging from 0 to 63) indicating the depression severity of the subject. In this article, we evaluate the proposed loss relaxation strategy on both depression classification and severity estimation (regression) tasks. Depression classification is a video-level facial behaviour-based four-class (i.e., non-depressed/minimal, mild, moderate and severe depression) classification problem; i.e., we categorize all depression videos based on their BDI-II scores. Meanwhile, we follow previous studies [9,48,53,63] in using BDI-II scores as the labels for the depression severity estimation task. We specifically produce five types of noisy depression video datasets for each AVEC dataset:
-The noisy label datasets are produced by randomly selecting 5%, 10%, 20%, 30% and 50% of the videos and assigning incorrect labels to them.
-The non-face image datasets are produced by adding common failed face detection noise to randomly selected 5%, 10%, 20%, 30% and 50% of the frames of each video; i.e., we replace each selected face image with a black image.
-The incorrect face image datasets are produced by adding incorrect face detection noise to randomly selected 5%, 10%, 20%, 30% and 50% of the frames of each video; i.e., we replace each selected face image with a randomly cropped image patch.
-The incomplete face video datasets are produced by selecting 5%, 10%, 20%, 30% and 50% of the training videos and then randomly removing one or two thin video slices (with durations ranging from 1 second to 8 seconds) from each of them. This simulates the real-world situation where the camera malfunctions and fails to record the video continuously.
-The mixed noisy datasets are produced by jointly adding the aforementioned four types of noise to randomly selected 5%, 10%, 20%, 30% and 50% of the videos, as multiple noises could occur simultaneously in real-world datasets.
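As one concrete example, the noisy-label protocol above can be sketched as follows; the function name and sampling details are our own, while the four-class setting follows the classification setup described earlier.

```python
import random

def flip_labels(labels, frac, num_classes=4, seed=0):
    """Assign an incorrect class to a randomly selected fraction of videos."""
    rng = random.Random(seed)
    noisy = list(labels)
    for i in rng.sample(range(len(noisy)), int(frac * len(noisy))):
        wrong = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(wrong)  # guaranteed to differ from the original
    return noisy
```

The other noise types follow the same pattern: pick a random fraction of frames (or slices) per video and overwrite them with black images, random crops, or remove them entirely.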

Implementation Details
Model settings: In this article, we evaluate our relaxation strategy on six DL models that have been applied to either face video-based depression analysis or other facial behaviour understanding tasks, including three static models (ResNet-50, EfficientNet-b0 [52] and Swin-Transformer [31]) and three spatio-temporal models (C3D [2,9], MTB-DFE [62] and ST-Transformer [40]). For static models, we follow a widely used strategy [24,53] and train them by pairing the video-level label with every frame of the video. For spatio-temporal models, we follow the same strategy as [9] to divide the target video into several equal-length (16-frame) segments, where each segment is then paired with the video-level label. The final video-level prediction for all models is obtained via two strategies: (1) averaging all frame-level/segment-level predictions of the video, as frequently done in [9,63]; and (2) encoding all frame/segment-level features of the video as a spectral vector [48,49] and feeding it to an MLP for the video-level depression prediction. In this article, we simply set Q_1() to the top 10% highest loss value for all relaxation loss functions.
Training details: During training, we individually employ the original CE/MSE losses as well as their three example relaxation functions as the final loss functions to train the depression classification and severity estimation models. For all experiments, we employ Adam [26] as the optimizer, with weight decay and learning rate decay applied every five epochs. All hyper-parameters are individually tuned according to the corresponding task and dataset. For all datasets created from AVEC 2013 and AVEC 2014, we use the 50 training videos for training and choose the best model on the validation set. The final result is obtained by applying the chosen model to the test set.
Evaluation metrics: In this article, we employ classification accuracy to evaluate performance on the depression classification task. We also follow previous studies in using the root mean square error (RMSE) between predictions and labels to measure performance on the depression severity estimation task.
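Both metrics are standard and can be written as:

```python
import math

def accuracy(preds, labels):
    """Fraction of correctly classified videos."""
    return sum(p == g for p, g in zip(preds, labels)) / len(labels)

def rmse(preds, labels):
    """Root mean square error between predicted and ground-truth scores."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, labels)) / len(labels))
```

Accuracy is used for the four-class classification task and RMSE for the BDI-II severity regression task; lower RMSE indicates better performance.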

Depression Classification Results
We first show the benefits of the proposed relaxation loss functions for depression classification under various artificially created noisy conditions in Table 1. It can be seen that our relaxation CE losses achieved performance comparable to the original CE loss under clean conditions on both datasets. In contrast, the original CE loss function performed worse than at least one relaxation loss function under almost all noisy conditions. In particular, for all conditions whose noise levels are equal to or above 20%, all three relaxation loss functions have clear advantages over the original CE loss regardless of the noise type. We also notice that all relaxation loss functions achieved superior classification results over the original CE loss function under all mixed noisy conditions on both datasets. These results validate that our relaxation strategy not only effectively reduces the negative impacts of different types of noisy data and annotation problems (especially high-level noise) but is also particularly powerful at jointly dealing with mixed noise datasets. Table 3 further displays the statistical analysis results assessing whether the proposed relaxation strategy provides significantly different performance from the standard CE loss. Similar results can be observed: for most conditions whose noise levels are equal to or above 20%, all three example relaxation functions show significant advantages over the original CE loss. Moreover, Figure 4 compares the average performances under the four types of noise achieved by models trained with the different loss functions. It can be observed that the original CE loss function allows models to achieve the best performance under clean conditions. However, the performance of such models decreased dramatically, compared to the models trained with the proposed relaxation loss functions, as the noise level increased. This further validates that the proposed relaxation strategy makes the trained model less sensitive to various noises. More specifically, we found that RF-2 and RF-3 are more robust under high-level noise (30% and 50%), as they can largely reduce the impact of large loss values.

Depression Severity Estimation Results
Table 2 compares our relaxation strategies with the original MSE loss as well as the Huber loss on the depression severity estimation task. Models trained with our first relaxation loss (RF-1) outperformed the original MSE loss not only under almost all noisy conditions but also under the clean conditions on both datasets. Meanwhile, models trained with the RF-2 and RF-3 MSE losses behave similarly on the severity estimation task as on the classification task; i.e., they achieved comparable performance to the original MSE loss under clean conditions and superior performance under all noisy conditions whose noise levels were equal to or above 20%, except that RF-2 performed slightly worse under the 20% noisy-label condition. Again, all relaxation loss functions achieved superior depression severity estimation results over the original MSE loss under all mixed noisy conditions on both datasets. Importantly, our RF-1 and RF-2 outperformed the Huber loss under almost all conditions, demonstrating the superiority of the proposed loss relaxation strategy over this robust-loss baseline. Table 4 reports statistical analyses of whether the proposed relaxation strategy significantly enhances performance over the standard MSE loss. Our relaxation strategy significantly improves the MSE loss under high-noise-level conditions (i.e., most conditions where the noise level is equal to or above 30%, especially the mixed-noise and noisy-label conditions).
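The Huber baseline compared above is a standard robust regression loss: it is quadratic for small residuals but only linear for large ones, so a single noisy label contributes far less to the total loss than under MSE. A minimal sketch of the two per-sample losses (standard textbook definitions; the delta parameter is a free choice, and note that Huber uses the conventional 0.5 factor on its quadratic branch):

```python
import numpy as np

def mse(err):
    """Per-sample squared error, as used by the standard MSE loss."""
    return err ** 2

def huber(err, delta=1.0):
    """Per-sample Huber loss: quadratic for |err| <= delta,
    linear (slope delta) beyond it, so outliers grow slowly."""
    a = np.abs(err)
    return np.where(a <= delta,
                    0.5 * a ** 2,
                    delta * (a - 0.5 * delta))

# A residual of 10 (e.g., a badly mislabeled severity score) dominates
# the MSE (100.0) but contributes only 9.5 under Huber with delta=1.
```

This illustrates why a robust loss is a natural baseline here: it already down-weights the large loss values that, as shown in Figure 1, noisy data and annotations tend to produce.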
As illustrated in Figure 5, the proposed RF-1 loss is superior to the original MSE loss for training depression severity estimation models under all conditions, as models trained with RF-1 consistently achieved lower RMSE than models trained with MSE. Moreover, as noise levels increase, the RMSE of models trained with the original MSE loss increased faster than that of all proposed relaxation MSE losses, which again suggests that our relaxation loss functions are more robust to various types and levels of noise. Similar to the depression classification results, models trained with the RF-2 and RF-3 losses are also much more robust under high-level noise.

Relaxation Strategy on State-of-the-art Depression Analysis Loss Function.
We further apply our relaxation strategy to the state-of-the-art depression severity estimation loss function (the distribution loss (DLL) proposed in [68]). Table 5 shows that models trained with the original DLL frequently show superior performance over its relaxation variants under the clean condition and low-noise-level conditions. However, as with the standard MSE and CE losses, the relaxation-DLL loss functions consistently outperformed the original DLL under the various high-noise-level conditions.
In summary, the results achieved for both depression classification and severity estimation tasks validate that our relaxation strategy not only effectively reduces the negative impact of different types of data and annotation noise (especially high-level noise) but is also powerful in jointly dealing with mixed noise; i.e., the proposed relaxation strategy makes the training of depression models robust to various types of noise. More importantly, it can be applied to different loss functions and consistently improves performance under different forms of the f function in Equation (2). In Tables 3 and 4, +/− denotes that there is/is not a statistically significant difference from our approach (significance levels *P < 0.05, **P < 0.01, ***P < 0.001), and (−*) means that the corresponding relaxation loss is significantly worse than the original loss.
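As a concrete illustration of how such a relaxation might wrap an arbitrary per-sample loss, consider the sketch below. The actual RF-1/RF-2/RF-3 forms of the f function are defined in Section 4.3 and Equation (2) and are not reproduced here; the piecewise down-weighting (`thresholds`, `weights`) used in this sketch is a hypothetical stand-in to show the mechanism of reducing large, presumably noise-induced loss values before reduction:

```python
import numpy as np

def relaxed_loss(per_sample_losses, thresholds, weights):
    """Hypothetical relaxation wrapper over per-sample loss values.

    thresholds : ascending list [T_1, T_2, ...] of adaptive thresholds
                 (learned per epoch in the proposed strategy).
    weights    : descending down-weighting factors [w_1, w_2, ...],
                 one per estimated noise level.

    Losses above T_n are assumed to be caused by the n-th noise level
    and are scaled down by w_n before averaging.
    """
    losses = np.asarray(per_sample_losses, dtype=float)
    relaxed = losses.copy()
    for t, w in zip(thresholds, weights):
        mask = losses > t  # samples whose loss exceeds this threshold
        relaxed[mask] = losses[mask] * w  # higher thresholds overwrite
    return relaxed.mean()
```

Because the wrapper only rescales individual loss values, it keeps every training sample (no data is discarded) while limiting the gradient contribution of likely-noisy samples, which matches the design goals stated above.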

Ablation Studies
In this section, we specifically investigate the influence of the main parameter/variable settings and different noisy conditions on the performance of the relaxation strategy in depression analysis.
Effectiveness of the parameter adaptation strategy: Figure 6 investigates the influence of our parameter adaptation strategy, where the blue bars represent the results achieved by manually setting an optimal pair of fixed threshold T1 and incremental noisy factor W1 for the relaxation loss function for the entire training. Here, we take the RF-2 relaxation loss function as the example and conduct a grid search to obtain the best parameter combination for each experiment. It can be observed that adaptively searched parameters do not always beat manually defined parameters under 5% noisy conditions, but they become advantageous as noise levels increase.
The influence of the Q setting: Figure 7 shows that the depression analysis performance is relatively insensitive to the Q function settings for either the relaxation CE loss or the relaxation MSE loss under all conditions, where RF-1 is the most robust, with only small differences under most conditions. In particular, when the setting of Q ranges from 0.85 to 0.95, both depression classification and depression severity estimation results are very stable; i.e., most depression classification performances varied by less than 1%, while most depression severity estimation results varied by less than 0.1 in RMSE. In other words, the proposed relaxation strategy is stable and effective when Q is set within a certain interval.
The influence of the noise type: Tables 6 and 7 investigate the effectiveness of the proposed strategy on different noise types, where all evaluated example relaxation loss functions improved depression classification accuracy compared to the original CE loss and reduced the RMSE of depression severity estimation compared to the original MSE loss on both datasets under all five types of noisy conditions. The improvements are particularly clear under the mixed-noise condition. These results show that the proposed relaxation strategy can jointly reduce the negative impact of different types of noise without requiring knowledge of their distributions.
The influence of the noise level: We also investigate the effectiveness of the relaxation strategy at different noise levels. As Tables 6 and 7 show, our relaxation loss functions do not show significant advantages under low noise levels (i.e., 5%). However, they help the original CE and MSE losses achieve superior performance for all conditions whose noise level is equal to or above 20%. We assume that this is because the relaxation strategy reduces more of the noise-induced impact under high-level noisy conditions (i.e., more loss values are affected by noise), while it is more likely to incorrectly reduce the impact of clean data under low-level noisy conditions. In short, our relaxation loss is more suitable for conditions where the data/labels suffer from relatively high noise levels.

CONCLUSION
DL models have achieved promising performance in face video-based depression analysis tasks. However, facial videos collected in real-world conditions sometimes suffer from various types of noise, which may heavily degrade analysis performance. While existing approaches to noise problems usually suffer from various issues (e.g., high computational cost, reducing the number of training data, or applicability to only a specific application/noise), this article proposes an efficient and effective loss function relaxation strategy that can be applied to various previously unknown noisy conditions for face video-based depression analysis without reducing the number of training data. More importantly, it can flexibly customize existing differentiable classification and regression loss functions, allowing them to reduce the negative impact of loss values caused by noisy training data and annotations. According to our experimental results, we conclude that (1) various types of noise clearly influence the loss values during the training of facial video-based depression models, where higher-level noise tends to cause larger training loss values; (2) noisy annotations have a higher negative impact than data noise; (3) the proposed loss relaxation strategy can flexibly customize different loss functions, helping them achieve better results for both depression classification and severity estimation (regression) tasks under various noisy conditions, which suggests that it makes models more robust to various noisy conditions; (4) the proposed loss relaxation strategy is not very sensitive to the form of its f function, as all three example relaxation loss functions consistently enhanced the results of the original CE and MSE losses under various noisy conditions; and (5) while the proposed parameter adaptation strategy does not rely on time-consuming parameter tuning, it effectively learns task-specific parameters for the proposed relaxation loss functions in every training epoch.
Limitation and future work. The main limitation of the proposed loss relaxation strategy is that it cannot help the original CE and MSE losses achieve clearly superior performance under clean conditions, as the relaxation loss functions would incorrectly reduce the impact of loss values generated by clean samples. Although this problem might be addressed by setting a large N value and manually defining a set of f functions according to the task, this would be time consuming. As a result, our future work will focus on developing an efficient method to automatically define the N value and a set of task-specific f functions during training, which will be evaluated on more face video-based depression analysis datasets. We also aim to extend the relaxation loss strategy to more ML applications.

Fig. 1 .
Fig. 1. Comparison of the distributions of L1 distances between predictions and ground truth during training, obtained under different clean/noisy conditions for the face video-based depression severity estimation task.

Fig. 2.
Fig. 2. Example of applying our relaxation loss function to facial video-based depression severity estimation. While our relaxation MSE loss produces very similar loss values to the standard MSE loss under the clean condition, it suffers a much smaller negative impact from noise.

Fig. 3 .
Fig. 3. Illustration of the impact of our loss relaxation strategy, where the thresholds T1(k1) and T2(k1) computed for the k1-th epoch (left) differ from T1(k2) and T2(k2) computed for the k2-th epoch (right). Our relaxation strategy adaptively learns a set of thresholds (two in this example) for every epoch to delimit loss values caused by multiple noise levels, so that a different function can be applied to the loss values of each estimated noise level, specifically reducing their negative impact.
) without suffering from the complex and time-consuming parameter tuning process.
Adaptive thresholds T_n: We define T_n(k) for the k-th training epoch based on the estimated distribution of the original loss values (i.e., the loss values computed by L(P, G)) at the k-th epoch. Specifically, supposing the loss value distribution of the k-th epoch is D(k) (k = 1, 2, ..., K), the threshold T_n(k) of the employed relaxation loss function in this epoch is computed as T_n(k) = Q_n(D(k)), where the Q_n are a set of manually defined functions that decide T_n(k) from D(k). Specifically, each Q_n(D(k)) is defined by the top M_n% largest loss values; i.e., it assumes that losses whose values are above the threshold Q_n(D(k)) are more likely to be caused by a certain level (the n-th level) of noise. While it is impossible to obtain the real D(k) before the end of the k-th training epoch, we propose to estimate Q_n(D(k)) based on the loss distributions (D(1), D(2), ..., D(k − 1)) obtained in previous epochs, where L_Ave_n−(k − 1) is the average of all loss values below or equal to T_n(k − 1) in the (k − 1)-th training epoch, and L_Ave_n+(k − 1) is the average of all loss values above T_n(k − 1) in the (k − 1)-th training epoch. Since the model gradually fits the clean data, our hypothesis is that if L
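The threshold estimation described above can be sketched as follows. The full update rule combining the L_Ave statistics across epochs is not reproduced in this excerpt, so the sketch below makes a simplifying assumption: each T_n(k) is estimated as a quantile of the previous epoch's per-sample losses, with the `top_percents` argument standing in for the M_n values (a hypothetical parameter name):

```python
import numpy as np

def adaptive_thresholds(prev_epoch_losses, top_percents=(10.0, 5.0)):
    """Estimate relaxation thresholds T_n(k) for the current epoch from
    the loss distribution D(k-1) of the previous epoch.

    Each T_n is the value below which (100 - M_n)% of losses fall, so the
    top M_n% largest losses are treated as likely caused by the n-th
    noise level. Returns thresholds in ascending order of severity.
    """
    losses = np.asarray(prev_epoch_losses, dtype=float)
    return [float(np.quantile(losses, 1.0 - m / 100.0))
            for m in top_percents]

def split_statistics(losses, threshold):
    """Average loss at or below / above a threshold, i.e. the
    L_Ave_n- and L_Ave_n+ statistics used to track how the model
    gradually fits the clean data across epochs."""
    losses = np.asarray(losses, dtype=float)
    below = losses[losses <= threshold]
    above = losses[losses > threshold]
    l_minus = float(below.mean()) if below.size else 0.0
    l_plus = float(above.mean()) if above.size else 0.0
    return l_minus, l_plus
```

Because the thresholds are recomputed from the evolving loss distribution at every epoch, no fixed cut-off has to be tuned by hand, which is the point of the adaptation strategy described above.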

Fig. 4 .
Fig. 4. Average depression classification results of models trained with four types of loss functions on the clean condition and five levels of noisy conditions.

Fig. 5 .
Fig. 5. Average depression severity estimation results of models trained with four types of loss functions on the clean condition and five levels of noisy conditions.

Fig. 6 .
Fig. 6. Average depression classification and regression results of six models trained with the RF-2 loss function on the clean condition and five levels of noisy conditions created from the AVEC 2013 dataset.

Fig. 7 .
Fig. 7. Average depression analysis estimation results of six models achieved by different Q settings on noisy datasets created from AVEC 2013 dataset.

Table 1 .
Average Depression Classification Results (Accuracy) Achieved by Six Baseline Models on Noisy Datasets Created from the AVEC 2013 and AVEC 2014 Depression Datasets. CE denotes the standard cross-entropy loss function, and RF-1, RF-2 and RF-3 represent the three CE-based example relaxation loss functions defined in Section 4.3. All noise types are defined in Section 5.2.

Table 2 .
Average Depression Severity Estimation Results (RMSE) Achieved by Six Baseline Models on Noisy Datasets Created from the AVEC 2013 and AVEC 2014 Depression Datasets. MSE denotes the standard mean square error loss function, while RF-1, RF-2 and RF-3 represent the three MSE-based example relaxation loss functions defined in Section 4.3. All noise types are defined in Section 5.2.

Table 3 .
Statistical Difference between the Depression Severity Classification Accuracy Results Achieved by Six Baseline Models Trained Using Each Relaxation Strategy and the Original Cross-entropy Loss Function on Noisy Datasets Created from AVEC 2013 and AVEC 2014 Depression Datasets

Table 5 .
Average Depression Severity Estimation Results (RMSE) Achieved by Six Baseline Models on Noisy Datasets Created from AVEC 2013 and AVEC 2014 Depression Datasets

Table 6 .
Average Depression Classification Accuracy Improvements Achieved by Six Baseline Models on Noisy Datasets Created from AVEC 2013 and AVEC 2014 Depression Datasets

Table 7 .
Average RMSE Reduction for Depression Severity Estimation Task Achieved by Six Baseline Models on Noisy Datasets Created from AVEC 2013 and AVEC 2014 Depression Datasets