A Novel Active-Learning Based Emotion-Vision-Transformer Network for Expression Recognition

The purpose of this research is to investigate the variability in emotional expression that exists across different types of emotion, as well as across different ages and genders, in adolescents with Cleft Lip and Palate. Past studies suggest that adolescents with oral clefts are not able to express certain emotions clearly, owing to an inability to display correct facial behavior before and during treatment of the affected child's facial behavioral disability. We collected over seven thousand portraits of patients with Cleft Lip and Palate, covering age groups from 7 to 24 years old, containing both male and female samples, and seven basic human emotions: Anger, Disgust, Fear, Happiness, Sadness, Surprise and Neutral. Combined with an Active Learning algorithm, we achieve effective results in detecting the emotions of cleft lip and palate patients. The facial recognition samples show that (1) positive emotions are recognized more accurately than negative emotions, (2) among the four negative emotions, sadness is recognized most accurately, and (3) facial emotions of female samples are relatively harder to identify than those of male samples. We also find that cleft lip and palate patients are less able to express their negative emotions, such as anger and fear, and that the ability to express emotions correctly also differs between genders.


INTRODUCTION
Human facial expression is the fundamental way to express emotions and can be interpreted to reveal mental status. By analyzing the activity of facial expressions, a large amount of valuable information can be summarized and applied to many fields, such as medicine, human-computer interaction, and virtual reality [1]. To read and understand facial expressions, facial expression recognition has long been an active research field [2]. A facial expression recognition system includes three aspects: face detection and localization, expression image acquisition, and facial feature extraction and classification [3]. Face detection is mainly a pre-processing operation that assigns a common coordinate system to the input face image; after the face is located, the expression features are extracted and digitized. The last step maps the obtained expression description to emotions, and the recognition results are grouped into the corresponding categories of Ekman and Friesen's six basic emotion classification model: Anger, Disgust, Fear, Happiness, Sadness and Surprise [4]. Each of these emotions reflects the mental activity of the moment with a unique expression.
Cleft Lip and Palate (CLP) is a very common congenital craniofacial anomaly of the mouth and face, affecting an average of 1 in 600-1000 babies [5]. This congenital facial developmental problem is caused by viral, nutritional, endocrine, or genetic factors to which the fetus may be exposed during the developmental stage [6]. CLP not only seriously affects facial aesthetics and causes respiratory infections, but also prevents the normal expression of emotions due to facial disorders, which can easily cause emotional and mental harm. In particular, severe CLP can negatively affect the patient's diet and nutritional absorption, resulting in stunted growth of the affected adolescents and causing acquired injuries [7]. Additionally, the facial disorder may prevent the patient from using his or her language skills properly, which affects linguistic and communicative ability [8]. Even during the repair period and the year after surgery, communication problems still exist, and the lack of adequate structure, length and mobility of the soft palate can lead to impaired language development, accompanied by problems such as incorrect pronunciation caused by structural abnormalities [9]. Most children with CLP who are treated will tend to have normal language development, but 20-25% of them will require further guided therapy. Furthermore, besides the emotional impact of facial damage, the condition also makes patients vulnerable to discrimination and bullying by peers. Because of the scars of CLP [17], as well as the strong conformity and clustering mentality among peers during adolescence, children with CLP are extremely vulnerable to discrimination and isolation in society and at school [10]. With communication ability weakened by CLP, these children are not able to express their psychological feelings properly and thus cannot channel their emotions in time, which can trigger more serious psychological disorders [11]. Therefore, a proper understanding of their mental and emotional well-being, and help in overcoming their psychological barriers, is especially significant for the healthy growth of children with CLP.
The contributions of this paper are as follows:
• We propose an active-learning based emotion vision transformer network, which can efficiently and accurately recognize the expressions of CLP patients.
• Through the AI model we draw conclusions about cleft lip and palate patients, for example that they are less able to express their negative emotions, which can help us analyze patients' real emotions.
• We provide the ASCS-Emotion Dataset, which can be used for research on expression recognition of CLP patients, shown in Figure 2.
The paper is organized as follows: Section II reviews work related to our research. In Section III, the details of our active-learning based emotion vision transformer network are described. The experimental results of the proposed method are discussed in Section IV. Finally, Sections V and VI give a discussion and a conclusion based on the results.

RELATED WORK
For adolescents with CLP, their true emotions are relatively difficult to identify accurately, and may even be misidentified, given their facial defects and multiple psychological influences [12]. It is difficult to understand the true emotions of patients with CLP or similar facial disorders from their facial expressions alone using traditional questioning or observation methods. Therefore, how to read, analyze and understand their facial expressions efficiently becomes more important and more challenging. Facial Emotion Recognition (FER) is a field of scientific research directed at techniques and methods for recognizing and inferring current emotional states from human facial behavior. According to studies in recent years, the common methods of FER fall into machine learning, deep learning and psychology. Among psychological methods, the Electroencephalogram (EEG) is widely used to study the cognitive process of neural information processing in the human brain [16], i.e. the process of emotion generation in the brain during the expression of self-emotion. The EEG identifies the brain's activity by placing a number of sensors on the tester's scalp [13]. In the field of computer vision, there are multiple current FER technologies, such as the Gabor Transform, Principal Component Analysis, and Support Vector Machines. Currently, an efficient deep learning method is CAER-NET [14], which performs emotion recognition on portrait samples based on facial expressions and contextual features in a joint and boosted manner [15].
As shown in Figure 1, facial expression and context information are extracted from the samples separately and then fed into an adaptive fusion network, which combines and analyzes the two sets of features to obtain the emotion recognition result. In the field of machine learning, a common approach to facial expression recognition is to combine standard machine learning methods, such as convolutional networks, with specific image pre-processing operations [2].
However, at the current stage of applying machine learning and deep learning to facial emotion recognition, some experiments and application scenarios have shown that traditional methods are not ideal when samples are numerous, facial acquisition is irregular, or backgrounds are complex, which can result in low accuracy. Given that the target of our experiment is to analyze the facial expressions of individuals with facial expression disorders, traditional methods cannot meet this requirement well. We therefore merge methods from the computer science and psychology fields and design a reasonable dataset to train a model that can recognize the facial emotions of patients with facial disorders with high detection accuracy.

Psychological Investigation
In the psychological experiment, the mechanisms of emotion regulation in CLP patients were investigated by first analyzing and assessing specific emotion regulation strategies using two different standardized questionnaires [19]. In the first step, the Emotion Regulation Questionnaire (ERQ) was used to assess the "reappraisal" and "suppression" strategies of habitual emotion regulation, differentiated according to the moment at which the test taker underwent an emotional change. Secondly, the Ambivalence over Emotional Expressiveness Questionnaire G-18 (AEQ-G18) was used to measure emotional ambivalence, i.e., the difference between real emotions and outwardly expressed emotions, which has shown good statistical reliability in past experiments [20]. After the questionnaires, the experiment used a standardized FEEL test in which the testers decoded and analyzed the collected facial emotion expressions, together with a computer test to quantify the basic human emotion expressions; the six generated basic emotions were then identified by the participants. The test process is shown in Figure 3. In the end, no difference in facial emotion expression was found between CLP patients and normal individuals in this experiment, and the CLP patients did not show any disturbance affecting emotion regulation and recognition, probably because the healed facial muscles provided better emotion regulation and recognition, and healthy emotional resilience and social competence also helped the patients to actively repair their psychological deficits.

Active learning
In the highly specialized medical field, the annotation and classification of sample images can be more difficult than for ordinary images. Biomedical sample images require specialized knowledge and skills to be labeled by research institutes, so it is difficult to obtain large labeled datasets for convolutional neural networks to learn from [19].
Active learning (AL) is a good solution to the time-consuming and costly problem of labeling biomedical images. The most immediate advantage of AL is its ability to significantly reduce the annotation cost of samples. For AL to pay off, there must first be a need for a large number of annotated samples, such as the emotional classification of each portrait photograph of a CLP patient. The target of active learning is to actively learn from the samples that are hard to learn or contain the most information, and whether a sample is hard enough is judged by the entropy of the model's output probabilities, since entropy describes the uncertainty of the prediction directly. For example, suppose we first classify the pictures into "positive emotions" and "negative emotions", a binary problem. For a picture that is extremely difficult to distinguish, the model's prediction is close to 0.5 positive and 0.5 negative, and the entropy reaches its maximum of one bit. For a particularly simple sample picture, the prediction is close to 100% positive and 0% negative, and the entropy approaches zero. Thus, the higher the entropy the model produces, the more difficult the current sample picture is to classify, and the closer the entropy is to zero, the easier the sample is to classify.
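To make the selection criterion concrete, the sketch below computes the prediction entropy of a classifier over a pool of unlabeled images and returns the most ambiguous ones. It is a minimal illustration in PyTorch; the function and loader names (prediction_entropy, select_most_ambiguous, unlabeled_loader) are ours, not taken from the paper's implementation.

```python
# Minimal sketch of entropy-based sample selection for active learning.
import torch
import torch.nn.functional as F

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the softmax distribution, one value per sample."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)

@torch.no_grad()
def select_most_ambiguous(model, unlabeled_loader, k: int, device="cpu"):
    """Return indices of the k unlabeled images the model is least certain about.

    Assumes the loader iterates the pool in a fixed (unshuffled) order.
    """
    model.eval()
    entropies = []
    for images, _ in unlabeled_loader:          # labels are unused for the unlabeled pool
        logits = model(images.to(device))
        entropies.append(prediction_entropy(logits).cpu())
    entropies = torch.cat(entropies)
    return torch.topk(entropies, k).indices     # highest entropy = hardest samples
```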

Emotion-Vision-Transformer
The Emotion Vision Transformer merges methods from computer vision, natural language processing and psychological emotion classification. We split the original input image into different parts, then flatten them and pass them through the encoding stage and a fully connected layer to classify the input image.
In the first step, we need to define the method of input image cropping. The cropping strategy is based on Areas of Interest (AOI). We define the left eye, right eye, nose and mouth as AOI regions. These regions are not static, as they change regularly when the tester expresses different emotions: for example, when the test taker expresses fear, the pupils dilate, and when the test taker expresses disgust, the eyes follow the frown and deform correspondingly [18].
When the image is split into different parts, each part is flattened into a one-dimensional vector. The size of each vector depends on the original resolution and number of channels of the input. Assuming the input image has resolution (L, W) and N channels, and each image patch has size (P, P), the total number of patches is L*W/(P*P), and each vector has length P*P*N. We use the einops package to implement this process. We then apply a linear transformation to each vector, as a fully connected layer: in the implementation, we initialize a fully connected layer with the chosen output dimension and feed the split images through it as the patch embedding.
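As a minimal sketch of this patch-embedding step, assuming a square patch size P and the einops package mentioned above, the module below rearranges an input with N channels and resolution (L, W) into L*W/(P*P) flattened patches of length P*P*N and projects each with a fully connected layer. The dimensions and class name are illustrative, not from the paper's code.

```python
# Minimal sketch of ViT-style patch embedding with einops.
import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # (B, N, L, W) -> (B, L*W/P^2, P*P*N): split into patches and flatten each one
        self.to_patches = Rearrange(
            "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=patch_size, p2=patch_size
        )
        # Linear projection of each flattened patch to the embedding dimension
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, x):                      # x: (B, channels, L, W), L and W divisible by P
        return self.proj(self.to_patches(x))   # (B, num_patches, embed_dim)

# Example: a 224x224 RGB image gives 224*224/(16*16) = 196 patches, each of length 16*16*3.
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
```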
As shown in Figure 4, the original input is split into a vector set of six patches, and we add an extra vector because the basic vision transformer only implements the Encoder structure and no decoding process is involved.
The extra vector serves as the classification head token and is combined with the original set before encoding. The classification head module is implemented by adding a LayerNorm and two fully connected layers, using the GELU activation function. During the final classification step, we take only the first token, which is the token used for classification, and feed it into the classification head to obtain the final classification result.
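A minimal sketch of this classification head is shown below, under the assumption that the encoder outputs a token sequence whose first position is the added classification vector; the hidden size and class name are illustrative rather than the paper's exact configuration.

```python
# Minimal sketch of the classification-token head: LayerNorm + two FC layers with GELU.
import torch
import torch.nn as nn

class EmotionClassificationHead(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=256, num_classes=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, encoded_tokens):
        # encoded_tokens: (B, 1 + num_patches, embed_dim); index 0 is the class token
        cls_token = encoded_tokens[:, 0]
        return self.head(cls_token)            # (B, num_classes) emotion logits
```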

Emotion-Weighted-Softmax
We designed a method for obtaining the prediction probabilities of multiple emotions, called Emotion-Weighted-SoftMax (EWS), which combines the posterior probability distribution derived from the psychological judgment of the seven emotions. This method converts the linear prediction values into category judgment probabilities. As shown in Figure 5, assuming we obtain a prediction value z_i from the psychological method of emotion recognition, each z_i is substituted into the Softmax, i.e. passed through an exponential function to obtain a set of non-negative values, which prevents positive and negative values from cancelling each other out when summed.
To account for the varying ability of patients with facial disorders to express each emotion, and to balance this difference, the weights of each emotion prediction need to be classified and analyzed. As observed in the sample investigation, the ability of individuals with CLP to express different types of emotions varies; for instance, positive emotions are expressed better than negative emotions, and sometimes the testers may not recognize an emotion command or may have weak sensitivity to a specific emotion. So, when processing the linear prediction values, each value is multiplied by the weight assigned to the corresponding emotion type. The next step is to normalize the sum of all values into the probability of each emotional category, between 0 and 1, yielding results such as P(anger) or P(happy) that represent the probability that the input sample belongs to each emotion, analogous to a likelihood in statistics. After obtaining each new data point, we feed the predicted emotion values of the current sample into the model, compare them with the pre-labelled results, and combine them with the psychological posterior probabilities to derive new weights for adjusting the model [21].
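The sketch below shows one way to realize the weighted-softmax computation described here: the exponentiated prediction values are scaled by per-emotion weights before normalization, so each row still sums to 1. The weight values are purely illustrative, not the weights derived in the paper.

```python
# Minimal sketch of an Emotion-Weighted-Softmax (EWS) style computation.
import torch

def emotion_weighted_softmax(logits: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """logits: (B, 7) linear prediction values; weights: (7,) per-emotion weights."""
    # Subtract the row max for numerical stability; it cancels in the normalization.
    scores = weights * torch.exp(logits - logits.max(dim=-1, keepdim=True).values)
    return scores / scores.sum(dim=-1, keepdim=True)   # each row sums to 1

# Example with hypothetical weights that down-weight emotions CLP patients express less clearly.
emotions = ["Anger", "Disgust", "Fear", "Happiness", "Sadness", "Surprise", "Neutral"]
weights = torch.tensor([0.8, 0.7, 0.7, 1.0, 0.9, 0.8, 1.0])
probs = emotion_weighted_softmax(torch.randn(1, 7), weights)
```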

EXPERIMENT
4.1 Dataset preparation
In the training phase of the model, we first import the FER2013 dataset as the initial training set. It contains grayscale face images that have been automatically registered so that the face is roughly centered and occupies about the same amount of space in each image, and it covers the same seven emotions as our experimental validation. To construct the test dataset for this experiment, we interviewed over 100 volunteers and asked for their permission to collect their face image data. These images were initially processed to ensure the faces were in the middle of the images, achieving the desired effect for testing and training. For each volunteer sample, we allow the picture to contain backgrounds with different levels of lighting, indoor and outdoor scenes, varying background complexity and different collecting devices (Figure 6).
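As one possible way to set up the FER2013 pre-training data, the sketch below uses the torchvision FER2013 dataset class (which expects the Kaggle csv file under the given root) and replicates the single grayscale channel so the images match a three-channel ViT input; the paths and transform choices are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of loading FER2013 for pre-training.
from torchvision import datasets, transforms

fer_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # replicate the single grayscale channel
    transforms.Resize((224, 224)),                 # match the assumed ViT input resolution
    transforms.ToTensor(),
])

fer2013_train = datasets.FER2013(root="data", split="train", transform=fer_transform)
```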
The dataset contains photos of more than 100 testers of diverse ages and genders. The testers' ages range from seven to twenty-four years old, and there are two sets of test images, for males and females, in similar age groups. The testing set contains over seven thousand portrait photos in total, covering seven kinds of basic human emotion. 'Neutral' is used in the testing set as a label for standardized testing results, and the value of 'Neutral' can also serve as an index of recognition error; normally, unclear facial behaviors are grouped into the 'Neutral' set. The other six emotions are based on the psychological classification model defined by Ekman and Friesen (1971), which lists six attributes: Anger, Happiness, Fear, Surprise, Disgust and Sadness. This model determines the categories of the recognition objects and systematically establishes a database of thousands of facial expression images, meticulously describing the facial changes corresponding to each expression, including how the eyebrows, eyes, eyelids and lips move [4]. By integrating these sets of facial emotions, we can extract the attributes and meanings of the testers' emotions, analyze the facial emotions of CLP patients more precisely, and verify the natural emotions they express to the external environment and in interactions [22].

4.2 Data analysis
4.2.1 Emotion Types. Based on the categorical results of the model, the organs within each AOI region must transform when test subjects display different facial emotions, which means CLP patients may have an innate impairment in expressing some emotions [21]. According to the model's classification accuracy on the test subjects' portraits, it is relatively difficult to distinguish between the expressions of fear and surprise, not only in CLP minors but even in cured adults. When surprise and fear are expressed, there is a greater change in the AOI of the mouth compared to the other emotions, because the mouth is usually open; with congenital damage to the facial muscle tissue, there is a significant impairment in actively expressing these two emotions. It is also evident in the model's classification results that 60% of the scared or surprised samples were classified as neutral or incorrectly classified as other emotions, while the small fraction that was correctly identified reached an accuracy of only about 50% to 60%. In contrast, the testers have a better ability to display the emotion 'Happiness'. More than 93% of all samples of happy emotions were accurately identified, with an average accuracy of 95%, because when testers were asked to make happy expressions there was a greater change within the AOI of the eyes and a greater change in the arc of the eyes and eyebrows, producing greater variation in the coordinate data; thus adolescents with CLP and cured patients are able to express 'Happiness' to a great extent. For negative emotions such as 'Disgust' and 'Fear', the expressive ability of female samples is not as prominent as that of males. Among the seven emotions, female patients have the worst ability to express fear and disgust, with 43% of the samples incorrectly categorized as 'Neutral' or 'Anger', while the 'Disgust' and 'Fear' samples that are correctly categorized reach an average accuracy of 52.01%. In contrast, for males an average of 55.23% of the disgust and fear test pictures were correctly categorized, with an average categorization accuracy of 70.06%. For positive emotions, the percentage of correctly categorized 'Happiness' samples was 100% for both male and female test samples. As mentioned in the data preparation section, happy emotions perform better in the facial AOI model; in terms of the accuracy of emotion judgments, the average accuracy for 'Happiness' was 90.38% for males and 91.52% for females, so females express happy emotions slightly better than males.

Dataset contribution
The target of this study is not only to analyze the facial emotion expression of CLP patients but also to compile a dataset that can be used in relevant professional studies. This dataset is divided into two parts, one being the collected samples for this experiment, accounting for 70%. In total, we collected portrait images of seven emotions from 100 volunteers of both genders, with ages ranging from 7 to 24 years old, giving over 7000 face images as part of the dataset. To classify and define the emotions of these faces, we used the AOI model to regionalize the faces and map the different emotion categories onto a 2-dimensional plane. The other part of the image set is mined from the Internet. We collect images of individuals with CLP and cured patients, together with normal human face images in different expression types. The classified portrait images cover seven emotions such as anger, disgust and happiness, which makes them suitable for training facial emotion recognition models. Therefore, the dataset summarized by combining the two kinds of images can be applied efficiently to future research on facial emotion expression.

Innovative Method
By combining a large proportion of specific sample images with a relatively small number of normal images, we construct a dataset with a large amount of data and high generalizability, the ASCS-Emotion Dataset. Since the number of human face images collected for the experiment is not as large as in a standard portrait dataset, we pre-train the model on the 30% of normal emotion expression images and then evaluate it on the 60% of collected annotated CLP photos. We analyze and adjust the structure according to the accuracy obtained for the different emotions. Using AL, the remaining 10% of the dataset, labeled as "most ambiguous", is used to fine-tune the model for higher accuracy and efficiency. This model and dataset are effective for computer vision problems in face recognition, especially for recognizing the facial emotions of patients with facial behavior disorders, such as CLP or congenital facial injuries, more accurately.
The EWS method was designed on top of a transformer-based neural network model, combined with the psychological posterior probability distribution, to identify and differentiate the facial emotion expressions of CLP patients with an accuracy rate of 30%. Meanwhile, during the experiment, the researchers integrated a dataset an order of magnitude larger, divided into two parts: one with manually recorded labels, and the other mined from Google Images. These images cover the seven basic human emotions and can be used as a reference for subsequent studies.

Model Performance
As shown in Table 3, in these experiments we used ResNet34, ResNet50, ViT-Base and ViT-Large, pre-trained on ImageNet, as our backbones, and we found that the ViT backbones achieved the best performance. In our model structure we therefore use ViT-Large as the first option; if we want to balance inference speed, we can choose ViT-Base. We evaluated the models on our ASCS-Emotion Dataset and found that emotion recognition was more accurate for female samples than for male samples, and more accurate for positive emotions than for negative ones.
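As a sketch of how such backbone comparisons are commonly set up, the snippet below builds each ImageNet-pretrained backbone with a fresh 7-class head using the timm library; the paper does not state which library was used, so this is only one plausible implementation.

```python
# Minimal sketch of swapping ImageNet-pretrained backbones for the 7-way emotion classifier.
import timm

BACKBONES = {
    "resnet34": "resnet34",
    "resnet50": "resnet50",
    "vit_base": "vit_base_patch16_224",
    "vit_large": "vit_large_patch16_224",
}

def build_backbone(name: str, num_classes: int = 7):
    """Create an ImageNet-pretrained backbone with a new 7-class classification head."""
    return timm.create_model(BACKBONES[name], pretrained=True, num_classes=num_classes)

# ViT-Large for best accuracy; ViT-Base when inference speed matters more.
model = build_backbone("vit_large")
```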

CONCLUSION
Patients with CLP, even those who have been cured, do not have the same ability as normal people to express emotions correctly through their faces. Therefore, in this paper we propose a transformer-based neural network model, combined with a psychological posterior probability distribution, to design the EWS method for efficiently recognizing the facial emotions of these patients or cured patients. As shown in the results, when comparing facial expression recognition across different genders, ages and emotion types, the differences in recognition are clear, and our models also show high accuracy in analyzing the faces of people with CLP or similar conditions, which has a positive impact on helping researchers understand the real emotions of people with facial disorders. The ASCS-Emotion Dataset also contributes to helping subsequent developers and researchers conduct more far-reaching research.

Figure 2: ASCS-Emotion Dataset samples
Figure 1: The network of CAER-Net, a deep learning method for FER

Figure 3: Psychological method: Schematic drawing describing the steps of the FEEL test

Figure 4: Emotion Vision Transformer model overview

Figure 5: Overview of the Emotion-Weighted-Softmax structure and formula

Figure 6: Portrait samples of a contributing volunteer in seven emotions

Table 1: Sample Status Statistics

Table 2: Emotion classification recognition accuracy of samples

4.2.2 Difference in Gender and Age. After collecting the model's classification accuracy on the test image set, we compared the labels for each of the seven emotions across the different gender and age groups. The sample statistics are summarized in Table 1 and Table 2.

Table 3: Different Model Performance