Advancing Audio Emotion and Intent Recognition with Large Pre-Trained Models and Bayesian Inference

Large pre-trained models are essential in paralinguistic systems, demonstrating effectiveness in tasks like emotion recognition and stuttering detection. In this paper, we employ large pre-trained models for the ACM Multimedia Computational Paralinguistics Challenge, addressing the Requests and Emotion Share tasks. We explore audio-only and hybrid solutions that leverage the audio and text modalities. Our empirical results consistently show the superiority of the hybrid approaches over the audio-only models. Moreover, we introduce a Bayesian layer as an alternative to the standard linear output layer. The multimodal fusion approach achieves an 85.4% UAR on HC-Requests and 60.2% on HC-Complaints. The ensemble model for the Emotion Share task yields the best ρ value of .614. The Bayesian wav2vec2 approach explored in this study allows us to easily build ensembles at the cost of fine-tuning only one model. Moreover, we obtain usable confidence values instead of the usual overconfident posterior probabilities.


INTRODUCTION
In the era of voice-based human-computer interaction devices, the significance of paralinguistics as an integral component cannot be ignored. Paralinguistics is a field dedicated to studying various traits of the speaker. As such, it has gained importance in ensuring effective communication between humans and machines. As these voice-based systems continue to develop, understanding and interpreting paralinguistic cues such as tone, pitch, rhythm, and emotional expressions has become crucial for enhancing the user experience and enabling more natural and intuitive interactions. This year, the ACM Multimedia Computational Paralinguistics Challenge (ComParE 2023) introduced two paralinguistic tasks [20]: Requests, utilising the HealthCall30 corpus (HC-C) [13], and Emotion Share, utilising the Hume-Prosody dataset (HP-C) [5].
When addressing paralinguistic tasks, a popular approach is to employ features extracted from pre-trained models [22]. The organisers of this year's competition presented several solutions as baselines, such as DeepSpectrum [2], AuDeep [1,8] and the ComParE Acoustic Feature Set. Lastly, the popular pre-trained wav2vec2 model [3], which has exhibited remarkable results in various paralinguistic domains [9,11,17,23,25], was also employed as a baseline.
This study follows the trend of leveraging large pre-trained models, particularly wav2vec2 variants, which are trained on extensive audio data and are thus suitable for tasks with limited in-domain data. We experiment with audio-only and hybrid solutions that combine audio and text modalities. What sets our work apart from previous studies is that we aimed to build a model that is able to signal its confidence. Standard, fine-tuned models are usually overconfident in their predictions, and their posteriors cannot be used to assess their confidence level. By connecting a Bayesian output layer [10] instead of the traditional linear layer, we can easily create an ensemble of fine-tuned models at the cost of training only one large network. Moreover, incorporating the Bayesian layer allows us to measure the uncertainty associated with the predictions, providing valuable insights into the decision-making process. For feature extraction, we selected a specific Transformer layer for each task, as an alternative to the commonly used approach of averaging all layers or using the last one. For generating predictions, apart from the standard linear layer, we additionally experimented with a Bayesian linear layer.
In Bayesian neural networks [10], instead of learning single parameter values θ, the goal is to optimise their posterior distribution p(θ | D). Computing the posterior requires the data likelihood, which involves integrating over all possible weights and is therefore intractable. Instead of calculating the exact posterior, we can approximate it using Stochastic Variational Inference (SVI) [12]. SVI defines a parameterised approximate posterior q_φ(θ), where the parameters φ are tuned to bring the approximate posterior closer to the original one. To do so, we need a way to measure the dissimilarity between the two distributions; one such measure is the Kullback-Leibler (KL) divergence. However, estimating the KL divergence would still require the original posterior p(θ | D), which is intractable. Luckily, from the KL divergence a tractable objective can be derived, called the evidence lower bound (ELBO) [24], which is used as the objective function for SVI.
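As a concrete illustration, the variational machinery above can be sketched in a few lines of numpy: weights are sampled from the approximate posterior q_φ(θ) via the reparameterisation trick, and the KL term of the ELBO has a closed form for diagonal Gaussians. The shapes, priors, and initial values below are illustrative only, not the ones used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weights(mu, rho):
    """Reparameterisation trick: w = mu + softplus(rho) * eps, with eps ~ N(0, 1)."""
    sigma = np.log1p(np.exp(rho))  # softplus keeps sigma positive
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps, sigma

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between diagonal Gaussians: the KL term of the ELBO."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
                  - 0.5)

# Variational parameters phi = (mu, rho) of the approximate posterior q_phi(theta);
# 4 input features and 2 classes are illustrative sizes
mu = np.zeros((4, 2))
rho = np.full((4, 2), -3.0)

w, sigma = sample_weights(mu, rho)
x = rng.standard_normal((1, 4))  # one dummy input vector
logits = x @ w

# Negative ELBO = expected negative log-likelihood + KL(q_phi(theta) || p(theta));
# here only the KL term against a standard-normal prior is computed
kl = kl_gaussian(mu, sigma, np.zeros_like(mu), np.ones_like(sigma))
```

In practice, the likelihood term is estimated by Monte Carlo sampling over mini-batches, and the sum of both terms is minimised with a standard gradient-based optimiser.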
The Requests task falls within the broader field of intent recognition, which has seen a shift towards end-to-end solutions [6,21]. However, these solutions still lag behind the pipeline approach, which involves generating transcripts using an automatic speech recognition (ASR) system and then using those transcripts as input for a text-based classifier. Motivated by this, we extracted text-based features using a pre-trained BERT [7] model. Since both the Requests and Emotion Share tasks come with only audio files, we used the Whisper model [18], which produces transcripts that preserve capitalisation and punctuation. To combine the audio and text modalities, we experimented with early and late fusion techniques.
We implemented all the models using the PyTorch framework [16]. Some experiments utilised the SpeechBrain toolkit [19], while the Bayesian experiments employed the implementation described in [14]. The negative log-likelihood was optimised for the Requests task, and the mean squared error (MSE) was used for Emotion Share. For a detailed implementation description and a comprehensive list of hyperparameters, please refer to the code repository.

Requests task
The Requests task involves a classification problem that can be further divided into two binary sub-tasks: HC-Complaints and HC-Requests. Consequently, we have two options for modelling this task: treating it as a combined 4-class classification problem or developing separate binary classification models. For the combined 4-class classification, we merged the HC-Complaints and HC-Requests labels in pairs, resulting in the following classes: "no_affil", "yes_affil", "no_presta", and "yes_presta".
For audio feature extraction, we used the multilingual French ASR wav2vec2 model (wav2vec2-large-xlsr-53-french), which we trained by updating both the convolutional feature encoder and the contextual Transformer layers. The priors for the Bayesian layer were set as the mean and standard deviation of the learned weights from the model trained with the standard linear output layer. To determine the optimal layer for each task, we conducted a layer-wise analysis by fine-tuning the model on a subset of the data. Based on those findings, we used the specific layers in the subsequent experiments.
To generate the transcripts, we used the large Whisper version 2 model (whisper-large-v2), trained on 680k hours of labelled data. To extract the text features, we used the French BERT model [15] (camembert-base). For combining both modalities, we employed the weighted late fusion approach, which calculates the weighted sum of the probabilities with weights tuned using the development set: 1.0/0.9 and 1.0/0.5 (BERT/wav2vec2) for HC-Requests and HC-Complaints, respectively.
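The weighted late fusion step reduces to a weighted sum of the per-class posteriors followed by renormalisation. A minimal sketch, using the HC-Requests weights reported above and made-up probabilities:

```python
import numpy as np

def weighted_late_fusion(p_text, p_audio, w_text, w_audio):
    """Weighted sum of the class posteriors from the two modalities."""
    fused = w_text * p_text + w_audio * p_audio
    return fused / fused.sum(axis=-1, keepdims=True)  # renormalise to probabilities

# Illustrative posteriors for one utterance; weights 1.0 (BERT) / 0.9 (wav2vec2)
p_bert = np.array([[0.7, 0.3]])
p_w2v2 = np.array([[0.4, 0.6]])

fused = weighted_late_fusion(p_bert, p_w2v2, 1.0, 0.9)
pred = fused.argmax(axis=-1)  # final class decision
```

The weights were tuned on the development set; a grid search over the weight ratio is a straightforward way to do this.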

Emotion Share task
To extract the audio features for the Emotion Share task, we experimented with two pre-trained wav2vec2 models. The first one is the large pre-trained multilingual XLSR model [4] (wav2vec2-large-xlsr-53). In the initial experiments, we chose this model due to the multilingual nature of the data. As a second choice, we employed the English pre-trained and ASR fine-tuned version (wav2vec2-large-960h-lv60-self). The justification for this model is that the speakers in the dataset are from the United States, South Africa, and Venezuela, meaning that in the majority of cases, English is being spoken. During training, we optimised both the convolutional feature encoder and the contextual Transformer layers, following a similar approach as for the Requests task. To generate the emotion intensities, we applied a sigmoid function to the logits produced by either the standard linear or the Bayesian linear layer. This normalisation was necessary because the emotion intensities in the dataset were scaled between 0 and 1. Similar to the Requests task, the priors for the mean and standard deviation were taken from the standard linear layer. To investigate the impact of the different layers, we performed a layer-wise analysis by fine-tuning the model on a subset of the data. The preliminary experiments revealed that layer 18 was optimal for the multilingual model and the last layer for the English one.
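The output stage described above, a sigmoid squashing the regression logits into [0, 1] followed by an MSE loss against the annotated intensities, can be sketched as follows (the logits, targets, and number of emotions are made-up values for illustration):

```python
import numpy as np

def emotion_intensities(logits):
    """Map regression-head logits to [0, 1] intensities via a sigmoid."""
    return 1.0 / (1.0 + np.exp(-logits))

def mse(pred, target):
    """Mean squared error, the training objective for Emotion Share."""
    return np.mean((pred - target) ** 2)

logits = np.array([[2.0, -1.0, 0.0]])       # one utterance, three toy emotions
intens = emotion_intensities(logits)        # intensities in (0, 1)
loss = mse(intens, np.array([[0.9, 0.2, 0.5]]))  # made-up target intensities
```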
Transcripts for this task were also generated using the large Whisper model, and the features were extracted using the [CLS] token of the large cased English BERT model (bert-large-cased). To combine the audio and text modalities, we adopted early and late fusion approaches. In the early fusion, the wav2vec2 and BERT models were trained separately. Then, the embeddings from both modalities were concatenated. Finally, a separate multi-output regression model was trained using the concatenated embeddings.
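A minimal sketch of the early fusion scheme, assuming illustrative sizes (1024-d embeddings per modality and an arbitrary number of emotion targets); the regression head here is a single linear map with a sigmoid, standing in for the separately trained multi-output regressor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the per-utterance embeddings produced by the two fine-tuned models
audio_emb = rng.standard_normal((8, 1024))  # 8 utterances, wav2vec2 embeddings
text_emb = rng.standard_normal((8, 1024))   # matching BERT [CLS] embeddings

# Early fusion: concatenate the two modalities per utterance
fused = np.concatenate([audio_emb, text_emb], axis=1)

# Toy multi-output regression head on the concatenated embeddings
# (9 emotion targets is an illustrative number, not the dataset's)
W = rng.standard_normal((2048, 9)) * 0.01
intens = 1.0 / (1.0 + np.exp(-(fused @ W)))  # predicted intensities in (0, 1)
```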

RESULTS AND ANALYSIS

Requests
The results of the experiments conducted on the Requests task are summarised in Table 1. To select the best candidates for the submissions, we conducted evaluations using multiple approaches on the development set. The initial focus was on the 4-class classification problem, where we trained the first two models (models 1 and 2). The results revealed that the Bayesian output layer slightly outperformed the standard linear model on the HC-Requests sub-task while showcasing a more substantial advantage on HC-Complaints. Next, we trained separate models for each sub-task, which demonstrated a significant improvement compared to the 4-class approach (models 3 and 4). In this case, the standard linear model performed slightly better than the Bayesian one on the HC-Requests sub-task, while the Bayesian exhibited a slight advantage on the HC-Complaints sub-task.
Continuing our experiments, we sought to compare text-based solutions with the audio-only models. Since the 2-class training yielded better overall results, we utilised the same setting for the text-based systems. Interestingly, unlike the audio models, the standard linear model demonstrated slightly better performance on the HC-Complaints sub-task, while the Bayesian model showed superiority on the HC-Requests sub-task. Moreover, the text-based solutions demonstrated improvement over the audio-only models for both sub-tasks. Notably, we observed a 0.8% absolute improvement on the development set over the best audio model on the HC-Requests sub-task and 4.9% on HC-Complaints.
In the final experiment, we explored the benefits of combining both audio and text modalities using the late fusion approach (model 7). This fusion of modalities resulted in the best overall performance on the development set, highlighting the advantages of leveraging both audio and text information.
Overall, the aforementioned approaches outperformed the wav2vec2 baseline system on the development set, with the best model (model 7) achieving a 16.6% absolute UAR improvement on the HC-Requests sub-task and 14% on HC-Complaints.
The preliminary experiments conducted on the development set provided us with promising candidates for the final submissions. Notably, for the final submissions, we re-trained the models using both the train and development sets.
Our first selected candidate was the model trained with 4 classes using a Bayesian linear layer (model 2). We chose this model based on its superior performance compared to the model with a standard linear layer. Additionally, we wanted to investigate the potential benefits of training a single model with 4 classes instead of two separate ones. This 4-class Bayesian model achieved an 80.8% UAR. As the following two candidates, we opted for the standard linear and Bayesian linear models trained on audio-only data using 2-class classification (models 3 and 4). The purpose was to compare the performance of the Bayesian and standard linear layers. The results showed that the model with the Bayesian layer achieved better overall performance. Furthermore, the 2-class Bayesian model outperformed its 4-class counterpart, showcasing the benefit of separate models for each sub-task.
The fusion of audio and text modalities (model 7) yielded the best results on the development set. Hence, we selected this hybrid model as our fourth submission to be evaluated on the test set. As evident from the results, this model achieved superior performance on both the HC-Requests and HC-Complaints sub-tasks compared to the other submissions.
For our last submission, we created an ensemble of the four selected models by utilising majority voting to determine the final predictions. Although this model combined all previous submissions, it fell slightly behind the multimodal approach.
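The majority-voting ensemble can be sketched as follows, with made-up votes from four hypothetical models:

```python
import numpy as np

def majority_vote(predictions):
    """Final label = most frequent label across the models (ties go to the lowest label)."""
    predictions = np.asarray(predictions)  # shape: (n_models, n_samples)
    n_classes = predictions.max() + 1
    # Count the votes per class for each sample, then take the most voted class
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)

# Four toy models voting on three samples (binary labels)
votes = [[0, 1, 1],
         [0, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
final = majority_vote(votes)
```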

Emotion Share
Following a similar approach to the Requests task, we conducted preliminary experiments on the development set to select the most suitable candidates for submission. A summary of our experiments can be found in Table 2.
As an initial set of experiments, we explored processing the utterances in segments using overlapping windows. Specifically, we used a segment size of 30 ms with a 10 ms stride, chosen based on preliminary experiments. To address the multilingual nature of the data, we conducted experiments with both the multilingual and English wav2vec2 models. With the segment-based processing approach, we observed better performance from the multilingual model; therefore, the table only presents the results from that model. Notably, the comparison between models 1 and 2 revealed that the standard linear layer performed substantially better than the Bayesian one.
Subsequently, we evaluated the English wav2vec2 model using both standard linear and Bayesian layers (models 3 and 4). The results indicated a slight advantage for the standard linear layer in the Emotion Share task, although the difference was negligible.
Additionally, the segment-based processing with the standard linear layer (model 1) showed slightly better results compared to the English model that processed the entire utterance at once. However, because the Bayesian solution was more stable with the English model, we proceeded with that for the subsequent experiments.
In addition to audio-only techniques, we also explored multimodal solutions by fusing the audio and text systems (models 5 and 6). The results revealed that early fusion slightly outperformed the late fusion approach. Furthermore, both approaches outperformed the audio-only solutions, demonstrating the benefits of leveraging both audio and text modalities.
After obtaining results on the development set, we picked the most promising models to evaluate on the test set. First, we chose the multilingual model with segment-based processing and a linear classification layer (model 1). This model achieved a ρ score of .543, surpassing the baseline performance. Next, we compared the English wav2vec2 models using both the standard linear and Bayesian layers (models 3 and 4). The results demonstrated that the Bayesian approach outperformed the standard linear alternative, which is consistent with the findings in the Requests task. Moreover, both approaches yielded better performance than the segment-based processing solution.
To assess the effectiveness of modality fusion, we selected the hybrid multimodal model with early fusion (model 5) for our next submission. Based on the results obtained on the test set, we concluded its superiority over the audio-only solutions.
For our final submission, we opted again for an ensemble approach. However, instead of combining all previous submissions, we excluded model 1 due to its inferior results compared to the others. The ensemble was created by averaging the intensities of each model's predictions, resulting in the best performance and surpassing the baseline ρ by .1.
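Unlike the majority vote used for Requests, the regression ensemble simply averages the per-emotion intensities of the members. A sketch with illustrative shapes and values:

```python
import numpy as np

def ensemble_average(model_intensities):
    """Average the per-emotion intensities predicted by each ensemble member."""
    return np.mean(np.asarray(model_intensities), axis=0)

# Three toy models, two utterances, three emotions (illustrative values only)
preds = [
    [[0.2, 0.8, 0.5], [0.1, 0.4, 0.9]],
    [[0.3, 0.7, 0.4], [0.2, 0.5, 0.8]],
    [[0.1, 0.9, 0.6], [0.0, 0.3, 1.0]],
]
avg = ensemble_average(preds)  # shape (n_utterances, n_emotions)
```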

Measuring model uncertainty
One big advantage of Bayesian neural networks is their ability to model the uncertainty that comes with the predictions. This brings us a step closer to understanding the reasons behind the predictions, which is more difficult to achieve using standard networks.
To delve into the decision-making process, we plotted the probability density function (PDF) for the probabilities of the correct and wrong predictions, presented in Figure 1. To conduct the experiments, we used the Bayesian wav2vec2 model (model 4 from Table 1). Since Bayesian models allow for sampling different weights, we sampled 500 of them, resulting in that many probabilities per sample. Then, we averaged the probabilities to get one probability per sample. Looking at the density for the HC-Complaints sub-task (Figure 1a), we can notice that the mean of the incorrect predictions is smaller than the mean of the correct ones, indicating that when the model is wrong, it is also less confident about its prediction. A similar trend can be observed for the HC-Requests sub-task (Figure 1b), where the mean of the correct predictions is larger than that of the wrong ones. Additionally, the erroneous predictions have higher variance, indicating higher uncertainty.
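The sampling procedure behind this analysis can be sketched as follows: draw weights from the approximate posterior many times, compute the class probabilities for each draw, and summarise them by their mean and spread. The variational parameters and input below are random stand-ins, not the fine-tuned model's:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_predict(mu, rho, x, n_samples=500):
    """Sample weights from q_phi(theta) n_samples times, softmax each draw's
    logits, and summarise the resulting probabilities by mean and spread."""
    sigma = np.log1p(np.exp(rho))  # softplus
    probs = []
    for _ in range(n_samples):
        w = mu + sigma * rng.standard_normal(mu.shape)
        logits = x @ w
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
        probs.append(e / e.sum(axis=-1, keepdims=True))
    probs = np.array(probs)  # (n_samples, n_utterances, n_classes)
    return probs.mean(axis=0), probs.std(axis=0)

mu = rng.standard_normal((4, 2))   # toy variational means
rho = np.full((4, 2), -2.0)        # toy variational spreads
x = rng.standard_normal((1, 4))    # one dummy utterance embedding

mean_p, std_p = mc_predict(mu, rho, x)  # averaged probability and its spread
```

The averaged probability gives the final prediction, while the spread across samples is the per-utterance uncertainty that the PDFs in Figure 1 summarise over the whole development set.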

CONCLUSIONS
This study addressed the Requests and Emotion Share challenges within the ACM Multimedia ComParE challenge. Our findings highlighted the advantages of incorporating both audio and text modalities in a hybrid approach, surpassing the performance of the audio-only solutions. Through our exploration of different wav2vec2 models and layers, we demonstrated the importance of carefully selecting the appropriate settings for each specific task.
Our multimodal late fusion approach achieved UAR scores of 85.4% and 60.2% for the HC-Requests and HC-Complaints sub-tasks, respectively, corresponding to absolute improvements of 18.2% and 8% compared to the wav2vec2 baseline. Notably, the consistency between the development and test results across all models indicates that they did not overfit, further validating their effectiveness. Similar improvements were observed in the Emotion Share task, where our best-performing model, utilising an early fusion approach, yielded a notable ρ score of .612. Furthermore, an ensemble of our top-performing models resulted in a slight improvement, achieving a ρ score of .614.
By incorporating the Bayesian layer, we observed improvements over the standard linear layer. Moreover, we gained deeper insights into the decision-making process of our models. Through PDF analysis, we observed that the model exhibited reduced confidence in its predictions when making errors. In the future, this property could be exploited to develop systems that can accurately inform users about their confidence and possible mistakes.
Figure 1: PDF for the correct and wrong predictions on (a) the HC-Complaints and (b) the HC-Requests sub-tasks.

Table 1: UARs on the Requests task. The layer column indicates the number of Transformer blocks used for the Requests and Complaints sub-tasks, respectively.

Table 2: Spearman's ρ for the dev and test sets on the Emotion Share task.