Model Confidence Calibration for Reliable COVID-19 Early Screening via Audio Signal Analysis

Advanced sensors in mobile devices can serve as an effective screening tool for COVID-19 diagnosis and an alternative to reverse transcription-polymerase chain reaction (rRT-PCR) tests, particularly in underdeveloped countries. In this study, we present a deep-learning approach that enables rapid COVID-19 diagnosis from cough signals. We then leverage spline calibration to enhance the reliability of predictions by calibrating model confidence. We conduct extensive experiments on the Coswara dataset to demonstrate the effectiveness of the proposed calibration approach for audio signal analysis. Our findings suggest that calibration can substantially enhance the reliability of COVID-19 early detection compared to the uncalibrated model. Furthermore, our spline calibration-based method outperformed other calibration methods, achieving an expected calibration error (ECE) of 0.148, an area under the receiver operating characteristic curve (AUROC) of 0.812, a Brier loss of 0.189, and a logarithmic (log) loss of 0.584. The proposed confidence calibration framework for modern neural networks may enhance the reliability and trustworthiness of mobile healthcare for infectious respiratory disease screening in real-world applications.


INTRODUCTION
The COVID-19 pandemic has emphasized the need for quick and reliable diagnostic methods [5,11]. Traditional diagnostic methods, such as rRT-PCR tests and imaging techniques, have proven valuable but may be limited by resource constraints, time-consuming procedures, and the need for specialized equipment [4,6,7,19,21,29,30,33,34]. Consequently, alternatives are being explored. One promising avenue is the use of audio signals, specifically cough signals, as shown in Figure 1. A cough, which is a natural physiological response designed to clear the airway, often presents biomarkers pertinent to infectious diseases [10]. With recent technological advancements facilitating continuous monitoring of human sounds, cough signals have been successfully used in detecting respiratory diseases such as asthma, pneumonia, lower respiratory tract infections, croup, and bronchiolitis [23].
It is widely recognized that COVID-19 primarily targets the respiratory system, causing noticeable changes in the sound of patients' coughs, breathing patterns, and voice tone. In light of this, recent research has delved into the exploration of cough, breath, and speech signals encapsulated in audio recordings to develop screening tools for COVID-19 using mobile devices, since these signals have long been used as diagnostic tools for other respiratory illnesses [27]. However, there are challenges associated with implementing this technology, such as ambiguity regarding the uniqueness of COVID-19 biomarkers. Although these biomarkers exhibit distinct patterns in cough signals that can differentiate between COVID-19 patients and healthy individuals, it remains unknown whether they differ significantly from those of other respiratory illnesses. Notably, COVID-19 primarily affects the lower respiratory tract, in contrast to most other respiratory diseases, which primarily affect the upper respiratory tract. Therefore, by utilizing feature extraction techniques, it is possible to access biomarker features within the cough signal that indicate lower respiratory inflammation [26].
Another hurdle in implementing this technology is the need to account for variability in cough signals due to external factors like noise, which may impact the model's reliability when employed in real-world scenarios. Machine learning models, although powerful, often lack calibration, leading to inaccurate probability predictions. Ensuring models indicate their potential for incorrect predictions fosters trust in their outputs [8,13].
To address these problems, we propose a confidence calibration approach to enhance COVID-19 diagnosis using cough signals.
Our goal is to develop a reliable and trustworthy diagnostic model for COVID-19 by leveraging Mel-frequency cepstral coefficient (MFCC) features extracted from cough signals. We develop a deep learning-based model for detecting COVID-19 from cough signal data to fulfill the need for mobile and public healthcare. Specifically, we propose a spline calibration-based method to enhance the reliability and confidence of the proposed model for potential adoption in real-world clinical settings. We then empirically compare spline calibration against the uncalibrated model and other calibration methods. The proposed confidence calibration framework for modern neural networks may enhance the reliability and trustworthiness of mobile healthcare for infectious respiratory disease screening in real-world applications.

RELATED WORKS
The pandemic has led to an increase in the development of at-home diagnostic tools [25]. Cough diagnostic tools, in particular, have garnered interest for artificial intelligence projects due to the ease and availability of human audio signals that can be easily uploaded by participants [22,28]. The Diagnosing COVID-19 using Acoustics (DiCOVA) challenge leveraged the Coswara dataset [28] with the goal of encouraging acoustic analysis and diagnosis of COVID-19. A baseline model was provided which consisted of three different classifiers, with the Random Forest (RF) model providing the highest AUROC of 0.71 for the cough signals [20]. The best-performing model, a multi-layer convolutional neural network, was produced by the winning team, the Brogrammers [18]. Other researchers have tackled class imbalance using the Synthetic Minority Oversampling Technique (SMOTE) while extracting various features (including MFCC); explored detecting COVID-19 infection in human respiratory sounds using deep convolutional neural networks, denoising autoencoders, gamma-tone frequency cepstral coefficients, and Inverse-MFCC; and investigated the realistic performance of audio-based digital testing for COVID-19 [2].
Although good performance has been reported, few existing works have addressed confidence calibration, where the predicted probability may fail to represent the true likelihood of correctness. Consequently, the model's performance may not be as reliable when applied in real-world scenarios to enable human intervention. To address this issue, our work calibrates the model confidence using a spline calibration-based method, thereby allowing the confidence of the model's predictions to be visualized and increasing trust and reliability in real-world use. In contrast to existing literature, this study distinguishes itself by employing the Malaya tool to remove periods of silence in each audio signal. Furthermore, it concatenates the MFCC features extracted from shallow and heavy cough signals before training, instead of treating them separately and averaging their predicted results.
Prior studies have also proposed calibration techniques in different domains. Rajaraman et al. [24] proposed calibrating deep learning models to improve medical image classification accuracy, even with imbalanced classes. Guo et al. [8] identified factors (model depth, batch normalization, weight decay) affecting calibration in deep neural networks. They stressed the importance of accurate confidence estimation for model interpretability and user trust [8]. Similarly, Krishnan et al. [14] emphasized that a well-calibrated model should be accurate when confident in its prediction and indicate high uncertainty when likely to fail.

METHODOLOGY

Data Description
The Coswara dataset, compiled by the Indian Institute of Science (IISc) Bangalore, is a comprehensive collection of respiratory sounds, including fast and slow breathing sounds, shallow and heavy cough signals, phonation of vowels, and counting of numbers at various paces, along with the associated metadata. It was developed to facilitate research on the use of respiratory sounds for disease diagnosis and monitoring in individuals with and without COVID-19 and other respiratory illnesses [28]. This study utilized both shallow and heavy cough signals, which consist of three components: opening of the vocal cords, airflow through the open larynx, and re-apposition of the cords. We categorized individuals into two groups: healthy (1433 individuals) and infected (681 cases labeled positive mild, positive moderate, or positive asymptomatic). For our experimental setup, we divided the data into three sets: training (1385 instances), validation (594 instances), and testing (62 instances).
The Coswara dataset presents challenges such as inconsistencies in the lengths of cough signals due to periods of silence, and class imbalance, with the ratio of healthy participants to COVID-19-positive participants being approximately 4:1. This disparity may significantly impact the development of machine learning models for disease diagnosis and monitoring. To resolve the inconsistent cough signal lengths, we utilized the Malaya Speech module [9] and its split-on-silence method to segment each audio signal into chunks based on periods of silence. Specifically, the periods of silence were identified and removed to standardize the length of the cough signals. To address the class imbalance, the training data was oversampled using SMOTE [3] to balance the proportion of COVID-19-positive to healthy patients.
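The two preprocessing steps above can be illustrated with a minimal NumPy-only sketch: silence removal via a simple frame-energy threshold (a stand-in for the Malaya Speech split-on-silence method), and minority oversampling via SMOTE-style interpolation (the core idea of [3], without the k-nearest-neighbour machinery). The frame length, energy threshold, and toy data are illustrative assumptions, not the study's exact settings.

```python
import numpy as np

def remove_silence(signal, frame_len=512, threshold=1e-3):
    """Drop frames whose RMS energy falls below a threshold
    (a simplified stand-in for split-on-silence segmentation)."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    keep = np.sqrt((frames ** 2).mean(axis=1)) > threshold
    return frames[keep].ravel()

def smote_like(X_min, n_new, rng):
    """Generate synthetic minority samples by interpolating between
    random pairs of minority samples (SMOTE's core interpolation idea)."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    lam = rng.random((n_new, 1))
    return X_min[i] + lam * (X_min[j] - X_min[i])

rng = np.random.default_rng(0)
# Toy cough signal: a noise burst padded with silence on both sides.
sig = np.concatenate([np.zeros(2048), 0.5 * rng.standard_normal(4096), np.zeros(2048)])
trimmed = remove_silence(sig)

# Rebalance a 4:1 majority-to-minority ratio by synthesizing 60 new
# minority samples from 20 existing ones (26 features each).
X_min = rng.standard_normal((20, 26))
X_new = smote_like(X_min, 60, rng)
```

In practice the imbalanced-learn `SMOTE` implementation restricts interpolation to each sample's k nearest minority neighbours, which keeps synthetic points inside locally dense regions.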
The MFCC feature extraction was then adopted to extract the relevant features for COVID-19 classification. The computation of MFCC involves several stages, including pre-emphasis (amplifying high-frequency components), framing (dividing the signal into short, overlapping frames), windowing (reducing spectral leakage), Fourier transforms (computing spectral content), the Mel filterbank (approximating the human auditory response), the logarithm (compressing dynamic range), and the discrete cosine transform (decorrelating the filterbank energies). These steps collectively result in a set of coefficients that compactly represent the spectral shape of the sound.
Specifically, we utilized the standard sample rate of 22,050 Hz and extracted 13 MFCC features from every cough signal. The decision to select 13 MFCC features was informed by their ability to effectively capture the lower end of the quefrency axis of the cepstrum, which holds the most pertinent information. The 13 MFCC features from the shallow and heavy cough signals were combined to obtain a total of 26 MFCC features for each patient and then scaled.
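The MFCC stages listed above can be sketched end-to-end with NumPy and SciPy. This is a minimal illustration of the pipeline rather than the study's exact extraction (which would typically use an audio library such as librosa); the frame length, hop size, and number of Mel filters are assumed values.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=22050, n_mfcc=13, frame_len=512, hop=256, n_mels=26):
    # Pre-emphasis: amplify high-frequency components.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing into short, overlapping frames, then Hamming windowing.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    frames = sig[idx] * np.hamming(frame_len)
    # Power spectrum via the FFT.
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # Triangular Mel filterbank (approximates the human auditory response).
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((frame_len + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log compression, then DCT to decorrelate the filterbank energies.
    feat = np.log(power @ fbank.T + 1e-10)
    return dct(feat, type=2, axis=1, norm="ortho")[:, :n_mfcc]

rng = np.random.default_rng(0)
shallow, heavy = rng.standard_normal(22050), rng.standard_normal(22050)
# Average each signal's MFCCs over time, then concatenate shallow + heavy
# to obtain 26 features per patient, as described in the text.
features = np.concatenate([mfcc(shallow).mean(axis=0), mfcc(heavy).mean(axis=0)])
```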

Classification Model
The proposed Deep Neural Network (DNN) model is composed of four densely connected layers for learning non-linear combinations of the input features, with each node in these layers connected to every node in the preceding and following layers. The first three dense layers employ the ReLU activation function, which introduces non-linearity into the model at low computational cost. To mitigate overfitting, the proposed model interleaves three dropout layers, encouraging it to learn robust features. The final dense layer adopts the sigmoid activation function, so the model's output is a probability score between 0 and 1; this score is subsequently used for binary classification. During training, the selected hyperparameters were a batch size of 16, 100 epochs, binary cross-entropy as the loss function, and the Adam optimizer [12] with a learning rate of 0.001.
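A minimal NumPy forward pass illustrates the architecture just described. The hidden-layer widths and dropout rate below are assumptions, as the paper does not report them, and a real implementation would use a deep learning framework trained with the stated Adam optimizer and binary cross-entropy loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Four dense layers mapping 26 MFCC features to one probability.
# Hidden widths (64, 32, 16) are illustrative; He-style initialization.
sizes = [26, 64, 32, 16, 1]
params = [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x, train=False, drop_rate=0.3):
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)            # ReLU on the first three dense layers
        if train:                      # inverted dropout, active only in training
            mask = rng.random(h.shape) > drop_rate
            h = h * mask / (1.0 - drop_rate)
    W, b = params[-1]
    return sigmoid(h @ W + b)          # probability score in (0, 1)

probs = forward(rng.standard_normal((4, 26)))
```

Thresholding `probs` at 0.5 yields the binary healthy/COVID-19 decision.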

Model Calibration
Model calibration is a critical process in data modeling and statistical analysis aimed at improving the consistency of model predictions in real-world settings. Its goal is to increase the model's predictive power, identify areas of uncertainty or risk, and provide more reliable and valid insights into complex datasets. During calibration, factors such as algorithms, input data, and model structure are considered, and the model's outputs are adjusted to better align with observed data using statistical techniques and analytical tools.
The reliability diagram, also called the calibration curve, is a tool used to provide a qualitative description of calibration. It divides the predicted probabilities into a fixed number of bins $M$, each of size $1/M$, and plots them against the actual outcomes to measure the accuracy of model predictions. Let $B_m$ denote the set of sample indices whose predicted confidences fall into the interval $I_m = \left(\frac{m-1}{M}, \frac{m}{M}\right]$, for $m \in \{1, 2, \dots, M\}$. The accuracy of bin $B_m$ is given by

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i),$$

where $\hat{y}_i$ and $y_i$ are the predicted and true class labels for sample $i$. The average confidence in bin $B_m$ is defined as

$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i,$$

where $\hat{p}_i$ is the predicted confidence for sample $i$. In an ideally calibrated model, $\mathrm{acc}(B_m)$ is expected to equal $\mathrm{conf}(B_m)$ for all $m \in \{1, 2, \dots, M\}$.
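The per-bin quantities above can be computed directly. The sketch below returns the accuracy and mean confidence of each bin for a 10-bin reliability diagram; the bin count and toy inputs are illustrative.

```python
import numpy as np

def bin_stats(conf, correct, n_bins=10):
    """Per-bin accuracy acc(B_m) and mean confidence conf(B_m),
    i.e. the quantities plotted in a reliability diagram."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    acc = np.full(n_bins, np.nan)
    avg_conf = np.full(n_bins, np.nan)
    for m in range(n_bins):
        # Samples whose confidence falls into the interval ((m-1)/M, m/M].
        in_bin = (conf > edges[m]) & (conf <= edges[m + 1])
        if in_bin.any():
            acc[m] = correct[in_bin].mean()
            avg_conf[m] = conf[in_bin].mean()
    return acc, avg_conf

# Toy example: two low-confidence and two high-confidence predictions.
conf = np.array([0.15, 0.15, 0.85, 0.85])
correct = np.array([0.0, 1.0, 1.0, 1.0])
acc, avg_conf = bin_stats(conf, correct)
```

Bins with no samples are left as NaN and are simply omitted from the diagram.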
Model calibration is essential in ensuring that data models provide useful and accurate insights, helping to make informed decisions. In this study, we propose using spline calibration to improve our model's confidence and reliability. In addition, we also describe several other calibration methods for model confidence, including Platt calibration, beta calibration, and isotonic calibration.
The spline calibration method [17] involves fitting a natural cubic spline to the predicted probabilities. The algorithm uses a set of knots to define the spline basis and determine the optimal calibration function; the knots are chosen carefully to ensure that the calibration function is smooth and well-behaved. The calibrated probability can be written as

$$P(y = 1 \mid x) = \sum_{k=1}^{K} \beta_k \, \phi_k(x),$$

where $P(y = 1 \mid x)$ represents the calibrated probability of the positive class given the input $x$, $K$ denotes the number of basis functions (knots) used in the spline calibration, $\beta_k$ denotes the coefficient associated with each basis function, and $\phi_k(x)$ represents the value of the $k$-th basis function evaluated at the input $x$. One advantage of spline calibration is its flexibility compared to piecewise constant or sigmoid functions. Among the other calibration methods, Platt calibration [16] assumes that the relationship between the predicted probabilities and the true probabilities follows a sigmoidal curve; beta calibration [15] leverages a richer class of calibration maps based on the beta distribution; and isotonic calibration [32] fits a monotonically increasing, piecewise constant function to the classifier's score that minimizes the mean squared error.
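For concreteness, the two simplest of these baselines can be sketched with scikit-learn: a Platt (sigmoid) map fitted via logistic regression on held-out scores, and an isotonic map. Spline calibration itself is not part of scikit-learn (a reference implementation accompanies [17]), and the toy scores below are illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy held-out set: raw scores from an over-confident classifier, with true
# positive rates that vary more gently than the scores suggest.
scores = rng.random(1000)
y = (rng.random(1000) < 0.25 + 0.5 * scores).astype(int)

# Platt calibration: fit a sigmoid mapping from score to outcome.
platt = LogisticRegression().fit(scores.reshape(-1, 1), y)
p_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic calibration: a monotone, piecewise-constant mapping
# minimizing squared error against the observed outcomes.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, y)
p_iso = iso.predict(scores)
```

In deployment the calibration map is fitted on a held-out validation set, never on the data used to train the classifier, to avoid simply memorizing its errors.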

Evaluation Metric
We use the confusion matrix and AUROC to measure the performance of the proposed method on COVID-19 classification. In addition, we introduce the log loss, the Brier score, and the ECE for the evaluation of model confidence calibration.
The log loss measures the quality of predicted probabilities. It specifically tells us how well a classification model predicts the true probabilities of the classes. It is also known as cross-entropy loss or logistic loss.
The Brier score is used to evaluate the performance of probabilistic predictions made by a classification model. It measures the mean squared difference between the predicted probabilities and the actual binary outcomes (0 or 1), and ranges from 0 (perfect) to 1 (worst). The ECE quantifies the degree of inconsistency between predicted probabilities and actual outcomes. It is computed as the weighted average, over confidence bins, of the absolute difference between the average predicted confidence and the accuracy in each bin.
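The three calibration metrics can be written out directly for the binary case. This sketch uses the positive-class probability as the confidence, a common simplification of the ECE for binary problems; the toy labels and probabilities are illustrative.

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """Cross-entropy between true labels and predicted probabilities."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def brier_score(y, p):
    """Mean squared difference between probabilities and binary outcomes."""
    return float(np.mean((p - y) ** 2))

def ece(y, p, n_bins=10):
    """Weighted average of |accuracy - confidence| over confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    correct = ((p > 0.5).astype(int) == y)
    total = 0.0
    for m in range(n_bins):
        in_bin = (p > edges[m]) & (p <= edges[m + 1])
        if in_bin.any():
            total += in_bin.mean() * abs(correct[in_bin].mean() - p[in_bin].mean())
    return total

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.8, 0.9])
b = brier_score(y, p)  # mean of (0.01, 0.04, 0.04, 0.01) = 0.025
```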

RESULTS AND DISCUSSION

COVID-19 Classification
After training, the proposed deep neural network achieved an overall AUROC score of 0.812 on the test data. This result indicates that the extracted MFCC features display distinctive patterns capable of differentiating between healthy individuals and those diagnosed with COVID-19. To mitigate any potential sample bias resulting from our limited test data, we conducted K-fold cross-validation with K = 3 and K = 5, achieving average AUROC scores of 0.823 and 0.829, respectively. Additionally, our model exhibits strong predictive capability, accurately identifying 21 out of 25 positive samples. Of the 37 negative samples, the model correctly classified 23, further attesting to its effectiveness. A comparison of our approach with existing literature can be seen in Table 1.

Model Calibration
DNN. To address this research question, we plotted the confidence histogram and reliability diagram of the uncalibrated model and of the model calibrated with the spline calibration method. This provides valuable insight into how certain our model is about its predictions. From Figure 2, we observe that the predictive power of our model improved after calibration: the confidence increased when the model was sure of its prediction and decreased when it was uncertain. Based on the evaluation metrics, the spline-based calibrated model outperformed the uncalibrated model, with an ECE score of 0.148, an AUROC of 0.812, a Brier loss of 0.189, and a log loss of 0.584. These results can be seen in Table 2, demonstrating that the spline-based calibrated model was better than the uncalibrated model.
Additionally, we assessed the efficacy of our proposed calibration approach in comparison to other calibration methods. The reliability diagram for each method is visualized in Figure 5, and Table 2 outlines the performance metrics for each calibration approach. Our proposed method emerged as the top performer in terms of ECE, representing the best-calibrated result. Meanwhile, the Platt method excelled in log loss, and the uncalibrated model achieved the lowest Brier loss.

XGBoost
To further demonstrate the effectiveness and generalizability of the spline calibration-based method in other machine learning frameworks, we conducted an additional experiment on the same Coswara dataset using eXtreme Gradient Boosting (XGBoost). From Figure 3, we observed that the predictive power of XGBoost improved after the application of the spline calibration method.
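The train/calibrate/test workflow used in this experiment can be sketched as follows. Here scikit-learn's `GradientBoostingClassifier` stands in for XGBoost and an isotonic map stands in for the spline calibrator, so this illustrates the held-out calibration pattern rather than the exact method; all data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for the 26 MFCC features and binary labels.
X = rng.standard_normal((600, 26))
y = (X[:, 0] + 0.7 * rng.standard_normal(600) > 0).astype(int)

# Train on one split, fit the calibrator on a second, evaluate on a third.
X_tr, y_tr = X[:300], y[:300]
X_cal, y_cal = X[300:450], y[300:450]
X_te = X[450:]

gbt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
p_cal = gbt.predict_proba(X_cal)[:, 1]
calibrator = IsotonicRegression(out_of_bounds="clip").fit(p_cal, y_cal)
p_te = calibrator.predict(gbt.predict_proba(X_te)[:, 1])  # calibrated test probs
```

The key point is that the calibrator wraps any probabilistic classifier's scores, which is why the same spline map transfers from the DNN to a gradient-boosted model.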

Model Interpretation
We utilize t-Distributed Stochastic Neighbor Embedding (t-SNE) [31] to visualize the distribution of test samples in the latent space. This allows us to evaluate how effectively the model captures relevant features and separates the classes (Figure 4). Upon analysis, we observed that our uncalibrated model successfully learned discriminative information, enabling it to differentiate between the two classes; as a result, it exhibited fewer errors when classifying healthy patient samples. Conversely, after calibration, the model placed greater emphasis on the COVID-19 samples, resulting in fewer misclassifications compared to the uncalibrated model.
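The projection step can be sketched with scikit-learn's t-SNE on toy 26-dimensional vectors; the two synthetic clusters below merely stand in for the healthy and COVID-19 latent representations, and the perplexity is an assumed value.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two toy clusters standing in for healthy / COVID-19 latent representations.
healthy = rng.standard_normal((30, 26))
covid = rng.standard_normal((30, 26)) + 4.0
X = np.vstack([healthy, covid])

# Project the 26-dimensional representations onto 2D for visualization.
emb = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(X)
```

The two rows of `emb` corresponding to each cluster can then be scattered in different colors to produce a figure like the paper's Figure 4.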

Discussion
Though designing an efficient neural network or optimizing hyperparameters can enhance model performance, the resulting model may still be inadequately calibrated, meaning that the predicted probabilities may not accurately reflect the true likelihood of correctness [8,13,35].
Based on the reliability diagram, the ECE, and the model interpretation, it is evident that the model's confidence improves with the implementation of spline calibration. Considering the ECE, which measures the consistency between the predicted probabilities and the actual outcomes, the spline calibration-based method was the best-performing method among the calibration approaches used in this study. The reason is that spline calibration offers a flexible, non-parametric approach to calibrating predicted probabilities. Compared to parametric methods like Platt and beta calibration, spline calibration does not assume any specific relationship between the input scores and the output probabilities, an assumption that can be limiting. Although both the spline and isotonic methods are non-parametric, spline calibration tends to perform better because it fits a cubic spline rather than a piecewise constant function, allowing it to capture complex relationships between the predicted scores and the true probabilities. Additionally, the fact that all calibration methods achieve nearly the same AUROC illustrates why AUROC is not an appropriate metric for evaluating calibration. AUROC measures the model's ability to discriminate between the positive and negative classes, and it does not take into account the calibration of the predicted probabilities. Therefore, a model with a good AUROC can still have poorly calibrated predicted probabilities, leading to poor decision-making in practice. It is crucial to use calibration-specific measures, such as the ECE, Brier loss, log loss, and the reliability diagram, to assess the calibration of the predicted probabilities. As a next step, we propose extending our approach to more evenly balanced datasets and validating its effectiveness externally.

CONCLUSION
In this paper, we present a machine learning and calibration-based approach to increase the reliability of COVID-19 diagnosis using cough signals. Calibrating the model with the spline calibration method accounts for external variations that may occur during prediction and demonstrates its feasibility for detecting COVID-19. We evaluated our method on the imbalanced Coswara dataset using the MFCC features extracted from the shallow and heavy cough signals and compared its performance with several baseline methods. We observed that our method achieved comparable performance to existing literature with improved calibration results. It can effectively handle imbalanced data and has the potential to enhance the reliability and trustworthiness of audio-based COVID-19 diagnosis systems. Our study highlights the importance of model calibration in increasing the reliability and trustworthiness of predictions and may encourage the adoption of mobile healthcare for screening infectious respiratory diseases.

Figure 1 :
Figure 1: The proposed framework aims to facilitate early and continuous screening of COVID-19 using mobile devices, ensuring prompt detection and intervention. It comprises three key components: mobile-based screening as an alternative to traditional rRT-PCR testing, result calibration, and model explainability.

Figure 2 :
Figure 2: The confidence histogram and reliability diagram compare the uncalibrated model (left) and the spline-based calibrated model (right). The calibrated model exhibits a higher degree of reliability with a lower ECE.

Figure 3 :
Figure 3: The reliability diagram presents a side-by-side comparison of the uncalibrated XGBoost model (left) and the spline-based calibrated XGBoost model (right). The results highlight that the spline-based calibrated XGBoost model outperforms the uncalibrated model, achieving the best ECE and well-scaled probabilities. This indicates that applying spline calibration significantly enhances the model's predictive reliability.

Figure 4 :
Figure 4: t-SNE plots of the latent representations, showing how samples from COVID-19 and healthy patients are distributed in the uncalibrated (left) and calibrated (right) models.

Table 1 :
Comparison with the state-of-the-art models on COVID-19 classification.

Table 2 :
Comparison of different model calibration methods with baseline uncalibrated models.