Zero-dimensional biomarker-based medical action recognition: towards more explainable AI in healthcare

Medical action recognition is an increasingly necessary task as healthcare has shifted towards digital and more remote methods of patient monitoring. To remotely assess a patient with computer-aided diagnosis, it is first necessary to identify different actions. In tandem with action recognition, this remote monitoring also requires biomarker identification, which allows meaningful features to be extracted from the patient's actions; however, each action often requires a different set of features to be monitored. Finally, the mode of data collection requires both portability and accuracy to be used in the healthcare industry. Combining each of these, one of the best solutions for condition-related action recognition is using videos together with human pose estimation to create spatio-temporal skeleton data for the patients, which allows further analysis to be fast and accurate. By first using manual feature extraction on the skeleton sequences, it is possible to utilise machine learning both to classify different actions and to identify latent features as the biomarkers that distinguish each of the action classes. Thus, this work proposes a new zero-dimensional feature extraction method to classify skeleton sequences and extract medically meaningful, explainable, and objective biomarkers that can be used for the diagnosis and monitoring of patients in a remote setting.


INTRODUCTION
As the world has shifted towards the use of modern computing in medicine and healthcare, primarily due to the coronavirus pandemic of 2019, telehealth and mHealth have been at the forefront of the healthcare industry, allowing patients to be seen in record numbers at a time when in-person appointments are not the only option [10]. This has increased the need for computer vision in healthcare applications; although the use of computer vision in medical imaging has been one of the driving factors in deep learning, this application requires the analysis of natural images, an area which has often been avoided [1,8]. One of the main issues facing the healthcare industry is a lack of objectivity in determining the current health of musculoskeletal patients; currently, the gold-standard method of assessing a patient is a physical examination of movement together with patient-reported outcome measures (PROMs) [12]. The problem with these methods is their subjective nature: studies have shown the ineffectiveness of these PROMs and the need for new objective measurements [5].
In this regard, one method of extracting meaningful and medically relevant data about a person from natural imaging is human pose estimation (HPE); however, this field has mostly been applied in animation and surveillance applications [14]. Recent approaches have utilised methods from biomechanics, such as kinematics and joint dynamics primarily used in sports analyses, as a new application to the healthcare industry for remote monitoring of movement disorders and for computer-aided diagnosis (CAD) [4]. In addition, the field of RGB-based human pose estimation has vastly improved in terms of both performance and the accuracy of the skeleton sequences produced. One such method is BlazePose, which can create a 3D skeleton sequence in real time at 31 frames per second while running on low-power devices such as mobile phones, allowing these methods to be deployed in real-world scenarios [18].
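For illustration, a minimal sketch of extracting such a 3D skeleton sequence with MediaPipe's BlazePose implementation is shown below; this is an assumed deployment pipeline, not the capture method used for the dataset in this study (the NTU-RGB+D skeletons were produced with the Kinect SDK).

```python
# Minimal sketch: per-frame 3D landmarks from an RGB video using
# MediaPipe's BlazePose. Illustrative deployment pipeline only; the
# skeletons analysed in this study come from the Kinect SDK.
import cv2
import mediapipe as mp

def video_to_skeleton(video_path: str):
    """Return a list of per-frame landmark lists (33 joints, x/y/z)."""
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    capture = cv2.VideoCapture(video_path)
    sequence = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_world_landmarks:  # metric 3D landmarks, hip-centred
            sequence.append([(lm.x, lm.y, lm.z)
                             for lm in result.pose_world_landmarks.landmark])
    capture.release()
    pose.close()
    return sequence
```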
Machine learning has often been used in cutting-edge biomarker identification techniques in biomedical applications, ranging from disease identification to identifying targets in drug discovery [9,11,13]. Now that movement can be quantified using human pose estimation, it is also possible to apply similar techniques to this movement data to identify motion biomarkers for disease tracking or disease classification.
This new approach to healthcare and CAD does require additional feature extraction, through which new significant biomarkers can be found to better monitor or diagnose patients. Traditional approaches utilise dimensionality reduction to extract statistically significant information from larger datasets; one example used in biomarker identification is Principal Component Analysis [17]. However, there are now more advanced methods, such as SHapley Additive exPlanations (SHAP), to tackle the feature engineering; more importantly, the recent trend towards interpretable machine learning brings a deeper understanding of the learned features that directly indicate the conditions and relevant decisions [15]. On the other hand, using more advanced AI and machine learning methods in the healthcare industry can create ethical dilemmas when using the biomarkers produced. This is a result of the lack of trust in AI and deep learning models that use a 'black box' approach, whereby the inner workings of the model are unknown; thus the results of the models can lack the trustworthiness required for healthcare or medical applications [18].

METHODOLOGY

Data
The NTU-RGB+D dataset provides RGB-D videos of a large set of actions performed by multiple subjects at three different camera angles and three different camera heights; these videos were then processed into skeleton sequences using depth-based human pose estimation with the Microsoft Kinect SDK [13,16].
In this study, the primary actions of interest are sneeze/cough, staggering, falling, headache, chest pain, back pain, nausea/vomiting, fan self, yawn, stretch oneself, and blow nose. Each of these actions was performed by up to 106 different subjects of varying demographics, with a total of 32 different camera arrangements. This resulted in a total of 11,603 skeleton sequences of varying lengths, where each skeleton consists of 25 joint positions in a 3D coordinate system.
However, this data must first be pre-processed for the feature extraction pipeline, which transposes each skeleton sequence into a tabular dataset with one column per joint position in each of the three axes. This pre-processing follows the method of Armstrong et al., 2023, in which the pipeline was first utilised [3].
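A sketch of this transposition step is given below, under the assumption that each sequence arrives as an array of shape (frames, 25 joints, 3 axes); the column naming is illustrative rather than that of [3].

```python
# Sketch: flattening one skeleton sequence (frames x 25 joints x 3 axes)
# into a tabular layout with one column per joint coordinate. Column
# names are illustrative; the actual pipeline follows Armstrong et al. [3].
import numpy as np
import pandas as pd

def sequence_to_table(skeleton: np.ndarray) -> pd.DataFrame:
    """skeleton: (n_frames, 25, 3) -> DataFrame of shape (n_frames, 75)."""
    n_frames, n_joints, n_axes = skeleton.shape
    columns = [f"joint{j}_{axis}" for j in range(n_joints) for axis in "xyz"]
    return pd.DataFrame(skeleton.reshape(n_frames, n_joints * n_axes),
                        columns=columns)

table = sequence_to_table(np.random.rand(120, 25, 3))  # e.g. a 120-frame clip
```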

Feature Extraction
The feature extraction pipeline has two primary components: the biomechanics/kinematics calculations and the zero-dimensional feature calculations performed on the biomechanics. Firstly, the rotational joint kinematics of the most relevant joints of interest were calculated, including multiple planes of motion, using methods presented in our previous works [3,4]. Secondly, zero-dimensional features are calculated from the spatio-temporal kinematics data, reducing both the spatial and temporal domains into a single value. This feature calculation uses our own previously reported calculations, the total range of motion for each joint kinematic, and finally zero-dimensional statistics such as the standard deviation and median values of each kinematic for each action.
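As an example of one such rotational kinematic, the sketch below computes a joint angle per frame from three 3D joint positions (e.g. hip, knee, and ankle for knee flexion); the exact joint set and planes of motion follow [3,4], and the indices here are hypothetical.

```python
# Sketch: the angle at a middle joint per frame, computed from three 3D
# joint positions (e.g. hip-knee-ankle for knee flexion). Joint indices
# are hypothetical, not the NTU-RGB+D joint mapping.
import numpy as np

def joint_angle_series(skeleton: np.ndarray, a: int, b: int, c: int):
    """skeleton: (n_frames, 25, 3); returns the angle at joint b in degrees."""
    v1 = skeleton[:, a] - skeleton[:, b]  # vector from b towards a
    v2 = skeleton[:, c] - skeleton[:, b]  # vector from b towards c
    cos = np.sum(v1 * v2, axis=1) / (np.linalg.norm(v1, axis=1)
                                     * np.linalg.norm(v2, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```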
After the feature extraction was performed for each skeleton sequence, the results were collated and transposed into a final tabular dataset whereby the rows consist of each sequence and the columns present each zero-dimensional feature. This feature list includes the range of motion, mean, median, standard deviation, smoothness, rotational impulse, variance, skewness, kurtosis, and energy for each of the 17 kinematics produced. A visualisation of this dimension-reduction step can be seen in figure 1, whereby each of these features was calculated in four different tumbling windows.
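A sketch of this windowed reduction is shown below; the statistical features match the list above, while the smoothness and rotational impulse shown use placeholder definitions (negative mean squared jerk and the integral of angular speed) rather than necessarily the formulations of [3,4].

```python
# Sketch: reducing one kinematic series to zero-dimensional features over
# n tumbling (non-overlapping) windows. "smoothness" and "rot_impulse"
# use placeholder definitions, not necessarily those of [3,4].
import numpy as np
from scipy.stats import kurtosis, skew

def zero_d_features(angle: np.ndarray, n_windows: int = 4, dt: float = 1 / 30):
    features = {}
    for w, win in enumerate(np.array_split(angle, n_windows)):
        velocity = np.gradient(win, dt)
        jerk = np.gradient(np.gradient(velocity, dt), dt)
        features.update({
            f"w{w}_rom": win.max() - win.min(),
            f"w{w}_mean": win.mean(),
            f"w{w}_median": np.median(win),
            f"w{w}_std": win.std(),
            f"w{w}_variance": win.var(),
            f"w{w}_skewness": skew(win),
            f"w{w}_kurtosis": kurtosis(win),
            f"w{w}_energy": np.sum(win ** 2),
            f"w{w}_smoothness": -np.mean(jerk ** 2),                 # placeholder
            f"w{w}_rot_impulse": np.trapz(np.abs(velocity), dx=dt),  # placeholder
        })
    return features
```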

Action Recognition
Classical machine learning (ML) methods such as random forest or XGBoost classifiers perform remarkably well on tabular datasets, even though deep learning has been more prevalent recently. The nature of the classic ML methods in terms of feature engineering makes the models highly explainable. These methods can be used thanks to the tabular data format created by the manual feature extraction, allowing the actions to be predicted from the features. The primary machine learning algorithms investigated in this study are the random forest classifier from the scikit-learn library and the XGBoost classifier from the XGBoost library; both utilise a large number of decision trees alongside majority voting and weighted voting to predict the class from a given set of features [16].
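A sketch of fitting both classifiers on the resulting feature table is shown below; the placeholder data mimics the study's dimensions (10 zero-dimensional features x 17 kinematics x 4 windows = 680 columns over 11,603 sequences), and the split parameters are illustrative.

```python
# Sketch: fitting both classifiers on the zero-dimensional feature table.
# Placeholder random data stands in for the real features; the split
# parameters are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = np.random.rand(11603, 680)            # placeholder feature table
y = np.random.randint(0, 11, size=11603)  # 11 medical action classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(X_train, y_train)
xgb = XGBClassifier(n_estimators=1000, eval_metric="mlogloss")
xgb.fit(X_train, y_train)
```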
To assess the action classification accuracy, a standard accuracy metric is used to calculate the fraction of correct predictions. This calculation can be seen in equation 1, where $y$ is the ground truth, $\hat{y}$ is the predicted class, and $n_{\text{samples}}$ is the number of sequences, which accounts for the multi-class predictions:

$$\text{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=1}^{n_{\text{samples}}} \mathbb{1}(\hat{y}_i = y_i) \qquad (1)$$
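Continuing the fitting sketch above, the same metric is available as scikit-learn's accuracy_score:

```python
# Sketch: equation 1 via scikit-learn, continuing the fitted models above.
from sklearn.metrics import accuracy_score

print("RF accuracy:  %.3f" % accuracy_score(y_test, rf.predict(X_test)))
print("XGB accuracy: %.3f" % accuracy_score(y_test, xgb.predict(X_test)))
```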

Feature Importance
In this study, two methods of determining feature importance were used. Firstly, recursive feature elimination with cross-validation was used to iteratively reduce the number of features used in the random forest until a desired number of features remained while the accuracy of the model was unchanged [9]. Secondly, SHAP values were computed from the final random forest model to determine each feature's importance to each of the different classes. These feature importance values add validity to the biomarkers: their importance to the machine learning model supports the assumption that they are meaningful in determining which biomarkers will be suitable to track or observe for each action.
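A sketch of both routes, continuing the fitted models above, might look as follows; the RFECV step size and estimator settings are illustrative.

```python
# Sketch: the two feature-importance routes, continuing the sketches above.
# RFECV prunes features while cross-validated accuracy holds; TreeExplainer
# yields per-class SHAP values. Step/cv settings are illustrative.
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

selector = RFECV(RandomForestClassifier(n_estimators=200, random_state=42),
                 step=10, cv=5, scoring="accuracy")
selector.fit(X_train, y_train)
print("features retained:", selector.n_features_)

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)  # per-class attributions
```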

RESULTS

Classification
The predictions from the XGBoost and random forest classifiers are shown in table 1 and figure 2. Table 1 reports the accuracy metrics following gold-standard techniques [6]. The confusion matrices in figure 2 show both the true positives of each class label prediction and the false positives; this indicates both the accuracy and which actions are predicted similarly to each other, and in this case, actions 9 and 11 appear to be similar to each other. Additionally, figure 3 shows the receiver operating characteristic (ROC) curve, which shows the performance of the XGBoost and random forest models by plotting their true positive rates against their false positive rates. These plots also include a chance line, indicating that the accuracy of both models is not due to chance; the ROC curves in this case show strong performance for both models.
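A sketch of producing such a confusion matrix and one-vs-rest ROC curves, continuing the fitted models above, is given below.

```python
# Sketch: confusion matrix and one-vs-rest ROC curves analogous to
# figures 2 and 3, continuing the fitted models above.
import matplotlib.pyplot as plt
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.preprocessing import label_binarize

print(confusion_matrix(y_test, xgb.predict(X_test)))

y_bin = label_binarize(y_test, classes=list(range(11)))
scores = xgb.predict_proba(X_test)
for c in range(11):
    fpr, tpr, _ = roc_curve(y_bin[:, c], scores[:, c])
    plt.plot(fpr, tpr, label=f"class {c} (AUC={auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], "k--", label="chance")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```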
Finally, considering both algorithms, the results in table 1 and figures 2 and 3 show that they have similar prediction performances. The overall class prediction performance shows that the XGBoost algorithm performed slightly better; however, table 1 shows that the random forest classifier does produce some predictions with a higher accuracy.

Feature Importance
To determine the importance of the extracted features, the histogram of SHAP values can be seen in figure 4, whereby each feature is ranked by its importance for each predicted class. This shows that each algorithm ranked the features differently.
Figure 4 a), for example, shows that the XGBoost classifier produced biomarkers which are more action-specific: the range of motion has the highest SHAP value when predicting the stretching action, whereas mean arm abduction is more important for the falling action.
On the other hand, figure 4 b) shows that the biomarkers identified by the random forest are more generalised across actions. These feature importance values show that, in each of the windows, the median shoulder angle is the most important feature for several different actions.
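A plot analogous to figure 4 can be produced from the SHAP values computed in the earlier sketch; note that, depending on the shap version, the values may arrive as a list of per-class arrays or a single 3D array.

```python
# Sketch: ranked per-class SHAP importances analogous to figure 4, using
# the values computed in the feature-importance sketch above. Depending
# on the shap version, shap_values is a list of per-class arrays or a
# single 3D array.
shap.summary_plot(shap_values, X_test, plot_type="bar",
                  class_names=[f"action {i}" for i in range(11)])
```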
In addition, the results of the recursive feature elimination, when applied to a dataset without the tumbling windows, identify the minimum number of features that can be used, i.e. how far the feature set can be reduced without altering the random forest classifier's accuracy.

Model Tuning
To further fine-tune the model, a grid search with cross-validation can be performed to select the hyper-parameters that produce the best classification model. In this case, the best results were produced with the following hyper-parameters: max depth of 21, number of estimators of 1000, log loss criterion, and a balanced subsample class weight.
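A sketch of this search is shown below; the grid brackets the values reported above and is illustrative rather than the exact grid searched.

```python
# Sketch: grid search with cross-validation over the random forest
# hyper-parameters. The grid brackets the reported best values and is
# illustrative, not the exact grid searched in the study.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={
        "max_depth": [11, 21, 31],
        "n_estimators": [500, 1000],
        "criterion": ["gini", "log_loss"],
        "class_weight": ["balanced", "balanced_subsample"],
    },
    cv=5, scoring="accuracy", n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
```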

DISCUSSION
Using the proposed feature extraction pipeline to produce the zero-dimensional features allows a wide range of machine-learning algorithms to be used on the dataset. The feature extraction pipeline as described also presented ideal candidates for the biomarker identification pipeline, producing motion-based biomarkers that describe the actions. In this study, these extracted biomarkers have been used to predict medical actions from given spatio-temporal skeleton sequences.
It is noteworthy that the precision of most predicted classes is relatively weak in comparison to those associated with movement. For example, class 0 is a sneeze/cough and classes 3-7 consist of feelings of pain and nausea, which often do not have an associated movement. For actions with substantial movement, such as staggering or falling down, the accuracy metrics show considerable promise, given that the amount of data used is considerably lower than the dataset provides due to the reduction in dimensionality.
This study therefore shows the potential of this zero-dimensional feature extraction pipeline, especially when used with machine learning algorithms to understand which features are more important in each action. Thus, these features can be used as biomarkers that can monitor the movement capabilities of a subject either after treatment or over the course of a long rehabilitation process.
To identify zero-dimensional biomarkers from spatio-temporal skeleton sequences, it is important to note that the features will be dependent on the accuracy of the pose estimation method. However, the NTU-RGB+D dataset uses depth images along with the RGB images to produce more accurate skeletons, with results comparable to even the gold-standard marker-based motion capture techniques [2].
Previous studies have shown the statistical significance of the zero-dimensional features and their power in monitoring the rehabilitation of patients; Armstrong et al., 2023 show that both the smoothness and the rotational impulse are statistically significant in determining whether a patient has had a treatment to reduce knee pain [4]. The results presented in figure 4 provide more potential biomarkers that can be used in the monitoring of patients in both the long and short term. This can be investigated further by applying the outlined methods in a longitudinal setting: by adding different time points, one can assess the identified biomarkers in a disease-tracking application. However, there are no publicly available motion datasets with longitudinal aspects, which limits this further work. These kinematics and zero-dimensional biomarkers also provide a way to remove the subjectivity of, and the need for, the currently used patient-reported outcome measures. The next step is integrating this feature extraction and machine learning into an end-to-end solution that accurately extracts the pose from 2D RGB videos and produces statistically significant and medically meaningful biomarkers. However, when applying these biomarkers to disease classification, a specific set of actions must be used that can capture a wide variety of musculoskeletal conditions. Similar methods have been performed on larger longitudinal datasets to produce biomarkers from a combination of quantified MRIs and clinical outcome measurements, identifying biomarkers associated with the development of osteoarthritis [11]. Therefore, there is a reasonable assumption that the methods described in this study can be applied for diagnostic purposes if performed on motion datasets that have disease classification labels.
Another avenue for further investigation is the effect of changing the size of the tumbling window. As well as producing more detailed biomarkers by adding time points back into the dataset, this would also produce more data for the classification algorithms and could lead to increased performance. However, this increase can have an adverse effect, as more windows would result in smaller windows with less information per window. In terms of applying these biomarkers in a clinical setting, there are several technical limitations that must be addressed: the ethics of storing videos of patients, the data storage requirements, the data transfer speed, and the computing power required to process the videos. The storage of videos can be mitigated by deploying a real-time human pose estimation solution, whereby the videos themselves are not stored and the zero-dimensional anonymised data is stored instead. In addition, the data transfer speed can be mitigated through the use of a high-efficiency video encoder such as HEVC, which can reduce the size of a video by 60% without a noticeable decrease in video quality [7]. Finally, the computing power required can be addressed in one of two ways: developing a mobile app that collects and processes the videos using a lightweight human pose estimation solution, or deploying to a server which can process any recorded video file.

CONCLUSION
To conclude, the methods and pipelines outlined have shown themselves to be useful in the field of medical action recognition and fall detection. These extracted biomarkers also have the potential to be used for objective measurement of patient outcomes. In contrast to many recent studies and their focus on deep learning, these methods show that the features come from an explainable AI model rather than a black box. Therefore, the methods and biomarkers produced carry more trust, which is essential for both medical professionals and the general public if this work is to be applied.

Figure 1 :
Figure 1: Visualisation of the spatio-temporal feature extraction and tumbling window approach to separate the curve into n windows, where n=4. In this case, the lines represent the knee flexion during an action, with the blue line being the left knee and the red line the right.

Figure 2 :
Figure 2: Confusion Matrices showing the prediction accuracy for each predicted medical action class for both the XGBoost and Random Forest Classifier

Figure 3 :
Figure 3: ROC curve showing the rate of true positives vs the rate of false positives for both the XGBoost and Random Forest algorithms, with a chance line to show that the models' accuracy is not due to chance.

Figure 4 :
Figure 4: Feature importance histogram produced using the ranked SHAP values for both XGBoost and Random Forest algorithms, with each colour in the histogram showing the importance for each class

Table 1 :
Accuracy metrics of each of the medical actions defined by the NTU-RGB+D 120 dataset, showing the precision, recall, and f1-score for each action prediction