Efficient and Direct Inference of Heart Rate Variability using Both Signal Processing and Machine Learning

Heart Rate Variability (HRV) measures the variation of the time between consecutive heartbeats and is a major indicator of physical and mental health. Recent research has demonstrated that photoplethysmography (PPG) sensors can be used to infer HRV. However, many prior studies had high errors because they only employed signal processing or machine learning (ML), or because they indirectly inferred HRV, or because there lacks large training datasets. Many prior studies may also require large ML models. The low accuracy and large model sizes limit their applications to small embedded devices and potential future use in healthcare. To address the above issues, we first collected a large dataset of PPG signals and HRV ground truth. With this dataset, we developed HRV models that combine signal processing and ML to directly infer HRV. Evaluation results show that our method had errors between 3.5% to 25.7% and outperformed signal-processing-only and ML-only methods. We also explored different ML models, which showed that Decision Trees and Multi-level Perceptrons have 13.0% and 9.1% errors on average with models at most hundreds of KB and inference time less than 1ms. Hence, they are more suitable for small embedded devices and potentially enable the future use of PPG-based HRV monitoring in healthcare.


Introduction
Heart Rate Variability (HRV) measures the variation of the time intervals of consecutive heartbeats and is a major indicator for health conditions, such as coronary artery disease, heart failure, hyperlipidemia, and hypertension [40].HRV is traditionally measured using electrocardiographic (ECG) devices, which record the heart's rhythm.However, ECGs can be expensive and require attaching several electrodes to human bodies, which can be inconvenient to use.
As an alternative to ECG, Photoplethysmography (PPG) sensors, which monitor light signal changes in blood flows, can also be utilized to measure heart rate (HR) and HRV [37,45].A PPG sensor can be placed on the human skin to provide HR/HRV readings and is more convenient to use.Many studies demonstrated the potential of PPG sensors in HRV monitoring [14,39].However, these studies usually face the following four limitations, restraining their application to small embedded devices and healthcare in the future.
First, some of these studies relied on only signal processing for HRV estimation [6,45], which may negatively affect the inference accuracy.PPG sensor signals typically suffer from large noises, particularly, the noises from motion artifacts (MA).To infer HRV with good accuracy, these noises must be removed or reduced.Although signal processing techniques can remove many noises (including MA), these relatively static techniques may not be able to handle all types of noises, leading to low HRV accuracy.
Second, many studies also only employed machine learning (ML) techniques to infer HRV [11,14].Although many ML models by themselves are sophisticated enough to handle all types of noises, the resulting models can be too large and/or too slow for small embedded devices.Moreover, training and tuning ML models with raw PPG data can be quite challenging.ML models without enough tuning may also have low accuracy.
Third, some prior studies also focused on the inference of the interval lengths between consecutive heart beats [14,41], which are known as RR intervals [13,18].That is, these studies do not direct infer HRV metrics, such as SDNN or RMSSD (more in Section 2).This indirect inference strictly follows HRV's definition.However, due to error amplification, this indirection can significantly increase the errors of the final HRV estimations.
Fourth, many prior studies also relied on small datasets.For example, a popular PPG dataset, the IEEE Signal Processing Cup (ISPC) dataset, has PPG signals recording lasted for only five minutes [45].However, a typical HRV inference requires an ECG monitoring length of half to five minutes [1].Hence, it is difficult to conduct HRV inference studies with short recordings, especially for studies with neural networks.
We have studied ML-based HR estimation using PPG in our prior work [43].In this paper, our goal is to design an HRV inference methodology that can provide high accuracy for resource-constrained embedded devices.To achieve this goal, we designed a compound and direct method that combines signal processing and ML to directly infer HRV (i.e., RMSSD/SDNN).More specifically, we first employed signal processing to remove outliers and noises from raw PPG signals and convert the PPG signals into rough HR readings and HRV readings.These rough HR/HRV readings are then fed into an ML model to infer RMSSD/SDNN.Applying ML after signal processing provides additional/better noise removal, and hence, more accurate HRV estimations.Applying signal processing before ML avoids the need for large and slow ML models.Moreover, the direct inference of RMSSD/SDNN, instead of inferring RR intervals as a proxy, further improves accuracy.
To explore the impact of various ML algorithms, we also evaluated different ML models, including Decision Tree (DT), Random Forest (RF), K-nearest neighbor (KNN), Support vector machines (SVM), and Multi-layer perceptron (MLP).To provide more reliable results, we also collected a new dataset of PPG signals, with ECG readings as ground truth.This dataset contains three 2-hour-long PPG/ECG traces for one human subject performing different activities, including office work, sleeping, and sitting.
Evaluation results show that our compound and direct method has 3.5% to 25.7% errors for various activities and monitoring lengths.Both the lowest 3.5% error for RMSSD and the lowest 5.1% error for SDNN were obtained with a monitoring length of 300 seconds per HRV estimation.Note that, 300-second HRV monitoring can be used for caring for chronic renal failure and diabetes [1], showing the healthcare potential of PPG-based HRV monitoring.The evaluation results also show that our method is significantly more accurate than the signal-processing-only and ML-only methods.
Moreover, the model exploration showed that DT and MLP models are usually smaller with good accuracy, making them more suitable for small embedded devices.DT models have an average error of 13.0% and are usually less than 10KB with inference time less than 10s.MLP models have an average error of 9.1% and are less than 469KB with inference time less than 1ms.These results corroborate the "Rashomon" theory [32] that, for some problems, there exist simple models with good accuracy and meet special requirements, such as the limited memory size of embedded devices in this case.
The contributions of this paper include, • The compound and direct HRV inference methodology combines signal processing and ML to directly infer RMSSD/SDNN to achieve high accuracy with small and fast ML models.
An example of ECG waveform showcasing R peaks (red dots) and RR intervals.
• A systematic exploration of different ML algorithms to study their impact on the accuracy, model size, and time on HRV inference.This exploration showed that Decision Trees and MLP can achieve high accuracy with small/fast models suitable for embedded devices.• A comprehensive PPG/ECG dataset to study HR/HRV inference, which contains traces of different activity intensities lasting for 2 hours.
The rest of the paper is structured as follows.Section 2 discusses the background on HRV and the motivation of our work.Section 3 presents our compound and direct method.Section 4 presents evaluation results.Section 5 discusses related work, and section 6 concludes the paper.

HRV from ECG
Human heart rate, when measured as beat-to-beat intervals, is not constant and varies over time [31].This variation, commonly known as Heart Rate Variability (HRV), is an effective indicator of various health and mental problems [40].The traditional medical device to measure HR and HRV is ECG.ECG records heart activity utilizing electrodes placed at certain skin spots on the human body and produces an electrocardiogram, which is a graph that shows the heart's activity over time.Electrocardiogram contains the QRS complexes information, which is an important waveform in an electrocardiogram that shows the spread of a stimulus through the ventricles [13,18].R peaks, which roughly represent heartbeats, can be computed from the QRS complex.The intervals between R peaks are called RR intervals.Figure 1 gives an example output from an ECG device that shows the heartbeats and RR intervals.Note that, normal RR intervals are also called NN intervals in the literature [40].
HRV is defined as the variation of the RR intervals within a time period.HRV is an important health indicator because it represents the adaptive ability of the heart to unpredictable changing circumstances.There is no one standard or best method to calculate HRV [21].In this work, we focus on two commonly used time-domain linear measures for HRV -the Standard Deviation of RR intervals (SDNN) and the Root Mean Square of Successive Differences (RMSSD) [20].SDNN is usually recommended for overall HRV estimation and represents both sympathetic and parasympathetic modulation of heart rate, whereas RMSSD is recommended for estimating short-term components of HRV and represents parasympathetic activity [24,40].
An SDNN or RMSSD is calculated from the RR intervals in a chosen time window, which is usually between 0.5 and 5 minutes [1], and a sequence of HRVs over 5 minutes to 24 hours are typically used in medical practice [24].The definitions of SDNN and RMSSD are, where   is the  ℎ RR interval,  is the average, and  is the number of RR intervals within the chosen time window.
Although the ECG produces accurate HRVs, attaching electrodes to the human body can be inconvenient for long-term monitoring [28,35].Therefore, in this work, we focus on using PPG instead of ECG.Nonetheless, we did use ECG to collect reliable HRV readings as the groundtruth to train and evaluate our PPG-based HRV solutions.

HRV from PPG
PPG sensors are popular due to their non-invasive nature.They are usually attached to human skins at certain locations, such as fingertips, earlobe, and wrist [9].These sensors utilize infrared light that penetrates the skin to detect changes in the blood circulation -if there is a change in the blood flow, the intensity of the infrared light also changes [30].Hence, by monitoring the light changes, PPG sensors can detect blood flow changes, which in turn, can be used to infer HR/HRV.
The main challenge to using PPG for HR/HRV monitoring is light signal noises, especially the motion artifact (MA), which represents the signal noises due to body/hand movements [15].There are also noises from the environment [9,15] or from the inherent sensor inaccuracy/bias (e.g., sensor sensitivity and calibration issues).Accurate HR/HRV monitoring requires the removal or reduction of these noises [9,45].

Motivation
Although there have been many studies on applying PPG in HRV inference, these studies were usually limited by their methodology and/or by small datasets.To illustrate these limitations, and as a motivation for this study, we explored the HRV inference accuracy of a signal-processingonly method [45] and ML-only method with Convolutional Neural network (CNN) based encoder-decoder [14]  MLP-only, and our compound and direct ("Sig-proc+MLP") methods, using the ISPC dataset.
Figure 2 gives the MAPEs (mean absolute percentage error) of these two methods.For the signal-processing-only method, we reproduced its processing procedures to generate estimations of HRV, which were then compared with the ECG ground truths in the ISPC dataset to calculate the MAPEs.For the CNN-only method, its MAPEs were directly calculated based on results reported in its paper.As Figure 2 shows, both methods have high errors for HRV estimations.Especially for RMSSD, both methods have over 50% error.
We observe four reasons that cause these high errors.
1. First, the static signal processing cannot always effectively remove all noises in the PPG signals.Therefore, for RMSSD, which measures the short-time components of HRV, the remaining signal noises could significantly degrade the accuracy of the signal-processingonly method.Interestingly, the signal-processing-only method's SDNN estimation accuracy is less affected, as SDNN represents long-term variability and is less sensitive to the remaining signal noises.2. Second, the CNN-only method suffers from error amplification.This method does not directly estimate RMSSD or SDNN.Instead, it estimates RR intervals, which are then converted to RMSSD/SDNN using Equations ( 1) and ( 2).However, this conversion amplifies the estimation error.Table 1 illustrates this error amplification, where we generated five sets of RR estimations with random errors based on certain average errors (MAPE), converted them into HRV (RMSSD/SDNN), and evaluated the errors of the converted HRV.Table 1 shows that a small 3% MAPE in RR estimations amplifies to 41.55%/23.87%error for RMSSD/SDNN.3. Third, for the CNN-only method, although it can remove most of the noises in theory, the small ISPC dataset does not provide enough data for the CNN model to learn the noise removal completely.The ISPC dataset only has PPG signals from five minutes of monitoring, whereas a single HRV reading requires half to five minutes [1].This small dataset further contributes to the CNN model's high errors for both RMSSD and SDNN estimations in Figure 2. 4. Fourth, the model used in the CNN-only method could be too small and not sophisticated enough to process noisy signals.Table 2 shows that this CNN model is only 195KB (42657 trainable parameters).As shown later, inferring HRV with raw noisy PPG signal would require complex neural network models of tens or hundreds of MBs, as the noise removal computation is typically nonlinear and non-polynomial.Large models, however, are unsuitable for small wearable devices with only hundreds of KBs of on-chip memory.
Based on the above observation, our hypothesis is that a compound and direct method that combines signal processing and ML, and directly estimates RSMSSD/SDNN, can achieve both high accuracy and small ML model size.To verify this hypothesis, we trained MLP-based models to directly estimate RMSSD and SDNN using the signal-processed ISPC's PPG data.The accuracy and model size of our method are also given in Figure 2 and Table 2 under the label "Sig-proc+MLP".As Figure 2 shows, our method had lower error than both the signal-processing-only and ML-only methods.Table 2 also shows that this compound and direct method had small model sizes of less than 50KB because many signal noises are already treated by signal processing.
For the sake of comparison completeness, we also trained MLP models using the original PPG signals from the ISPC dataset to directly estimate RMSSD/SDNN.That is, we also compared an ML-only method with MLP models.These MLP models went through hyperparameter tuning, and the errors of the most-accurate MLP models are reported in Figure 2 under the label "MLP-only", which shows that this MLP-only method has lower accuracy than our compound and direct method, mainly due to the small dataset.Moreover, as Table 2 shows, the sizes of the MLP-only models are much larger than our compound models, due to the need of relying on pure neural networks to remove all noises.
In summary, the above results show that our hypothesis is likely to be valid, although more data are required to further validate this hypothesis.In the rest of this paper, we will present the details of our methodology for data collection and HRV estimation.

Compound and Direct HRV Estimation
In this section, we present our signal processing and machine learning combined method for direct HRV estimation.

Data Collection
A single HRV estimation typically requires 30 seconds to 5 minutes of PPG/ECG monitoring [1].Moreover, typically, a sequence of five minutes to 24 hours of HRV data is used in medical practice [40].Therefore, studying HRV estimation requires hours of PPG/ECG monitoring data to provide enough data points.As there lack of such public datasets, our first task in this research was to collect longer traces of PPG and ECG data.
One subject participated in this data collection.The subject had a PPG sensor attached to the fingertip and an ECG monitor attached to the chests at the same time.The PPG data were collected as features/inputs for the HRV inference, whereas the ECG data were used as the labels for model training and as the groundtruth in model testing/evaluation.To reduce the energy consumption caused by the PPG sensors, we collected PPG readings at a frequency of 25Hz, rather than the 125Hz used by the ISPC dataset and other studies [4,10,27,44].
The subject conducted three activities during the data collection, including sitting, sleeping, and office work.The office work activity includes actions such as working in front of a computer, walking, and drinking water.For each activity, more than two hours of PPG/ECG data were collected, which gave about 180193 to 180271 PPG readings per activity.

Workflow for HRV Inference
Figure 3 gives the overall workflow of our HRV inference methodology.This workflow includes three major steps, PPG data collection, signal processing, and machine learningbased HRV inference.The following paragraphs provide a detailed description of each step.

3.2.1
Step 1: PPG Monitoring.As our HRV inference is based on PPG sensors, the first step is to collect the PPG light signals reflected from the blood flow.Here, we employed a sampling rate of 25Hz, i.e., 25 signals are collected every second.This sampling rate is lower than the 125Hz used by many prior studies [4,10,27,44], and it is employed to reduce the power consumption of sensing, as prior work [6]

3.2.2
Step 2: Signal Processing.The primary goal of our signal processing is to remove the noises due to motion artifacts with outlier adjustment and data smoothing.The secondary goal of our signal processing is to generate HR estimations to be used later as features for ML models.
Step 2.1: Signal Processing to Generate HR Estimations.In this step, we convert the PPG light signals collected in Step 1 into four heart rate estimations per second using signal processing.This processing essentially applied a peak detection algorithm to the PPG signals.These peaks can be viewed as "heartbeats", and hence, their counts can be used to estimate HR [2].After this conversion, there are 4 HR estimations over  seconds, denoted as { 1 ,  2 ,  3 , . . .,  4 }.
This conversion serves two purposes.First, this conversion simplifies the noise removal (i.e., outlier adjustment and smoothing) conducted later.Motion artifacts typically affect a sequence of PPG light signals, and it can be challenging to distinguish erratic signals from real HR fluctuations when working on raw signals.Converting to HR estimations reduces the number of data points, making it easier for noise removal.Second, these converted HRs are also used as the features of our ML models in Step 3.
Step 2.2: Z-score Based Outlier Adjustment.Many noisy signals can be simply viewed as outliers.Therefore, outlier identification algorithms can be used to remove these signal noises.More specially, we applied the popular Z-score-based outlier filter algorithm [23,33].
The Z-score filter identifies the outliers by picking out the data points that deviate the most from the mean.Concretely, consider a large sequence of HR estimations, { 1 ,  2 , . . .}, with mean  and standard deviation .If the difference between a data point   and  ( i.e., |  − |) is larger than a threshold, then   can be viewed as an outlier.This threshold is typically defined based on the standard deviation .Particularly, the threshold is defined as _ × .In this work, we used a z_score of 3, following common practices [36,42].
That is, if the difference between HR estimation   and the mean  is larger than 3, then   is deemed as an outlier.
We do not remove outliers because HR estimations are time series, and removing estimations would create "holes" within the time series.Consequently, after the outliers are identified, their values are just adjusted to be the average of their neighboring estimations.That is, given an outlier   , its value is adjusted to be  −1 + +1 2 .
Step 2.3: Data Smoothing.The above outlier adjustment only identifies data points with large noises, i.e., data points deviate greatly from the overall mean.However, there could still be HR estimations that deviate significantly from local HR averages.These deviating HR estimations usually manifest themselves as abnormal local peaks/valleys.These local peaks/valleys are usually caused by signal noises, because normally, a person's heart rate does not increase (or drop) abruptly and drops (or increases) back within one second.
To remove these local noises, we apply moving average data smoothing.More specifically, this data smoothing converts the four rough HRs within a second into one HR estimation per second by computing their averages to reduce the impact of the abnormal local peaks and valleys.After the smoothing, there are  smoothed HR estimations over  seconds, denoted by { 1 ,  2 ,  3 , . . .  }.

3.2.3
Step 3: ML-based HRV Inference.The last step of our methodology employs ML to infer HRV.As stated previously, the ML models are trained to serve two purposes simultaneously.First, they are trained to further remove/reduce the noises that cannot be filtered by signal processing.These additional noises may include long motion artifacts that affect several seconds of PPG signals, the sensor bias (e.g., sensor sensitivity and calibration issues), and the errors due to our low sampling frequency.Second, with further reduced noises, these ML models estimates the final HRV.These models directly infer SDNN/RMSSD, instead of the RR intervals.
The main input features to the ML models are the smoothed HR estimations over  seconds from Step 2, i.e., the vector { 1 ,  2 ,  3 , . . .  }.These smoothed HR estimations are also used to derive a rough HRV (SDNN or RMSSD) estimation, denoted by  , using the Equations ( 1) and ( 2).This rough HRV is used as a constructed feature for our ML models.As shown with Equations ( 1) and ( 2), computing HRV requires exponentiation and square root operations, which may require large ML models to simulate.Therefore, employing a computed rough HRV estimation as a feature can potentially reduce the trained model size.
In summary, the features of our ML models are the vector, { 1 ,  2 ,  3 , . . .  ,  }.The output of our models is the HRV estimation for the past  seconds in SDNN or RMSSD.Note that,  represents the number of seconds of PPG monitoring, which is a tuneable parameter depending on the HRV use case.In the experimental evaluation (Section 4), we evaluated different values for .

Model Training
During our PPG data collection (described in Section 3.1), the subject also had an ECG attached to collect groundtruth HRV readings.The groundtruth HRVs are used both as the labels in the training data and validation/testing data.
We partitioned the collected data into training and testing datasets with a split of 80% and 20%.All models were also optimized with hyperparameter tuning [5] to find the model with the best accuracy.The specific hyperparameters for each type of model are given in Section 4.1.2.Note that, we limited hyperparameter search space to avoid generating models that are too large for small embedded devices.

Experimental Evaluation
This section presents the evaluation results, focusing on the HRV inference accuracy, ML model sizes, and inference time.Our dataset contains one channel of PPG data.The subject simultaneously wore the PPG sensor on the fingertip and ECG electrodes.The PPG sensor is connected to a Raspberry Pi 4B through an i2C bus to record the data.The hardware components are: 1) a Raspberry Pi 4B with 4 GB RAM; 2) a Maxim MAXREFDES117 HR monitor with a MAX30102 PPG sensor; and 3) a TLC5007 Dynamic ECG.PPG data and ECG data are synchronized according to their timestamps.The Raspberry Pi is also used to measure the inference latency of the signal processing and ML models.4.1.2ML Models and Hyperparamter Configurations.DT, RF, KNN, and SVM models were implemented using Scikit-learn [26] and MLP was implemented using Keras.ML models were saved with .We used the random search function ℎ in Scikit-learn and ℎ in Keras-tuner to tune the model hyperparameters.The hyperparameters are: 1) For DT, the maximum depth of the tree ranges from 3 to 20; 2) For RF, the number of trees is between 2 to 128, and the maximum tree depth is between 3 to 20; 3) For KNN, the number of neighbors is between 2 to 30, and the distance can be Manhattan or Euclidean; 4) For SVM, the kernel may be among RBF, sigmoid and polynomial, and regularization (i.e., ) may be between 0.00001 to 10; 5) For MLP, there are 1 to 5 hidden layers, each with 1 to 100 neurons, and the activation function can be  or ℎ.4.1.3Metrics.For HRV estimation accuracy evaluation, we used the metric MAPE (Mean Absolute Percentage Error) between the HRV estimations and groundtruths.Given  HRV estimations, the definition of MAPE is, For model size evaluation, we report the sizes in KB (kilobytes) or MB (megabytes).For inference time evaluation, we report the average time it takes to make an inference in microseconds (s) or milliseconds (ms).

Accuracy Evaluation
Figure 4 and Figure 5 give the MAPE of our compound and direct methods, using different types of ML algorithms at different monitoring lengths (i.e., the  seconds in Section 3.2.3).The figures also give the errors of the signal-processing-only method ("Sig-proc Only").The following paragraphs discuss these accuracy results in detail.5 show that our compound and direct prediction methods usually had errors between 3.5% to 25.7%.The highest error was 25.7%, which was for the KNN model for SDNN estimation under the office work scenario with 30 seconds monitoring length (Figure 5a).The lowest error was only 3.5%, for the MLP model for RMSSD inference under sit scenario with 300 seconds monitoring length (Figure 4c).

Overall Accuracy of Our Method.
For the majority of the models, the errors of our method were less than 20%.Figure 6 also shows that the overall average MAPEs for different ML models (overall activities for both RMSSD and SDNN) are all less than 13.2%.MLP models have the lowest overall average MAPEs of only 9.1%.These results show that our compound and direct method has high accuracy for HRV estimation.

Comparison with Signal Processing
Only. Figure 4 and Figure 5 also show that our method was usually more accurate than the signal-processing-only method ("Sig-proc Only").In the case of RMSSD estimation (Figure 4), the signal-processing-only method usually had errors above 20%, whereas our method's errors are usually less than 20%.
In the SDNN estimations (Figure 5), the signal-processingonly method had lower errors than their RMSSD estimations, because SDNN measures long-term HRV and is less sensitive to signal noises.We had a similar observation for the ISPC dataset as discussed in Section 2.3.Nonetheless, our method usually still had lower errors than the signal-processing-only method for SDNN estimations.
The only exceptions were for SDNN estimations in sit scenario with 240/300 seconds monitoring lengths (Figure 5c, where our method with some ML models (e.g., SVM) had  higher errors than the signal-processing-only method.However, the error difference was small -only 2.2% at most.Moreover, our method with the MLP model was still more accurate than signal-processing-only in these cases.
Overall, Figure 6 shows that the signal-processing-only method has an overall average MAPE of 25%, which is higher than the overall average MAPEs (about 12%) of our compound and direct method using any ML model.

4.2.3
Traces of HRV Estimation. Figure 7 and Figure 8 give the traces of the RMSSD HRVs from the groundtruth (ECG), signal-processing-only method, and our method with MLP model for the sleep activity data.These traces illustrate that our MLP models reduce at least two types of noises after signal processing.The first type of noise is mostly errors due to either sensor bias or low sampling frequency.For example, for the 5600'th to 5750'th HRVs in Figure 7, signalprocessing HRVs had similar fluctuation trends as the ECG HRVs, but they deviate by roughly 10.The MLP model, however, was trained to correct this "deviation" and produced more accurate HRVs.This same issue can also be observed for the SDNN estimations in Figure 8.The second type of noise is usually from motion artifact noises that affect a longer period.For example, for the 5900'th to 6200'th HRVs in Figure 7, signal-processing HRVs were flat then dropped, whereas the ECG HRVs were sharply increasing.These longer errors were also detected and corrected by our MLP model to produce HRVs increasing from 38 to 50, similar to the ECG.The same conclusion can be drawn from the traces for other models, activities, and monitoring lengths.However, due to space limitations, these traces are omitted.

Impact of Type of Activity and HRV Metric.
Across the results for the three activities in Figure 4 and Figure 5, HRV estimations for sit had the lowest MAPEs.These low errors were because when the subjects sat, they had little movement, hence, lower motion artifacts.In this work, we built one ML model for each activity.However, if only one ML model is built to cover all activities, then these MAPE differences suggest that more sensors/features (e.g., accelerometer or gyroscope) may be needed to conduct noise correction differently for different activities.
Moreover, Figure 6 also show that our method's average MAPEs only differ slightly for RMSSD and SDNN estimations, indicating that our method works for both short-term or long-term HRV monitoring.However, the signal-processingonly method had considerably lower errors for SDNN (longterm) estimations than RMSSD (short-term) estimations.4.2.5 Accuracy Impact of Model Types. Figure 6 shows that MLP models were generally more accurate than other ML models in our method, with average MAPEs for RMSSD, SDNN, and "overall" being only 6.9%, 11.3%, and 9.1%, respectively.Nonetheless, the differences in average MAPEs among model types in Figure 6 are less than 4%.These similar MAPEs match the recently proposed "Rashomon" theory [32], which states that there could be multiple ML models having similar accuracy for the same dataset.A group of models with similar accuracy implies that it is beneficial to conduct ML model exploration to search for models that fit certain non-functional requirements, such as low model sizes and fast inference time in the embedded applications.

4.2.6
Accuracy Impact of Monitoring Length.Figure 4 and Figure 5 also show that another factor that impacts the accuracy of our method is the monitoring length .That is, the longer the monitoring length, the lower the HRV estimation error.And the HRV estimations for the monitoring length at 300 seconds usually had the lowest errors.An HRV estimation for longer monitoring is usually less susceptible to a few noisy PPG signals, and therefore, it tends to have lower errors.
Note that, the required HRV monitoring length depends on the use case [40].For example, 300-second monitoring is applied for caring for chronic renal failure and diabetes [1].The fact that errors vary with monitoring lengths also suggests the applicability of PPG-based HRV monitoring to medical use needs to be evaluated case by case.

4.2.7
Comparison with the ML-only Method.We also experimented with the ML-only method to directly infer HRV using the original PPG data collected from the sensors.However, we were not able to obtain any ML-only HRV models with good accuracy, or even meaningful estimations.These models typically have accuracy similar to, sometimes even     worse than, the signal-processing-only method.For the case of MLP, the HRV estimations produced by each MLP model are mostly the same value, making the estimations practically useless.These MLP models also have average sizes of 6.7MB and maximum size of 26.6MB, larger than the MLP models in our compound method.We believe the main cause of the low accuracy was the small hyperparameter search space.Recall that we limited the hyperparameter search space to limit the ML model sizes, which also limits the model complexity.However, it typically needs very complex ML models to infer HRV using only the original PPG signals.Because of the poor results of the ML-only HRV models, we did not include them in the paper.

Model Size Evaluation
Figure 9 and Figure 10 give the sizes of the ML models generated by our direct and compound HRV inference method.As both figures show, the ML model sizes ranged from 2.8KB to 12.4MB.The figures also show that the model sizes are less affected by the type of activities and the HRV metric (RMSSD or SDNN).Monitoring lengths have a higher impact on the model size -the longer the monitoring length, the more input features, and hence, larger models.
The largest factor for model size is the type of model.RF, KNN, and SVM models are usually large and approaching 10MB.As the input HRs to the ML models are generated from PPG signals, they are usually "messy" and lack a clear "pattern."Hence, RF/KNN/SVM requires large numbers of internal parameters to learn their "patterns."The MLP models, however, are smaller -most MLP models are below 200KB, with the smallest model being 63.4KB (Figure 9c  Considering DT and MLP models can fit in on-chip memory of hundreds of KB, they are better candidates for deploying to tiny embedded devices.

Inference Time Evaluation
Figure 11 and Figure 12 give the inference time of our compound and direct method.As both figures show, the inference time for all models was less than 7.3ms (max 7295s).DT    models are the fastest models due to their small sizes -the inference time for DT models is between only 2.9s to 10s.The inference time of RF models is also fast, ranging between 17.6s to 57.2s.The SVM models are the slowest due to their large sizes, and their inference time ranges from 0.7ms to 7.3ms.Nonetheless, even 7.3ms is fast enough for HRV inference, indicating that all models under our methodology are fast enough for HRV monitoring with embedded devices.The signal processing takes 65.74ms on average and 68.31ms at maximum, which is significantly slower than ML model inference. 1Nonetheless, the signal processing is still fast enough for HRV monitoring.Note that, the total computation time for our method includes both the signal processing time and ML inference time.

Related Work
Although there are many existing works on HR monitoring [4,7,10,25,43,45], only a few studied HRV monitoring. 1 This 65.74ms is also the average processing time for the signal only method.

PPG HRV Monitoring with Signal Processing
Ghamari et al. proposed a signal processing algorithm including High-Pass and low-Pass Filters to detect R peaks [17].Srinivas et al. proposed a signal processing method including moving average and FFT to measure HRV based on the PPG wave [34].However, both of them did not evaluate the accuracy of HRV estimations.Wang et al. proposed an algorithm to estimate RR intervals from a smartwatch PPG sensor and accelerometer [38].They calculated HRV measurements, such as SDNN and RMSSD.However, no quantitative accuracy was reported.Blake et al. devised an HRV estimation hardware that contains a PPG sensor, an accelerometer, a Bluetooth module, and a battery [8].This study only compared the HRV readings from a chest strap instead of ECG with no quantitative report on accuracy.Janković and Stojanović designed a signal processing algorithm including a low-pass filter, Sum Slope Function, and a peak extraction function to find R peaks [19].They collected data for 8 minutes in the experiment and obtained one HRV for each trace, and compared them with ECG HRV.However, it was unclear which HRV metric they used.Bhowmik et al. proposed an algorithm including wavelet denoising, trend removal, and peak extraction to detect R peaks in PPG signals [6].They found that a 100Hz PPG sampling rate is not suitable for a smartwatch due to high power consumption and chose 25Hz.Inspired by this work, we also choose 25Hz as the PPG sampling rate to save energy.Saadeh et al. employed earlobe-attached PPG and wavelet decomposition and moving average filters to estimate RR intervals [29].
Note that, due to the nature of PPG light signals, the above prior studies all estimated RR intervals first, and then converted into HRV.As discussed previously, inferring HRV through RR intervals can lead to low accuracy due to error amplification.Moreover, as we show in this paper, pure signal processing may not be able to handle all types of PPG noises.

PPG HRV Monitoring with Machine Learning
While there are prior studies that employed ML models to estimate HRV, most of these studies focused on R peak/RR interval estimation, instead of predicting HRV directly.Everson et al. proposed a CNN-based encoder-decoder network to construct ECG waves from PPG waves and evaluated HRV based on the predicted wave [14].They evaluated the model with the small ISPC dataset, which led to only one HRV estimation per recording.Similarly, Chiu et al. designed a CNN-based encoder-decoder with a sequence transformer network to generate ECG waves from PPG waves [11].They evaluated the model with the UQVSD dataset and the BIDMC datasetboth are datasets from barely moving patients with low motion artifacts.Xu et al. classified PPG signals to systolic or diastolic phase using an RNN model with the assistance of an accelerometer [41].They obtained RR intervals based on the classification results and evaluated the RR interval estimations.Wittenberg et al. compared a few CNN and GRU classification models for PPG R peaks detection [39].They classified short PPG waves based on whether the first sample in it is an R peak.Maritsch et al. proposed a CNN model to predict the error of the RMSSD estimations from a smartwatch [22].This work did not involve PPG signals.Alqaraawi et al. explored Bayesian learning to detect the PPG peaks with their collected data [3].However, their monitoring length was only 5 or 8 minutes, and hence, only one HRV was provided, while our data lasted 2 hours.Choudhury et al. used a phone camera to capture signals of a human fingertip and extract RR intervals from the collected data with adaptive neural network (ANN) and SVM [12].Instead of predicting HRV, most of the above studies mainly predicted/detected R peaks or RR intervals.However, as we show in Section 2.3, small errors in RR peak intervals can be magnified into large errors in HRV estimates.Therefore, in this work, we estimated HRV directly (represented by SDNN and RMSSD) from PPG signals.
Commercial wearable devices may also provide HRV estimations, such as Garmin smartwatches [16].However, prior work has shown that smartwatch estimations may have high errors [22].Due to their proprietary nature, we were not able to rigorously evaluate commercial wearable devices.Therefore, we focused on comparing and analyzing existing research studies in our motivation and evaluation sections.

Conclusion
Photoplethysmography (PPG) sensors have been shown to be a good alternative for electrocardiographic (ECG) in Heart Rate Variability (HRV) monitoring.However, to be applied to practical and medical use, PPG HRV inference methods must be carefully designed.Prior work typically employed signal-processing-only or machine-learning-only methods to indirectly infer HRV from PPG signals, leading to low accuracy and large models.In this paper, we presented a compound and direct HRV inference method, which combines signal processing and machine learning to directly infer HRV.Evaluation results show that our method has errors as low as 3.5% with model sizes of a few hundred KBs, suggesting that our method can be applied in small embedded devices and potentially for medical uses.

4. 1
Experiment Setup 4.1.1Hardware used in Data Collection and Inference.

Figure 4
and Figure

Figure 9 .
Figure 9. Sizes of HRV/RMSSD models for different activities.

Figure 10 .
Figure 10.Sizes of HRV/SDNN models for different activities.
at 30s) and the largest model being 468.5KB (Figure 10a at 240s).DT models are even smaller, -most DT models are below 10KB, with the smallest model being 2.8KB (Figure 9c at the 30s) and the largest model is 35.4KB(Figure 9b at 120s).

Figure 11 .
Figure 11.ML model inference time for HRV/RMSSD estimations for different activities.

Figure 12 .
Figure 12.ML model inference time for HRV/SDNN estimations for different activities.

Table 2 .
[14]or MLP model sizes of the CNN-only[14], The workflow of our compound and direct HRV inference.has shown that high sampling frequency incurs high power usage.Low sampling frequency, however, may negatively affect HRV inference accuracy.As discussed later, we rely on signal processing and machine learning to compensate for this negative accuracy impact.Let { 1 ,  2 ,  3 , . . .,  25 } denotes the sequence of the 25 light signals over  seconds.These signals are fed to step 2 for processing.