Temporal Neighborhood based Self-supervised Pre-training Model for Sleep Stages Classification

Human sleep must be carefully monitored due to its impact on health. Typically, time series data for sleep monitoring is multimodal, simultaneous, and continuous. Pre-training models based on self-supervised learning can identify internal data patterns without requiring external labelling. In this paper, we propose temporal neighbourhood-based self-supervised pre-training models for multi-modality sleep signals, including EEG, EOG, and Heart Rate Variability (HRV). Two neighbourhood formation approaches are used to find the neighbourhoods of a given query sleep signal fragment/window, based on 1) the stationarity or trend-stationarity of sleep signals and 2) the feature similarity of sleep signals. Both time-domain and frequency-domain features have been extracted and processed. The proposed models learn latent representations for time series by making binary predictions of whether a fragment/window of the time series is a neighbour of the given query fragment/window. Downstream sleep stage classifiers can incorporate the pre-training models for sleep stage classification. Experiments conducted on the large-scale multi-modality sleep monitoring dataset SHHS show that the proposed approaches outperform other baseline classification models, including CNN and LSTM.


INTRODUCTION
A third of a person's life is spent sleeping, and sleep quality impacts all the body's organs and systems [32]. The value of sleep is seen not only in how well people feel physically but also in how societies and economies grow [4,20,39]. Sleep is a sophisticated physiological architecture consisting of sleep cycles, and each cycle can include five sleep stages [10]. The American Academy of Sleep Medicine (AASM) suggests that sleep architecture involves five sleep stages, covering wakefulness (Wake or W), rapid eye movement (REM) sleep, and non-rapid eye movement (NREM) sleep [26,29]. The NREM cycle has three stages: NREM1 (light sleep), NREM2 (deep sleep), and NREM3 (slow-wave sleep). Sleep is scored in 30-second epochs, and each sleep stage has different physiological implications. For instance, REM sleep is often characterised by rapid eye movements, increased brain activity, and vivid dreams [10,20]. It is thought to play a role in learning, memory consolidation, and emotional regulation [6].
Monitoring the normal and abnormal physiological processes related to sleep is known as "sleep monitoring". It involves simultaneous and ongoing observation. The brain, muscles, eyes, breathing, and heart activity of a person change as they progress through various sleep stages. Devices in labs or hospital environments, as well as daily wearable equipment, can continuously track signals from the body, such as the electroencephalogram (EEG), electrooculography (EOG), and Heart Rate Variability (HRV). The performance of sleep monitoring can be improved by using multimodal signal processing frameworks that combine different modalities of sleep signals [29] to make predictions. It is important to explore various methods to better understand multi-modality sleep signals and accurately classify sleep stages so as to assess the monitored person's sleep patterns and cycles.
Supervised machine learning techniques have been popularly used to classify different sleep stages, for example, approaches based on random forests [14,45] or on support vector machines and decision trees [30]. Deep learning approaches have also been proposed to conduct sleep stage classification, such as CNN [34], LSTM, SeqSleepNet [35], DeepSleepNet [40], and TinySleepNet [41]. Supervised learning approaches require trained sleep technologists or domain experts to annotate each sleep stage. This label annotation process is considered time-consuming, labour-intensive, and expensive. Self-supervised learning can help find the inner patterns of data [13] without needing external labels. It has been widely used as a pre-training model in various areas such as NLP, speech recognition, image analysis, and network analysis to promote downstream prediction tasks [7,17,21,33,44]. However, only a limited number of studies have investigated the best way to apply self-supervised learning to biomedical signals [27,43,48]. How to develop novel self-supervised pre-training models for multi-modality sleep data still needs to be explored.
In this work, we bridge this gap by proposing temporal neighbourhood-based self-supervised pre-training models for sleep time series, including EEG, EOG, and HRV. To be able to process multimodality sleep signals, we first discuss how to extract and process different types of features, such as time-domain and frequency-domain features, for each signal. Then we discuss how to form temporal neighbourhoods to conduct self-supervised learning for sleep time series. The pre-training model is integrated with downstream classifiers to predict sleep stages. The main contributions of this paper are:
• We discussed a feature extraction and processing approach for multi-modality time series (including EEG, EOG, and HRV) for self-supervised pre-training models.

RELATED WORK
Sleep is a complex physiological process, and sleep stages are divided into REM and NREM. NREM comprises three stages (NREM1, NREM2, and NREM3) [1]. The use of multimodal sensing methods can provide an opportunity for a more comprehensive and accurate assessment of an individual's sleep patterns and behaviours.
In clinical sleep research, polysomnography (PSG) is considered the gold standard sleep stage monitoring method. PSG data require independent scoring by multiple trained sleep technologists to control inter-rater variability and are primarily collected in sleep laboratories [26]. The annotation process is expensive and time-consuming.
Automated sleep stage monitoring can be achieved using support vector machines and decision trees on handcrafted features derived from EOG, EEG, and electromyography (EMG) signals [30]. Recent work has demonstrated the feasibility of using deep learning models to automatically classify sleep stages, which can significantly improve performance [34,35,40,41]. In the work of [40,41], the authors evaluated CNN-based models using different single-channel sensing data, such as EOG or EEG, for five-stage sleep classification, demonstrating acceptable performance. The work in [38,49] also demonstrated the discriminative power of HRV features for sleep stage classification. However, these methods perform worse than hierarchical ensemble methods using multimodal data (EEG, EOG, EMG) combined with CNN and LSTM [35].
In addition, supervised learning approaches for sleep stage classification require large amounts of human-annotated data. Self-supervised learning approaches can provide an alternative solution to automated sleep stage classification, as they do not need human-annotated labels. They extract the internal features and patterns of data and have achieved remarkable results in various research fields such as computer vision, video processing, and natural language processing [7,17,21,33,44]. Self-supervised learning can even achieve performance comparable to supervised learning on multiple tasks [23]. However, only a limited number of studies have investigated the best way to apply self-supervised learning to biomedical signals [27,43,48]. In [47], a k-Nearest Neighbour (kNN)-based approach is used with HRV features for sleep/wake classification. However, this research did not explore three-stage or five-stage sleep classification. In [46], a contrastive predictive coding method was investigated for sleep stage classification. This study only achieved moderate performance (Acc=0.70, F1=0.64) on the downstream sleep stage classification, and the method did not account for seasonal/repeatable patterns in the sleep data. Other time series-based self-supervised learning methods, such as Contrastive Predictive Coding (CPC) [33], learn representations by predicting the future in a latent space without reconstructing the entire input. The representation is learned by maximising the mutual information between the original signal and the context vector, preserved using a lower-bound approximation and a contrastive loss. Another similar work was done in temporal contrastive learning [25], where a contrastive loss was used to predict segment IDs for multivariate time series to learn representations. In [15], the authors further employed time-based negative sampling and a triplet loss to learn representations for multivariate time series.
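For intuition, the contrastive objective behind CPC can be sketched in a few lines of pure Python: an illustrative InfoNCE-style loss with dot-product scores, not the implementation used in [33].

```python
import math
import random

def infonce_loss(context, positive, negatives):
    """InfoNCE-style loss: softmax cross-entropy that scores the true
    future window ('positive') against distractors, using dot products."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(context, positive)] + [dot(context, n) for n in negatives]
    m = max(scores)                                   # for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]                          # -log softmax of slot 0

random.seed(0)
ctx = [0.9, 0.1, 0.0]                                 # context vector
pos = [1.0, 0.0, 0.0]                                 # latent of the true future
negs = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
loss = infonce_loss(ctx, pos, negs)
```

Minimising this loss pulls the context vector towards the latent of the true future window and away from the distractors, which is how CPC preserves mutual information between the signal and the context.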
Recently, in [43], the authors proposed a temporal neighbourhood-based self-supervised approach for general time series. This approach uses the Augmented Dickey-Fuller (ADF) test [31] to measure the stationarity or trend-stationarity of a time series to select the neighbourhood of a given time series fragment/window. However, the previously mentioned methods are general approaches. How to apply them to sleep time series data while making use of the characteristics of sleep patterns remains an open research question. In this paper, we propose temporal neighbourhood-based self-supervised pre-training models for multi-modality sleep time series. Two neighbourhood formation approaches have been used to find the neighbourhood of a given query sleep signal fragment/window. They are based on: 1) the stationarity or trend-stationarity of sleep signals; 2) the feature similarity of sleep signals.

TEMPORAL NEIGHBORHOOD BASED SELF-SUPERVISED PRE-TRAINING MODEL
To construct a self-supervised pre-training model for multi-modality sleep time series data, we first discuss how to extract and process various types of features in this section. Then we discuss how to form neighbourhoods based on the stationarity/trend-stationarity and the feature similarities of the different modalities of sleep signals, respectively. For EEG signals, we calculated PSD values for the five standard frequency bands (delta, theta, alpha, beta, and gamma) [42]. The frequency range between 0.05 and 45 Hz is typically chosen for EOG frequency-domain features. EOG can benefit from artefact correction in the same frequency bands as EEG, particularly in the delta and theta bands [16]. Therefore, we process the EOG using the same five frequency bands. In total, five types of frequency-domain features are extracted for each EEG or EOG signal.
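As an illustration, such band-power features can be computed from Welch's PSD estimate. The band boundaries below are common conventions assumed here, since the paper does not list its exact cut-offs.

```python
import numpy as np
from scipy.signal import welch

# Commonly used EEG band boundaries in Hz (a convention assumed here;
# the paper does not specify its exact cut-offs).
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def band_powers(signal, fs=125):
    """Estimate the PSD with Welch's method and sum it within each band."""
    freqs, psd = welch(signal, fs=fs, nperseg=fs * 4)
    return {name: float(psd[(freqs >= lo) & (freqs < hi)].sum())
            for name, (lo, hi) in BANDS.items()}

rng = np.random.default_rng(0)
epoch = rng.standard_normal(30 * 125)   # one 30-second epoch at 125 Hz
feats = band_powers(epoch)              # five band-power features
```

Each 30-second epoch thus yields one power value per band, matching the five frequency-domain features per channel described above.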

Multi-modality Feature Extraction
In summary, we extracted 13 types of time-domain features and five types of frequency-domain features for the EEG signal. For the EOG signal, as there are two channels (left and right), we extracted 13 × 2 = 26 types of time-domain features and 5 × 2 = 10 types of frequency-domain features. For HRV, we used the hrv-analysis Python package [36] on the R-R interval data derived from the R-peak data. It extracted 16 time-domain features, including mean_nni, sdnn, sdsd, nni_50, pnni_50, nni_20, pnni_20, rmssd, median_nni, range_nni, cvsd, cvnni, mean_hr, max_hr, min_hr, and std_hr, and 10 frequency-domain features, including lf, hf, lf_hf_ratio, lfnu, hfnu, total_power, vlf, sd1, sd2, and ratio_sd2_sd1. Besides these, the package can also extract other features (such as geometrical and non-linear features [8]). For HRV, we extracted 30 features in total.
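For concreteness, a few of the listed time-domain features can be computed by hand as follows. This is a hand-rolled sketch; the experiments use the hrv-analysis package itself.

```python
import math

def hrv_time_features(rr_ms):
    """Compute a handful of standard HRV time-domain features from
    R-R intervals given in milliseconds."""
    n = len(rr_ms)
    mean_nni = sum(rr_ms) / n                                   # mean R-R interval
    sdnn = math.sqrt(sum((x - mean_nni) ** 2 for x in rr_ms) / (n - 1))
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]           # successive differences
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    nni_50 = sum(1 for d in diffs if abs(d) > 50)               # diffs exceeding 50 ms
    pnni_50 = 100.0 * nni_50 / len(diffs)
    return {"mean_nni": mean_nni, "sdnn": sdnn, "rmssd": rmssd,
            "nni_50": nni_50, "pnni_50": pnni_50}

rr = [800, 810, 790, 860, 805, 795, 870, 800]   # hypothetical R-R intervals in ms
feats = hrv_time_features(rr)
```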

Feature Extension
The time-series nature of the sleep features needs to be emphasised during learning. A sliding window makes it easier for the model to read the evolution of feature values. To capture the phenomenon that time series values change over time, we process the time series using sliding windows. We set the window size w to 20 (each window contains 20 epochs). The window can thus capture the direction of the features within 10 minutes, based on the fact that the minimum duration of a stage in the sleep structure is roughly 10 minutes. Fig. 1 illustrates the feature extension. Each epoch advances and recedes by ten epochs.
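A minimal sketch of this feature extension for a univariate toy series, assuming positions that fall outside the recording are filled with -1:

```python
def extend_features(series, half=10, pad=-1):
    """Represent each epoch by its own feature value plus those of the
    `half` epochs on each side (21 positions for half=10); positions
    outside the recording are filled with `pad`."""
    extended = []
    T = len(series)
    for t in range(T):
        row = [series[k] if 0 <= k < T else pad
               for k in range(t - half, t + half + 1)]
        extended.append(row)
    return extended

# Toy example: one feature value per epoch for a 30-epoch recording.
ext = extend_features(list(range(30)))
```

In practice each epoch carries a feature vector rather than a scalar, so the same expansion is applied per feature dimension.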
We set the extended feature value to -1 if the window value is empty. After using sliding windows to extend features, the epoch at time t is represented by the features of this epoch (denoted as x_t) together with those of the 10 left-hand epochs (denoted as x_{t-w/2}) and those of the 10 right-hand epochs (denoted as x_{t+w/2}). In total, with feature extension, we use the features of 21 epochs to represent each epoch. The algorithm process is as follows, where t represents each epoch of the entire sleep time series, T is the total number of time epochs of the sleep time series, W_t indicates the window of the sequence selected according to the window size δ, N_t is the neighbourhood sampling window, and N̄_t is the non-neighbourhood window. The multi-feature time series signal is represented as X ∈ R^{T×d} (d is the number of features).
• Select t and window W_t: set the window size to δ, and randomly select an epoch t as the centre. W_t represents all features measured in the interval [t − δ/2, t + δ/2]. The goal of the model is for the encoder to learn the primary representation Z_t of W_t. We continue to choose a different t at random as the centre, i.e., sliding W_t in time, to obtain the trajectory of the underlying state of the signal.
• Select the temporal neighbourhood window (N_t): there are two approaches to select the temporal neighbourhood window.
In the human sleep pattern, each sleep stage lasts roughly 10 minutes, and the signal within the same sleep stage is usually smooth.
For the feature stationarity approach, we select a collection of windows centred on W_t following a Gaussian distribution, where the parameters of the Gaussian distribution are determined by the ADF test. The window with the highest stationarity value is selected as the neighbour window.
For the feature similarity approach, we first define candidate temporal neighbourhood windows as the windows of size δ immediately before W_t (denoted as W_t^-) and after W_t (denoted as W_t^+). We calculate the cosine distance between the features of W_t and W_t^- and that between W_t and W_t^+, respectively. We compare the two values and choose the window with the smaller distance as N_t. Please note that the window sizes for the two approaches are not necessarily the same.
• Select the temporal non-neighbourhood window (N̄_t): for a query window, we randomly select one window among those that are not neighbourhood windows of this query window. However, the organisation of human sleep is cyclical, with the same sleep stage recurring repeatedly, so randomly selecting time series at distant epochs as N̄_t contradicts the characteristics of the sleep structure. We therefore optimise the random selection process: we randomly select a window of size δ and compute the cosine distance between it and W_t. An a priori value θ is used to establish whether or not the window is a non-neighbourhood window. If a neighbourhood window is picked, we reselect until a non-neighbourhood window is chosen.
• Training the encoder for representation learning: for each query window, we use neighbourhood windows as positive samples and non-neighbourhood windows as negative samples to train a model to learn the latent representations of each window of the sleep time series. Since a night of human sleep consists of multiple cycles, the selected non-neighbourhood window may belong to the same sleep stage in a different cycle [2]. We use a positive-unlabelled (PU) learning method [12,28] to attenuate the model learning bias caused by this problem. Moreover, we employ the module of [43] for positive-unlabelled learning. The module has two components: the encoder and the discriminator. The encoder converts the high-dimensional space composed of sleep features into a latent low-dimensional space, where Z_N denotes the encoding of the neighbourhood and Z_N̄ denotes the encoding of the non-neighbourhood. The discriminator, a binary classifier, predicts whether a window is a positive neighbourhood sample or a negative non-neighbourhood sample for a given query window.
Fig. 2 illustrates the self-supervised learning framework for sleep time series. The neighbourhood N_t and the non-neighbourhood N̄_t are first defined for each query sample window W_t. The encoder learns the representation Z_t of the sample windows while the discriminator conducts binary predictions.
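The neighbourhood and non-neighbourhood selection steps of the feature similarity approach can be sketched as follows. Window features are flattened to plain vectors here for simplicity, and `theta` corresponds to the a priori value used in the experiments; this is an illustrative sketch, not the full pipeline.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity), ranging over [0, 2]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def pick_neighbourhood(w_query, w_before, w_after):
    """TNC-sim: choose the adjacent window with the smaller cosine distance."""
    if cosine_distance(w_query, w_before) <= cosine_distance(w_query, w_after):
        return w_before
    return w_after

def is_non_neighbourhood(w_query, w_candidate, theta=0.5):
    """A randomly drawn window counts as a non-neighbour only if its
    cosine distance to the query exceeds the a priori value theta;
    otherwise a new candidate is drawn."""
    return cosine_distance(w_query, w_candidate) > theta

query = [1.0, 0.0]                      # window features flattened to a vector
neighbour = pick_neighbourhood(query, [0.9, 0.1], [0.0, 1.0])
```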

SLEEP STAGE CLASSIFICATION
After the self-supervised pre-training is completed, the weights of the encoder's representation learning are stored. We can use the learned latent representations to conduct downstream tasks such as sleep stage classification. The sleep stage classification task has a 3-class variant (Wake, REM, and NREM) and a 5-class variant (Wake, N1, N2, N3, and REM). The pre-trained representations are fine-tuned throughout the downstream classification training process. For different classification tasks, we only need to fine-tune the pre-trained representations, so the overall training time can be shortened. We used a fully connected neural network as the classifier. Its overall structure consists of two Dropout layers, a BatchNorm1d layer, a linear layer, and a fully connected (FC) layer. We use cross-entropy as the loss function to compute the agreement between the prediction and the target.
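A sketch of such a classifier head in PyTorch; the exact layer ordering, the representation size `repr_dim`, and the hidden width are assumptions, since the text only lists the layer types.

```python
import torch
import torch.nn as nn

def make_classifier(repr_dim=64, hidden=32, n_classes=5):
    # Layer ordering is an assumption; the text lists two Dropout layers,
    # a BatchNorm1d layer, a linear layer, and an FC output layer.
    return nn.Sequential(
        nn.Dropout(0.5),
        nn.Linear(repr_dim, hidden),
        nn.BatchNorm1d(hidden),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(hidden, n_classes),   # logits, paired with cross-entropy
    )

torch.manual_seed(0)
clf = make_classifier()
x = torch.randn(8, 64)                  # a batch of pre-trained representations
logits = clf(x)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 5, (8,)))
```

Switching between the 3-class and 5-class tasks then only requires changing `n_classes` and fine-tuning, which is what keeps the overall training time short.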

EXPERIMENTS

Datasets Description
The Sleep Heart Health Study (SHHS) is a multicenter cohort study conducted by the National Heart, Lung, and Blood Institute to determine the impact of sleep-disordered breathing on cardiovascular and other outcomes [37]. SHHS1 (SHHS Visit 1) includes sleep monitoring data for 6,441 men and women aged 40 years and older, collected between November 1, 1995, and January 31, 1998. SHHS2 (SHHS Visit 2) was obtained from 3,295 participants. We use the data from the first visit because it is recorded at a fixed sampling rate of 125 Hz, whereas the second visit has multiple sampling rates. We randomly selected 500 subjects with EEG, EOG, and HRV signals and extracted the multimodal features of these subjects for the experiments. The labels for sleep stages 3 and 4 in the original dataset both belong to NREM3, so after merging them there are five sleep stage labels (see Table 2).
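Since SHHS1 is recorded at a fixed 125 Hz, R-peak sample indices can be converted into the R-R intervals (in milliseconds) used for HRV feature extraction; the peak indices below are hypothetical.

```python
def rr_intervals_ms(r_peak_samples, fs=125):
    """Convert R-peak sample indices at a fixed sampling rate `fs`
    into R-R intervals in milliseconds."""
    return [(b - a) * 1000.0 / fs
            for a, b in zip(r_peak_samples, r_peak_samples[1:])]

peaks = [0, 100, 205, 305]        # hypothetical R-peak sample indices
rr = rr_intervals_ms(peaks)       # R-R intervals in ms
```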

Experimental Settings
The total size w of the sliding window in the feature extension is set to 20. For the proposed feature similarity-based temporal neighbourhood formation approach, the window size δ is set to 4. The cosine distance ranges between 0 and 2, and the a priori value θ is set to 0.5: a cosine distance smaller than 0.5 indicates that the two windows are neighbourhood windows.

Baseline models.
We compared the prediction accuracy of two self-supervised learning models over multi-modality sleep signals, including EEG, EOG, and HRV. The two approaches are as follows:
• TNC-adf is the approach that considers the stationarity of time series, measured by the Augmented Dickey-Fuller (ADF) test [31], to form temporal neighbourhoods.
• TNC-sim is the proposed approach that considers the feature similarity of time series, measured by the cosine similarity of time series windows, to form temporal neighbourhoods.
Also, the cosine distance can be used as a basis for judgement when constructing non-neighbourhoods.
We also compared the classification performance of the two approaches with other popular state-of-the-art sleep stage classification models. The compared models are:
• SVM (Support Vector Machine) [22] is a standard supervised classification model.
• KNN (K-Nearest Neighbour) [19] is a standard supervised classification model. We set the number of neighbours to 5 and use uniform weights.
• CNN-LSTM (Long Short-Term Memory) [24] is a supervised deep learning model, built on the TinySleepNet proposed by [41].
• CPC (Contrastive Predictive Coding) [33] is an unsupervised learning model that learns representations by predicting the future in the latent space using a powerful autoregressive model.
• Triplet Loss [15] is an unsupervised time series representation learning model that uses a triplet loss to push a subsequence of a time series close to its context and away from a randomly chosen time series.
Table 3 shows that overall, the proposed approach TNC-sim, which considers feature similarity to form temporal neighbourhoods, performed better than TNC-adf, which considers the stationarity of time series. For EEG signals, time-domain features are usually more popular, while frequency-domain features are often ignored by existing approaches. From the table, we can see that after adding frequency-domain features, the classification accuracy improved significantly (e.g., from 0.8337 to 0.9224 on the 3-class task for TNC-sim). The same trend held for the EOG signal. However, the improvement for HRV from adding frequency-domain features is not apparent. We also see that overall, combining EEG and EOG signals achieved better results than combining EEG and HRV, while combining all of EEG, EOG, and HRV achieved the best results. For both approaches, using all features from the combination of all three modalities of signals achieved the best performances.

Comparisons with Baseline Models

We compare the proposed approach TNC-sim with five baseline models. We used all the features of the combination of EEG, EOG, and HRV as input to train all the compared models. The average F1-score and overall accuracy (ACC) of the various models are shown in Table 4. The findings demonstrate that the proposed approach TNC-sim outperforms the other baseline models in both sleep stage classification tasks. Fig. 3 depicts the confusion matrix generated by TNC-sim combining all modality features for the 5-class task. The model is less accurate for NREM1 than for the other four sleep stages.

CONCLUSION AND FUTURE WORK
In this paper, we discussed temporal neighbourhood-based self-supervised pre-training models for multi-modality sleep signals, including EEG, EOG, and HRV. We extracted both time-domain and frequency-domain features for each signal. We also used sliding window techniques to perform the feature extension to better capture the temporal dynamics of sleep signals. We discussed applying the general stationarity-based temporal neighbourhood formation self-supervised pre-training model to multi-modality sleep signals.

Figure 1: Feature extensions based on a sliding window. x_t denotes all features at that epoch, while the six black dots represent features that are not listed. x_{t−w/2} and x_{t+w/2} indicate the features of the epochs w/2 before and after epoch t.

Figure 2: The framework of the temporal neighbourhood-based self-supervised learning model. The neighbourhood and non-neighbourhood windows are selected for each sample window W_t (shown by the brown dashed box). The encoder discovers the spatial distribution of the sampled features from the two windows. Then, these samples are put into a discriminator, together with Z_t, to predict the probability that the windows are in the same neighbourhood.
In the experiments, we use the popularly used evaluation metrics, including Accuracy, Precision, Recall, and F1-score, to evaluate the effectiveness of the proposed temporal neighbourhood-based self-supervised pre-training model for sleep stage classification. We conducted a 3-class task that predicts the Wake, REM, and NREM sleep stages, and a 5-class task that predicts the Wake, N1, N2, N3, and REM sleep stages. In the SHHS dataset, we use the data from Visit 1. The original data is randomly divided by subject into a training set with 85% of the subjects (425) and a test set with 15% (75). In the deep model frameworks, 20% of the training set (85 subjects) is used for validation, and the remaining 340 subjects form the final training set. We use grid search to tune both the supervised and self-supervised deep learning model frameworks.
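The subject-wise split described above can be sketched as follows; the random seed is arbitrary.

```python
import random

def subject_split(subjects, test_frac=0.15, val_frac=0.20, seed=0):
    """Subject-wise split: 15% of subjects for testing, then 20% of the
    remaining training subjects for validation."""
    rng = random.Random(seed)
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = int(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = subject_split(list(range(500)))   # 500 SHHS1 subjects
```

Splitting by subject (rather than by epoch) keeps all epochs of one recording on the same side of the split, avoiding leakage between training and testing.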

Table 1: Total number of features for each epoch of different modality signals.

In this subsection, we discuss how to form neighbourhoods based on the feature stationarity or feature similarity of multi-modality sleep time series. The sleep time signal is periodic; a cycle comprises various sleep stages, and the time series signal within each sleep stage is relatively smooth. In other words, the sleep time signal undergoes local smoothing. Using the stationarity of its features, we can compare the neighbourhood and non-neighbourhood domains of the sleep time signal, and the ADF test can measure the local smoothness of sleep time series signals. On the other hand, as the sleep time signal is periodic, we can use feature similarity comparison functions, such as the cosine or Euclidean distance, to find similar windows to form a temporal neighbourhood for a query window.

Table 3: Comparison Results (Accuracy) of Temporal Neighbourhood Formation Approaches for Multi-modality Sleep Signals. The table shows the comparison results of the two approaches for different modality signals with different features.

Table 4: Comparison Results with State-of-the-art Classification Approaches.