Exploring Structure Incentive Domain Adversarial Learning for Generalizable Sleep Stage Classification

Sleep stage classification is crucial for sleep state monitoring and health interventions. According to the standards prescribed by the American Academy of Sleep Medicine, a sleep episode follows a specific structure comprising five distinctive sleep stages that collectively form a sleep cycle. Typically, this cycle repeats about five times, providing an insightful portrayal of the subject's physiological attributes. Advances in deep learning and domain generalization methods allow automatic and even adaptive sleep stage classification. However, applying models trained on data from seen subjects to data from unseen subjects remains challenging due to significant individual differences among subjects. Motivated by the periodic, category-complete structure of sleep stage classification, we propose a Structure Incentive Domain Adversarial learning (SIDA) method that combines sleep stage classification with domain generalization to enable cross-subject sleep stage classification. SIDA builds an individual domain discriminator for each sleep stage category to decouple the subject dependence differences among categories and to learn domain-invariant features at a fine granularity. Furthermore, SIDA directly connects the label classifier and the domain discriminators to promote the training process. Experiments on three benchmark sleep stage classification datasets demonstrate that the proposed SIDA method outperforms other state-of-the-art sleep stage classification and domain generalization methods and achieves the best cross-subject sleep stage classification results.


INTRODUCTION
Sleep accounts for a significant portion of human life, roughly one-third, and is directly related to one's physical and mental well-being. As a fundamental technique for disease monitoring [13], management, and intervention, sleep stage classification has remarkable practical significance in healthcare [6]. The two principal standards governing sleep stage classification are the Rechtschaffen & Kales (R&K) criteria [49] and the American Academy of Sleep Medicine (AASM) criteria [4]. Based on these widely accepted international standards, sleep monitoring is indispensable in many healthcare areas. Notably, brain disorders such as aphasia, epilepsy, and Parkinson's disease exhibit intricate and close associations with sleep disorders, prompting extensive research into the application of sleep monitoring in the intervention of brain disorders [43]. Christensen et al. [11] employed electroencephalography (EEG) monitoring equipment and data-driven analytical methods to reveal sleep characteristics in patients with insomnia. Coelli et al. [12] conducted benchmark research on sleep monitoring in epileptic patients, using a multiscale functional clustering approach to survey epileptic networks across sleep stages. In Parkinson's disease, sleep disorders are the most frequent non-motor symptoms, and monitoring sleep quality offers an effective way to anticipate disease onset and track its progression [27].
The conventional method of sleep stage classification requires professional medical experts to manually analyze the polysomnography (PSG) signals of subjects [51]. This approach is time-consuming, inefficient, and labour-intensive. Moreover, its results are subjective and easily influenced by the expertise and experience of the analysts [26]. The development of artificial intelligence has led to the emergence of automatic sleep stage classification approaches that significantly improve accuracy and efficiency [26]. Typically, these methods extract time-frequency transformation features from the raw PSG signal and employ machine learning methods such as Random Forest [38], Support Vector Machine (SVM) [1], and K-Nearest Neighbor [52] to build the final classification model. However, these methods require significant prior knowledge for feature extraction and processing. The emergence of deep learning has brought many advances in the accuracy and efficiency of sleep stage classification. Deep learning-based sleep stage classification methods employ end-to-end neural networks for feature extraction and model construction. Convolutional neural networks (CNN) have been employed to extract spatial sleep features from the PSG signal [40, 48]. Goshtasbi et al. proposed a fully convolutional neural network called SleepFCN [18], which utilizes residual dilated causal convolutions to capture temporal context information, enhancing both the accuracy and speed of recognition. Recurrent neural networks (RNN) have also been used to extract temporal sleep features from the PSG signal [8, 41, 54]. Furthermore, Long Short-Term Memory (LSTM) networks [15, 40] have been utilized to address forgetting over long time-series signals. Zhao et al.
proposed SleepContextNet [57], which utilizes a CNN-LSTM model structure combined with data augmentation techniques, significantly improving classification accuracy. Wang et al. [45] proposed a novel multi-scale attention mechanism incorporating channel and spatial attention, resulting in exceptional classification accuracy. Phan et al. proposed SeqSleepNet [33], which addresses sleep stage classification as a sequence-to-sequence classification problem.
To achieve interpretability at the epoch and sequence levels and to improve accuracy, they further developed SleepTransformer [34], the first transformer-based sleep stage classification model, which achieved state-of-the-art performance. To address heterogeneity among physiological signals, Zhu et al. proposed MaskSleepNet [59]. This model learns the joint distribution of masked and non-masked modalities by leveraging partially masked signals; it also uses multi-scale convolution and multi-head attention to extract features and make predictions at sub-scales, respectively. In addition, researchers have utilized sparse autoencoders to categorize pre-extracted time-frequency features [44], and generative adversarial network models have been used to generate EEG and electrocardiography (ECG) signals to improve related classification tasks [17].
However, the abovementioned models are better suited to extracting features from grid or image data. They do not exploit the functional connectivity of brain structures in the PSG signal. Furthermore, the brain's cerebral cortex forms a non-Euclidean space, making a graph structure well suited for representing the feature distribution of brain space. Correspondingly, graph convolutional networks (GCN) have been widely employed and work well on graph-structured data [58]. Although existing studies have achieved acceptable sleep stage classification accuracy [21, 23, 28], these approaches have not addressed a central challenge of PSG-based sleep stage classification: it depends on a combination of multiple physiological signals, including EEG, ECG, electrooculography (EOG), and electromyography (EMG) signals, which vary significantly across subjects [10]. For instance, the EEG signal can be affected by a subject's electrode drift and hair, while the EMG signal can be affected by muscle fatigue, skin resistance, and muscle strength [56]. This subject dependence limits the adaptability of sleep stage classification models, as models trained on certain subjects cannot be applied to new subjects. However, most existing methods only modify the feature extractor based on graph models without focusing on improving subject independence. Furthermore, obtaining and labeling sleep stage classification data is complex and requires professional medical expertise [39], making it impractical to train a new model for each new subject with their own data.
Fortunately, the development of transfer learning has provided hope for achieving subject-independent sleep stage classification [30, 60]. Researchers have begun to focus on improving model generalization. Jia et al. proposed the MSTGCN model [22], which integrates domain generalization [5, 46] and spatio-temporal GCN, using the domain adversarial (DA) method to improve the model's robustness across subjects. Tang et al. [42] employed the Maximum Mean Discrepancy (MMD) [19] method to reduce the distribution difference between the training and testing ECG data. Most other transfer learning-based sleep stage classification methods utilize the pre-training and fine-tuning paradigm to enhance prediction accuracy [2]. However, this paradigm has many limitations due to its need for target data. Moreover, these methods ignore the structural characteristics of the sleep stage classification problem, resulting in limited improvement. To tackle the aforementioned challenges, we fuse the sleep stage classification problem with domain generalization [31], culminating in the proposal of a Structure Incentive Domain Adversarial learning (SIDA) method that augments the subject generalization of the sleep stage classification model. As shown in Figure 1, the inspiration for SIDA came from the structure of the sleep cycle. During an entire sleep episode, there are typically five complete sleep cycles [16], each consisting of five stages from the Wakefulness (Wake) stage to the Rapid Eye Movement (REM) stage and back [7]. The sleep stage categories themselves are limited to five distinct stages, and each stage may exhibit unique subject dependencies. We generalize the problems caused by this structure as the concept of Subject Dependency Differences of different sleep Categories (SDDC). More specifically, in contrast to traditional domain generalization models, SIDA establishes distinct domain (i.e., subject)
discriminators for every sleep stage to dissociate the subject dependence differences among the various sleep stages. This strategy enables the model to precisely learn subject- (or domain-) invariant features. Moreover, we bridge the sleep stage classifier and the domain discriminators in SIDA with direct connections, which positively influences the training process. To our knowledge, this study marks the first effort to precisely define the SDDC notion. Leveraging the category structure of PSG-based sleep stage classification, we introduce the SIDA method to attain optimal cross-subject sleep stage classification. Notably, we use leave-one-subject-out cross-validation to rigorously validate our method: we train the classification model on data from seen subjects and test the trained model on data from an unseen subject. Furthermore, we validate and select the final model on separate validation data randomly drawn from the training data. We evaluate the effectiveness of the proposed SIDA method on three benchmark sleep stage classification datasets (i.e., ISRUC-S1 [24], ISRUC-S3 [24], and Sleep Heart Health Study Visit 1 (SHHS1) [36, 53]). The experimental results indicate that SIDA outperforms the compared methods and delivers the best cross-subject sleep stage classification results. In conclusion, the primary contributions of this study can be summarized as follows:
-We clearly define the SDDC concept and open up the idea of handling the challenge of category-level subject dependence from the perspective of transfer learning.
-We propose the SIDA method, a domain generalization method that realizes category-by-category subject dependency alignment and achieves direct soft weighting between the classifier and the discriminators.
-Our proposed SIDA method is plug-and-play and can easily be combined with existing methods. Extensive experiments on three public sleep stage classification datasets demonstrate that the results of existing sleep stage classification methods are improved by combining them with our SIDA method.

RELATED WORK
The research in this article mainly relates to sleep stage classification and domain generalization; therefore, this section reviews these two areas and their intersection. The internationally accepted method for sleep stage classification relies on multimodal time-series physiological signals, known as the PSG signal, which are collected simultaneously using various sensors attached to different parts of the subject, such as the brain, heart, or legs. However, traditional analysis of physiological signals depends heavily on extracting statistical and spatial features. Although spectrum analysis is interpretable, it requires solid prior knowledge, and its actual classification performance is unsatisfactory. To address this issue, Hassan et al. [20] proposed a tunable-Q wavelet transform to analyze the spectral features of the EEG signal, followed by bootstrap aggregating for classification. Their method achieved state-of-the-art EEG-based sleep stage classification performance on the benchmark Sleep-EDF and DREAMS Subjects databases, and performs equally well under both the R&K and AASM sleep scoring standards. Researchers have also integrated machine learning techniques to improve the effectiveness of sleep stage classification. For instance, Rahman et al. [37] utilized the Discrete Wavelet Transform to extract and analyze the spectral characteristics of the EOG signal and used Random Forest and SVM as the sleep stage classification models. They evaluated their approach on three publicly available databases, Sleep-EDF, Sleep-EDFx, and ISRUC-Sleep, and demonstrated that it outperforms state-of-the-art EOG-based techniques in accuracy. Similarly, Alickovic et al.
[1] proposed a Rotational Support Vector Machine for sleep stage classification. Beyond the traditional SVM, they integrated three components, multiscale principal component analysis, discrete wavelet transform, and a rotational support vector machine, to enhance the accuracy of EEG-based sleep stage classification. Their approach achieved sensitivity and accuracy values of 84.46% and 91.1%, respectively, across all subjects on the open-source Sleep-EDFx dataset.

Sleep Stage Classification
In recent years, numerous researchers have utilized various simple neural networks, including CNN, RNN, and LSTM, for sleep stage classification. Notably, the Time Distributed Multivariate Network, introduced by Chambon et al. [9], has become a standard approach for sleep stage classification problems. This network aggregates the previous d epochs, the subsequent d epochs, and the d-th epoch itself to extract features that identify the sleep stage of the d-th epoch. Additionally, two convolution kernels of different sizes have been employed to extract dual-channel features [40]. Following the proposal of FeatureNet by Jia et al. [22], this model structure has been widely employed. Another example is SleepFCN [18], which involves multi-scale feature extraction and residual dilated causal convolution; this method yields state-of-the-art classification results on the Sleep-EDF dataset of 20 subjects and a sample of 240 subjects from the SHHS1 dataset. Zhao et al. used the CNN-LSTM-based model SleepContextNet [57] to extract long-term and short-term temporal context information and developed a data augmentation method; excellent results were achieved on the Sleep-EDF dataset of 20 subjects, the Sleep-EDFx dataset of 78 subjects, and data from 329 subjects selected from the SHHS1 dataset. In addition to simple CNN networks, more complex models, such as the U-Net model [32] and its variants, are also employed for sleep stage classification. Phan et al. proposed a sequence-to-sequence method called SeqSleepNet [33] that interprets the epoch and sequence levels. Based on this, they developed the first transformer-based sleep stage classification model, SleepTransformer [34], and achieved state-of-the-art performance on the SHHS1 dataset of 5,791 subjects and SleepEDF-78 of 78 subjects. Wang et al.
[45] designed a residual attention layer that includes channel attention and spatial attention and achieved state-of-the-art results on the Sleep-EDF dataset and the Sleep-EDFx dataset with 197 PSG records. Zhu et al. proposed MaskSleepNet [59], which consists of a mask module, a multi-scale convolutional network module, a squeeze-and-excitation module, and a multi-head attention module. It enables simultaneous learning of both masked and non-masked modality information and performs multi-scale feature extraction and prediction. The model achieved outstanding classification performance on Sleep-EDFx, as well as on datasets from the Montreal Archive of Sleep Studies [29] and Huashan Hospital, Fudan University. Nowadays, many sleep stage classification methods rely on GCN due to the similarity between the brain's functional areas and graph structures; these methods make good use of information about the position and function of brain regions. Jia et al. [23] proposed a groundbreaking GCN-based method called GraphSleepNet for sleep stage classification. In this method, each PSG signal channel corresponds to a node in the sleep graph, with a connection between two nodes forming an edge. The features are constructed based on the brain's functional connections, and spatial-temporal graph convolution is used to classify sleep stages. GraphSleepNet is considered a pioneering work in using GCNs for sleep stage classification. Following this, Ji et al. [21] proposed JK-STGCN, a module for aggregating features from different layers. Li et al. [28] developed MVF-SleepNet by adding spectral features from time-frequency (TF) images [50] of the PSG time-series signal; spectral features are extracted with models such as VGG16 and fused with the features extracted by the GCN.

Domain Generalization
Some existing models incorporate transfer learning techniques to enhance cross-subject generalization in sleep stage classification, typically pre-training on sizable datasets and fine-tuning on smaller ones. Nonetheless, this approach has its limitations: it requires labeled target-subject data for fine-tuning, thereby impeding the model's ability to generalize to previously unseen subjects. A meta-learning-based method called MetaSleepLearner [2] has been proposed, which involves pre-training on the Montreal Archive of Sleep Studies dataset and fine-tuning on new samples from the Sleep-EDF, CAP Sleep Database, ISRUC, and UCD datasets; its outcomes in sleep stage classification have been encouraging. Phan et al. [35] have also leveraged Kullback-Leibler divergence regularization to facilitate model generalization. According to their empirical findings on the Sleep-EDF Expanded database, which contains 75 subjects, their method boosts accuracy by 4.5 percentage points relative to the baseline, resulting in a sleep stage classification accuracy of 79.6%. However, a similar pre-training and fine-tuning paradigm is still necessary, requiring substantial data for pre-training and labeled data for fine-tuning.

Moreover, this approach can only enhance the model's generalization to the specific subjects in the fine-tuning data. Gathering data and carrying out extensive computation are resource-intensive, making these requirements unrealistic in real-world application scenarios where the model cannot access future subject data in advance. To improve cross-subject sleep stage classification, Tang et al. [42] used the MMD method to address the inconsistent distribution of ECG signal data between the training and test sets, achieving remarkable results on four datasets including SHHS. Nevertheless, the authors did not investigate how to enhance model generalization when the test set is unavailable, which is the more pragmatic scenario.
Domain generalization can enhance the model's ability to generalize to unseen subjects without sacrificing accuracy when resources are limited. It improves cross-domain generalization using techniques such as domain-invariant representation learning and feature disentanglement. Domain adversarial learning is the most widely used and effective technique in domain generalization; it confuses the model's differentiation between domains by introducing a gradient reversal layer, thus improving cross-domain robustness. The main advantage of domain generalization is that it improves the cross-domain generalization of the model through the method itself rather than relying on additional processes such as fine-tuning. Moreover, after training, it does not require information about the new test set, making it capable of achieving better results on previously unseen data. Therefore, domain generalization is highly suitable for medical scenarios involving unseen subjects. In sleep stage classification, Jia et al.
[22] proposed a novel framework called MSTGCN, which integrates domain generalization and GCN to extract subject-independent sleep features. Their approach employed adversarial domain generalization during training to prevent the model from discerning which source domain the data belonged to, thereby enabling it to learn subject-independent information. While their approach achieved state-of-the-art performance at the time, it did not consider the subject dependence difference of each category, so data from different subjects may be aligned indiscriminately and data from different categories may be incorrectly aligned. Additionally, their use of cross-validation to divide only the training and testing sets, with the model saving the best result on the testing set during training, is less rigorous. A better approach is to randomly select a portion of the training set as a validation set, save the best model on the validation set during training, and then test on the unseen testing set. This ensures complete invisibility of the testing set and a genuine generalization test of the model. It is worth noting that the emergence of different sleep stages is related to the age of the subjects, and it is crucial to take age-related differences into account. To address this issue, Baumert et al. [3] divided their subjects into pediatric, adult, and older adult groups, which is meaningful and provides new insights.
The temporal context of the i-th sleep epoch is defined as $S = (S_{i-d}, \ldots, S_i, \ldots, S_{i+d}) \in \mathbb{R}^{N \times T_n \times T_s}$, where $N$ denotes the number of channels, $T_s$ denotes the time-series length of each epoch, and $T_n = 2d + 1$ denotes the number of neighbouring epochs. The classification model jointly predicts the label of the i-th epoch according to the transition rules between sleep stages [9]. Features of each sleep epoch are pre-extracted by the dual-channel FeatureNet [22], yielding an $N$-channel feature matrix for the i-th epoch.
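The temporal context construction above can be sketched in a few lines of NumPy. The array sizes below are illustrative assumptions, not the dimensions of the paper's datasets:

```python
import numpy as np

# Toy dimensions for illustration (not the paper's dataset sizes):
N, Ts, d = 10, 3000, 2            # channels, samples per epoch, context radius
num_epochs = 8
rng = np.random.default_rng(0)
recording = rng.standard_normal((num_epochs, N, Ts))   # toy PSG epochs

def temporal_context(recording, i, d):
    """Stack the 2d+1 neighbouring epochs around epoch i into an
    (N, Tn, Ts) tensor, clamping indices at the recording boundaries."""
    idx = np.clip(np.arange(i - d, i + d + 1), 0, len(recording) - 1)
    return recording[idx].transpose(1, 0, 2)           # (N, Tn, Ts)

S = temporal_context(recording, i=4, d=d)              # Tn = 2*2 + 1 = 5
```

Clamping at the recording ends is one simple boundary policy; zero-padding is an equally common alternative.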

Sleep Stage Classification Problem
Cross-subject classification is defined as follows. Suppose the training set $D_{tr} = \{(x_j, y_j, d_j)\}_{j=1}^{J}$ is composed of samples from $M - 1$ source domains, where $x_j$ denotes the training sample (i.e., the pre-trained feature), $y_j$ denotes the sleep stage label, $d_j \in \{1, \ldots, M-1\}$ denotes the subject domain label, and $J$ is the total number of samples over the $M - 1$ domains. The test set $D_{te} = \{(x^{te}_j, y^{te}_j, d^{te}_j)\}$ is composed of the data of the $M$-th domain, where $x^{te}_j$ denotes the test sample (i.e., the pre-trained feature), $y^{te}_j$ denotes the sleep stage label, and $d^{te}_j$ denotes the subject domain label.
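The leave-one-subject-out protocol behind this formulation can be sketched as follows; the subject labels and sample counts are toy assumptions:

```python
import numpy as np

M = 5                                        # toy number of subjects (domains)
rng = np.random.default_rng(0)
domains = rng.integers(1, M + 1, size=200)   # subject label per sample

def loso_split(domains, test_subject):
    """Leave-one-subject-out: the M-1 remaining subjects form the source
    domains; the held-out subject is never seen during training."""
    train_idx = np.flatnonzero(domains != test_subject)
    test_idx = np.flatnonzero(domains == test_subject)
    return train_idx, test_idx

train_idx, test_idx = loso_split(domains, test_subject=M)
# Validation data are drawn only from the training subjects:
val_idx = rng.choice(train_idx, size=len(train_idx) // 10, replace=False)
```

Drawing the validation set from the training indices only, as here, keeps the held-out subject completely unseen during model selection.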

Motivation
We aim to enhance the cross-subject robustness of our sleep stage classification model through domain generalization, which eliminates differences between domains (i.e., subjects) through domain alignment. The standard alignment process aligns all data of each domain without distinction. However, the biggest challenge in classification tasks is always the category difference: different sleep stage categories have different subject dependencies. As illustrated in Figure 2, different shapes represent different sleep stage categories, and different colors represent different subject domains. When aligning subject data, if data of the same category are correctly aligned (the green box in the figure, a positive transfer), the model's cross-subject generalization and classification accuracy improve. However, if data from different categories are incorrectly aligned (the red box in the figure, a negative transfer), classification accuracy suffers severely. Inspired by the subject dependency differences among sleep stage categories, we seek to align subjects in a fine-grained way, category by category. Fortunately, the subject-generalizable sleep stage classification problem has a category-complete and recurrent structure with domain supervision. This structure motivates us to propose the category-specific domain adversarial method SIDA.

STRUCTURAL INCENTIVE DOMAIN ADVERSARIAL METHOD
To mitigate the influence of individual differences in physiological signals, we introduce the concept of SDDC and propose the SIDA method, whose overall architecture is depicted in Figure 3. In the subsequent sections, we outline how we employ neural networks to obtain an effective representation of multimodal physiological signals (Section 4.1).

The Effective Representation of Multimodal Physiological Signal
Due to their non-linear and non-stationary characteristics and their multimodal heterogeneity [47], effectively representing multimodal physiological signals is a challenging problem. The difficulty arises from three factors: (1) The PSG signal acquisition adheres to medical norms, such as the international standard 10-20 electrode placement system for the EEG signal [14, 25] and the placement of EMG electrodes at various muscle sites depending on the measurement objective, resulting in complex spatial structures. (2) The PSG signal is temporal, with dependencies along the timeline, but integrating context information is difficult. (3) There are large differences across modalities, yet modal consistency exists, so researchers have been grappling with how to fully exploit both the consistency between modalities and the differences between compatible modalities. Furthermore, there are still variations between different subjects within the same category or modality. Mainstream methods mostly utilize neural networks to efficiently capture the temporal and spatial features of multimodal physiological signals. As in Equation (1), CNN-based methods [55] extract the spatial features of these signals by implementing linear maps through convolution operations with trainable kernels:

$$x^{l+1}_{\beta}(\tau, \mu) = \sigma\left( \sum_{\gamma=1}^{F^{l}} \sum_{p=1}^{\psi^{l}} \sum_{q=1}^{\varphi^{l}} U^{l}_{\beta\gamma}(p, q)\, x^{l}_{\gamma}(\tau + p, \mu + q) + b^{l} \right), \tag{1}$$

where $x^{l+1}_{\beta}(\tau, \mu)$ denotes feature map $\beta$ in layer $l+1$, $\sigma$ is a non-linear function, $F^{l}$ is the number of feature maps in layer $l$, $U^{l}_{\beta\gamma}$ is the kernel convolved over feature map $\gamma$ in layer $l$ to create feature map $\beta$ in layer $l+1$, $\tau$ and $\mu$ are the horizontal and vertical coordinates of the convolution position, $\psi^{l}$ and $\varphi^{l}$ are the length and width of the kernels in layer $l$, and $b^{l}$ is a bias vector. However, the performance of CNN in temporal feature extraction is limited. While RNN can extract features by combining sequence context, it lacks information filtering when processing time-series data and is susceptible to gradient
disappearance and explosion in long sequences. LSTM addresses these issues through gating mechanisms, thereby mitigating catastrophic forgetting over long sequences. The update equations of the LSTM layer are as follows:

$$e_t = \sigma(W_e x_t + U_e h_{t-1} + b_e),$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$c_t = f_t \odot c_{t-1} + e_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$h_t = o_t \odot \tanh(c_t),$$

where $e$, $f$, $o$, and $c$ are the input gate, forget gate, output gate, and cell activation vectors; they all condition the hidden value and are the same size as the vector $h$. The $\sigma$ represents the non-linear sigmoid function.
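As a concrete illustration, one LSTM step following the gate equations above can be written in NumPy; the input/hidden sizes and random weights are toy assumptions, not part of any model in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: e (input), f (forget), o (output) gates and cell
    state c, all the same size as the hidden vector h."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # (4H,) stacked pre-activations
    e = sigmoid(z[0:H])                 # input gate
    f = sigmoid(z[H:2*H])               # forget gate
    o = sigmoid(z[2*H:3*H])             # output gate
    g = np.tanh(z[3*H:4*H])             # candidate cell update
    c = f * c_prev + e * g              # cell state carries long-range memory
    h = o * np.tanh(c)                  # hidden state exposed downstream
    return h, c

rng = np.random.default_rng(0)
D, H = 8, 4                             # toy input / hidden sizes
W = rng.standard_normal((4 * H, D))
U = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                      # run over a short toy sequence
    h, c = lstm_step(rng.standard_normal(D), h, c, W, U, b)
```

The multiplicative forget gate is what lets the cell state retain information across many steps, which is why LSTM copes better with long PSG sequences than a plain RNN.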
Incorporating GCN is a promising approach for constructing graph-based data structures and networks based on modalities and functional locations, allowing for the representation and fusion of multimodal physiological signals. In the graph-based framework, each signal channel is allocated to a node in the sleep graph, while the edges between nodes represent the connections between signal channels. This approach has yielded exceptional results in graph-based sleep stage classification, as demonstrated by MSTGCN [22]. MSTGCN utilizes a multi-view learning strategy that integrates the function connections (FC) and the distance connections (DC) of sleep graphs and incorporates both temporal and spatial features. Inspired by Reference [28], we transform the sleep signal into a TF representation using the short-time Fourier transform (STFT) and fuse the hidden feature representations extracted by CNN and GCN.
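A minimal sketch of how FC and DC sleep-graph adjacencies can be built from one epoch, assuming correlation-based functional connections and Gaussian-decayed distance connections; the electrode coordinates are random placeholders, not the 10-20 layout:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6                                       # toy number of PSG channels (nodes)
epoch = rng.standard_normal((N, 3000))      # one toy epoch, N channel signals

# Function connections (FC): absolute Pearson correlation between channels.
fc = np.abs(np.corrcoef(epoch))
np.fill_diagonal(fc, 0.0)                   # no self-loops

# Distance connections (DC): closer electrodes receive larger edge weights.
coords = rng.standard_normal((N, 3))        # placeholder electrode positions
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
dc = np.exp(-dist ** 2)
np.fill_diagonal(dc, 0.0)
```

Both matrices are symmetric by construction, so they can be consumed directly by the graph convolutions discussed below.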
-Spatial Feature: We extract spatial features of sleep graphs using a spatial attention mechanism and Chebyshev graph convolution. As described in Reference [22], the spatial attention is defined as

$$P = V_p \cdot \sigma\left( (X^{l-1} Z_1) Z_2 (Z_3 X^{l-1})^{\top} + b_p \right),$$

where $X^{l-1}$ is the $l$-th layer's input; $V_p$, $b_p$, $Z_1$, $Z_2$, and $Z_3$ are learnable parameters; and $\sigma$ denotes the sigmoid activation function. $P$ denotes the attention matrix, and $P_{m_1 m_2}$ denotes the correlation between nodes $m_1$ and $m_2$. The softmax operation is utilized to normalize the attention matrix $P$. The Chebyshev graph convolution extracts information from the 0 to $E-1$ order neighbors centered at each node and is defined as

$$g_{\rho} *_{\Omega} x = \sum_{\epsilon=0}^{E-1} \rho_{\epsilon}\, T_{\epsilon}(\hat{\delta})\, x, \qquad \hat{\delta} = \frac{2}{\lambda_{\max}} \delta - I_U,$$

where $g_{\rho}$ denotes the convolution kernel, $*_{\Omega}$ denotes the graph convolution operation, $\lambda_{\max}$ denotes the Laplacian matrix's maximum eigenvalue, and $I_U$ denotes an identity matrix. $T_{\epsilon}$ denotes the Chebyshev polynomials, defined recursively, and $\delta = D - A$ denotes the Laplacian matrix, where $D \in \mathbb{R}^{U \times U}$ denotes the degree matrix. $\rho \in \mathbb{R}^{E}$ denotes a vector of Chebyshev coefficients, and $x$ denotes the input data.
-Temporal Feature: As in Section 3.1, following the largely consistent transition rules between adjacent sleep epochs, we combine the temporal context information of the neighbouring $T_n$ sleep epochs using a temporal attention mechanism and a neural network (two-dimensional convolution [22] or a layer of GRU [28]). As described in Reference [22], the temporal attention is defined as

$$Q = V_q \cdot \sigma\left( ((X^{l-1})^{\top} M_1) M_2 (M_3 X^{l-1}) + b_q \right),$$

where $X^{l-1}$ is the $l$-th layer's input; $V_q$, $b_q$, $M_1$, $M_2$, and $M_3$ are learnable parameters; $Q$ denotes the attention matrix; and $Q_{u,v}$ denotes the strength of correlation between sleep brain networks $G_u$ and $G_v$. The softmax operation is utilized to normalize the attention matrix. As shown in Section 3.1, the temporal graph convolution fuses the temporal context information of the adjacent $T_n$ sleep epochs and is defined as

$$X^{l} = \mathrm{ReLU}\left( \phi * \left( g_{\rho} *_{\Omega} \hat{X}^{l-1} \right) \right),$$

where $\hat{X}^{l-1}$ is the $l$-th layer's input with temporal attention applied, $g_{\rho}$ denotes the graph convolution kernel, ReLU is the non-linear activation function, $\phi$ denotes the parameters of the convolution kernel, and $*$ denotes the convolution operation.
-Spectral Feature: Transforming time-series data into TF images using techniques such as the STFT can effectively help the model capture frequency-related information, allowing it to fully exploit the strengths of CNN in image classification and recognition. In a recent study [28], the ResNet and VGG models were utilized to extract features from TF images, which were then combined with a GRU to integrate the temporal features of multiple sleep epochs; the resulting features were further fused with the features extracted by the GCN, leading to notable performance improvements.
-Multi-view Feature Fusion: In the MSTGCN-based [22] methods, we concatenate the graph features based on the FC and the graph features based on the DC, where each consists of spatial and temporal features.
In other methods based on graph features, we employ only the graph features based on the FC. Spatial-temporal features are utilized in most methods. In particular, MVF-SleepNet [28] includes not only spatial-temporal features but also spectral-temporal features.
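The Chebyshev graph convolution used for the spatial features above can be sketched in a few lines of NumPy. The shapes and the per-order coefficient layout below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def chebyshev_graph_conv(x, adj, coeffs):
    """Chebyshev graph convolution of order E-1.

    x      : (U, F) node features for U graph nodes.
    adj    : (U, U) symmetric adjacency matrix A.
    coeffs : (E, F) one coefficient vector rho_eps per polynomial order
             (illustrative layout; real kernels map F_in -> F_out).
    """
    U = adj.shape[0]
    degree = np.diag(adj.sum(axis=1))           # degree matrix D
    laplacian = degree - adj                    # delta = D - A
    lam_max = np.linalg.eigvalsh(laplacian).max()
    # Rescale the Laplacian to [-1, 1]: delta_tilde = 2*delta/lam_max - I_U
    lap_tilde = 2.0 * laplacian / lam_max - np.eye(U)

    # Chebyshev recursion: T_0 = I, T_1 = delta_tilde,
    # T_eps = 2 * delta_tilde @ T_{eps-1} - T_{eps-2}
    t_prev, t_curr = np.eye(U), lap_tilde
    out = coeffs[0] * (t_prev @ x)
    for eps in range(1, coeffs.shape[0]):
        if eps > 1:
            t_prev, t_curr = t_curr, 2.0 * lap_tilde @ t_curr - t_prev
        out += coeffs[eps] * (t_curr @ x)
    return out
```

With E = 1 this reduces to a per-feature scaling of the input (T_0 is the identity), which is a convenient sanity check on the recursion.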

Structural Incentive Domain Adversarial Learning
To improve the cross-subject generalization of the model while ensuring classification accuracy, traditional domain generalization methods often extract subject-invariant information so that the model learns a general and robust representation. We exploit an adversarial domain generalization method to enhance the generalization of various sleep stage classification models. Suppose the input signal is x_j, the feature extractor is G_f, the label classifier is G_y, and the domain classifier is G_d. Specifically, a Gradient Reversal Layer (GRL) is implemented between the G_f and the G_d to form an adversarial relationship. During training, the model parameters of the G_f are jointly affected by the G_d and the G_y. The purpose of the G_d is to confuse the model's identification of subjects, thereby enhancing the cross-subject generalization of the model. Unlike the traditional transfer learning framework, in which pre-training and fine-tuning are separated, the domain adversarial method integrates classification and domain generalization into a unified, end-to-end framework. Due to the existence of the GRL, the model parameters θ_f of the G_f are learned by minimizing the loss L_y of the label classifier and maximizing the loss L_d of the category-specific domain discriminators. Without loss of generality, the multi-class cross-entropy L_mc is exploited as the basic loss function, where J denotes the number of training samples and y_j and d_j denote the true labels of sleep stage and subject domain, respectively. The network is optimized by minimizing the sum of the two losses, and the total loss of domain generalization is defined as L = L_y − L_d. Unlike traditional domain adversarial methods, which lack direct interaction between G_y and G_d, our approach establishes an interpretable and dynamic connection between the two networks at the category level, resulting in improved performance.
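The GRL described above is the identity in the forward pass and negates (and, commonly, scales by a factor λ — an assumption here, since no scaling factor is stated) the gradient in the backward pass. A minimal NumPy illustration of the resulting update direction for G_f:

```python
import numpy as np

class GradientReversal:
    """Identity forward; multiplies incoming gradients by -lam backward.

    lam is an assumed scaling factor (often lambda in the DANN literature).
    """
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        return features  # G_d sees G_f's features unchanged

    def backward(self, grad_from_discriminator):
        # Sign flip: G_f *maximizes* the discriminator loss L_d,
        # while the gradient from G_y still *minimizes* L_y.
        return -self.lam * grad_from_discriminator

grl = GradientReversal(lam=1.0)
feat_grad_from_ly = np.array([0.2, -0.1])   # dL_y/df reaching G_f
feat_grad_from_ld = np.array([0.5, 0.3])    # dL_d/df reaching the GRL
total_grad = feat_grad_from_ly + grl.backward(feat_grad_from_ld)
# The combined gradient implements minimizing L = L_y - lam * L_d for G_f.
```

The forward pass is a no-op, so inserting the GRL changes nothing about inference; only training dynamics are affected.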
Existing domain adversarial methods in sleep stage classification improve the ability of the model to generalize to different subjects. However, there are considerable differences in the subject dependence of different sleep stage categories. As Section 3.3 shows, aligning the comprehensive data of different subjects introduces substantial interference into the model's judgment of category. Fortunately, sleep stage classification has a periodic category-complete structure. As Figure 3 shows, we exploit this incentive structure by setting category-specific domain discriminators and performing category-by-category fine-grained alignment on different subjects' data. This way, the domain adversarial process does not introduce additional errors into the sleep stage classification process. In addition, using the prediction results of the label classifier to dynamically weight the category-specific domain discriminators also realizes an adaptive and direct correlation between them. The structurally incentive label classifier loss and total loss are consistent with Equations (14) and (16), respectively. The structural incentive category-specific domain discriminators loss is defined as

L_d = (1/J) Σ_(j=1)^J Σ_(r=1)^R α_r ŷ^r_j L_mc(G^r_d(G_f(x_j)), d_j),

where J and R denote the number of training samples and categories, respectively; d_j denotes the domain label of the sample x_j; r denotes the rth domain discriminator; ŷ^r_j denotes the predicted softmax value of the rth category by the label classifier; and α_r = 1/R is set as the weight of the loss of the domain discriminator of category r.
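The category-weighted discriminator loss above can be sketched as follows. The function name, tensor shapes, and the treatment of the softmax weights as given constants are all illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sida_discriminator_loss(class_logits, domain_logits_per_class, domain_labels):
    """Category-weighted domain-discriminator loss L_d (sketch).

    class_logits            : (J, R) label-classifier logits.
    domain_logits_per_class : (R, J, D) logits of the R per-category
                              discriminators over D subject domains.
    domain_labels           : (J,) true subject-domain index d_j.
    """
    J, R = class_logits.shape
    y_hat = softmax(class_logits)        # soft weights \hat{y}^r_j
    alpha = 1.0 / R                      # equal per-category weights alpha_r
    loss = 0.0
    for r in range(R):
        probs = softmax(domain_logits_per_class[r])       # (J, D)
        ce = -np.log(probs[np.arange(J), domain_labels])  # per-sample CE
        loss += alpha * np.mean(y_hat[:, r] * ce)
    return loss
```

Each discriminator's cross-entropy is scaled by how much probability mass the label classifier currently assigns to that category, so samples contribute mainly to the discriminator of their most likely stage.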

The Advantage of Direct Dynamic Bridge
In traditional domain adversarial methods, the relationship between the label classifier G_y and the domain discriminator G_d is coordinated through indirect gradient backpropagation, and the nature of their relationship cannot be easily explained. Therefore, as illustrated in Figure 4, we employ the current prediction softmax value of the G_y as the weight for the features of the category-specific domain discriminator G^r_d, which addresses the issue of subject dependency differences category-by-category. This approach provides two benefits: (1) the fine-grained alignment at the category level avoids the SDDC problem and (2) the original indirect association between the G_y and the G_d is transformed into a dynamic weighting association. As a result, a soft attention weight is created that provides interpretability between the two and depends entirely on the end-to-end learning of the model. The model can adaptively adjust the proportion of features of each G^r_d based on the current prediction. Furthermore, the soft weight is smooth and differentiable, which ensures that each G^r_d receives attention without the weighting becoming too absolute. Even when the G_y is inaccurate, the two will still promote and adjust each other.

Method Implementation
The SIDA method proposed in this article, aimed at achieving generalizable sleep stage classification, is fully elucidated in Algorithm 1. The framework is characterized by the loss L_y of the label classifier and the loss L_d of the category-specific domain discriminators. Initially, the features pre-trained with FeatureNet are input, encompassing the training set D_train and the testing set {x^te_j}. Subsequently, the model's parameters θ_f, θ_y, and θ_d are initialized. Features of x_j are extracted with the feature extractor G_f, the classification result is obtained, and then the classification loss L_y is computed by the fully connected classifier G_y. Based on the softmax value of the current prediction result, each category's prediction result is derived, the features used by each category-specific domain discriminator are weighted, and the weighted loss sum L_d of the category-specific domain discriminators is calculated. The loss L_d after gradient inversion is added to the loss L_y to acquire the total loss L. Gradient backpropagation and updates of all parameters θ_f, θ_y, and θ_d of the model continue until convergence. Finally, the best model for sleep stage classification is obtained.
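The training loop above can be exercised end-to-end on toy data. The sketch below uses linear stand-ins for G_f, G_y, and the per-category G^r_d, analytic softmax cross-entropy gradients, and treats the softmax weights as constants; all sizes and learning rates are illustrative, and this is not the authors' TensorFlow implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy sizes: J samples, R sleep stages, D subject domains (all assumed)
J, F_IN, F_HID, R, D = 64, 8, 6, 5, 4
X = rng.standard_normal((J, F_IN))
y = rng.integers(0, R, J)                        # sleep stage labels y_j
d = rng.integers(0, D, J)                        # subject domain labels d_j
Y, Dm = np.eye(R)[y], np.eye(D)[d]

Wf = 0.1 * rng.standard_normal((F_IN, F_HID))    # G_f (linear stand-in)
Wy = 0.1 * rng.standard_normal((F_HID, R))       # G_y
Wd = 0.1 * rng.standard_normal((R, F_HID, D))    # one G_d^r per category
lr, lam, alpha = 0.3, 0.1, 1.0 / R

def label_loss():
    p = softmax((X @ Wf) @ Wy)
    return -np.mean(np.log(p[np.arange(J), y]))

loss_start = label_loss()
for _ in range(300):
    F = X @ Wf
    p = softmax(F @ Wy)                   # \hat{y}_j, also the soft weights
    dZy = (p - Y) / J                     # softmax-CE gradient for G_y
    dF = dZy @ Wy.T                       # dL_y/dF
    dF_d = np.zeros_like(F)
    for r in range(R):
        q = softmax(F @ Wd[r])
        # Category-weighted CE gradient; weights treated as constants here
        dZd = alpha * (p[:, [r]] * (q - Dm)) / J
        dF_d += dZd @ Wd[r].T
        Wd[r] -= lr * (F.T @ dZd)         # discriminators minimize L_d
    # GRL effect: G_f minimizes L_y while *maximizing* L_d
    Wf -= lr * (X.T @ (dF - lam * dF_d))
    Wy -= lr * (F.T @ dZy)
loss_end = label_loss()
```

The only structural difference from a plain classifier's training loop is the sign flip on the discriminator gradient reaching Wf, which is exactly what the GRL contributes.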

EXPERIMENTS
All experiments are implemented with Python 3.8.0, Nvidia-TensorFlow 1.15.0, and Keras 2.3.1. We conducted them on a server equipped with 960 GB of memory, the Ubuntu 20.04.1 operating system, and four Nvidia A100 GPUs with 80 GB of GPU memory each. Dataset information is shown in Table 1. The PSG recordings were segmented into 30-second epochs and annotated by two experts according to the AASM standards. In detail, (1) the common points of the first two datasets are as follows: Each recording contains six EEG channels (F3-A2, C3-A2, O1-A2, F4-A1, C4-A1, and O2-A1), two EOG channels (LOC-A2 and ROC-A1), three EMG channels (the chin EMG, left leg movements, and right leg movements), and one ECG channel. Considering that the sleep task has little correlation with the EMG signals of the legs, we are consistent with the comparative methods, such as MSTGCN, in removing the EMG channels of the two legs to focus on brain signals, employing the data of 10 channels in total. In addition, signals were resampled at 100 Hz. (2) The ISRUC-S3 subgroup contains 10 healthy adults (nine males and one female, aged from 30 to 58). (3) The ISRUC-S1 subgroup contains 100 adults with sleep disorders (55 males and 45 females, aged from 20 to 85). (4) Following References [18, 57], 329 subjects with regular sleep from the SHHS1 dataset are selected according to the Apnea Hypopnea Index. Six channels (two EEG, two EOG, one ECG, and one EMG) are employed in our experiment. In addition, signals were sampled at 125 Hz. As shown in Section 3.1, the model jointly predicts the features of the intermediate epoch according to the T_n epochs, which better captures the context information.

Dataset and Experiment Settings
As shown in Table 1, the pre-trained features of ISRUC-S3 with context come from 10 subjects, and each subject discards a total of four epochs, so ISRUC-S3 with context has 40 fewer epochs than the original features. Similarly, ISRUC-S1 with context (100 subjects) has 400 fewer epochs than the original features, and SHHS1 with context (329 subjects) has 1,316 fewer epochs than the original features.
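This bookkeeping can be checked directly. A context of T_n epochs leaves the boundary epochs of each subject without a complete neighborhood; T_n = 5 is an assumption here, implied by the four discarded epochs per subject:

```python
def discarded_epochs(num_subjects, context_size):
    """Epochs lost per dataset when each subject drops the
    (context_size - 1) boundary epochs that lack full temporal context."""
    return num_subjects * (context_size - 1)

T_N = 5  # assumed neighbouring-epoch size, implied by 4 dropped epochs/subject
print(discarded_epochs(10, T_N))    # ISRUC-S3:  40
print(discarded_epochs(100, T_N))   # ISRUC-S1:  400
print(discarded_epochs(329, T_N))   # SHHS1:     1316
```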

Parameter Settings.
We compare SIDA with several baselines and with variants that integrate only the traditional domain adversarial method, as described in Tables 3, 4, and 5. We employ the same experimental settings for all models for a fair comparison. We reproduce each comparative method and employ 10-fold cross-validation to divide the training and testing sets; in detail, the ratio of the training set to the testing set is 9:1. Then, randomly selecting 20% of the training set as the validation set, we save the best model validated on the validation set and test it on the completely invisible new-subject testing set. The comparative methods did not use a validation set in their papers, so our experimental results may appear lower than those in the comparative papers. All networks and parameters are consistent with each original method. Detailed hyper-parameters are shown in Table 2, where the parameter neighbouring epoch size means the number of neighbouring temporal epochs to aggregate (i.e., T_n), and the parameter Order of Chebyshev polynomials ϵ is set to five in GraphSleepNet and MSTGCN and to nine in the other methods to remain consistent with the original methods.
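The subject-wise split protocol can be sketched as follows; the function name and the exact shuffling scheme are illustrative, not the authors' code:

```python
import random

def subject_splits(subject_ids, n_folds=10, val_frac=0.2, seed=0):
    """Yield (train, val, test) subject lists for subject-wise k-fold CV.

    Each fold holds out ~1/n_folds of the subjects as an unseen test set;
    val_frac of the remaining training subjects form the validation set,
    so the test subjects are never seen during training or model selection.
    """
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        pool = [s for i, f in enumerate(folds) if i != k for s in f]
        rng.shuffle(pool)
        n_val = int(len(pool) * val_frac)
        yield pool[n_val:], pool[:n_val], test
```

With 100 subjects and 10 folds this gives 72 training, 18 validation, and 10 test subjects per fold, and every subject appears in exactly one test set.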

Sleep Stage Classification Methods.
Features are extracted and fused based on the dual-channel FeatureNet [22]. FeatureNet is an effective baseline method for sleep stage classification and has been commonly employed for the pre-extraction of sleep stage classification features. The MVF-SleepNet+DA method is integrated with the traditional domain adversarial method to extract subject-independent information.
-MVF-SleepNet+SIDA: The FeatureNet model is utilized for feature pre-extraction, and the MVF-SleepNet method is exploited to study graph-based features and classify sleep stages. The MVF-SleepNet+SIDA method is integrated with our SIDA method to extract subject-independent information.
Besides FeatureNet itself, the same pre-extracted features extracted by FeatureNet are employed in all methods. The word DA denotes the traditional domain adversarial method. The word SIDA means adding our SIDA method to the original method. All DA-related and SIDA-related parameters are kept the same as in the original methods. Compared with GraphSleepNet, MSTGCN mainly adds a subject discriminator for the domain adversarial operation. Moreover, according to the distance between the different positions of the electrodes in the brain area, MSTGCN incorporates distance connections to enrich the spatial proximity structural features of the brain. The backbones of the two methods are basically the same, so we mainly compare our SIDA method based on MSTGCN, the upgraded version of GraphSleepNet.

Performance Metrics.
The evaluation measures, including Accuracy (Acc), F1 score (F1), Macro F1, and Kappa, are defined as follows:

Acc = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP), Recall = TP / (TP + FN),
F1 = (2 × Precision × Recall) / (Precision + Recall),
Macro F1 = (1/R) Σ_(r=1)^R F1_r,
Kappa = (p_o − p_e) / (1 − p_e),

where TP refers to the number of samples of the current sleep stage classified correctly, FP refers to the number of samples of other sleep stages wrongly classified as the current sleep stage, FN refers to the number of samples of the current sleep stage wrongly classified as other stages, and TN refers to the number of samples of other sleep stages classified correctly. The F1 score is the harmonic mean of Recall and Precision, r is the sleep stage category, R is the number of sleep stage categories, and Macro F1 is the arithmetic mean of the F1 scores over the sleep stage categories. Moreover, p_o is the relative observed agreement between raters, and p_e is the hypothetical probability of chance agreement.
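These metrics follow directly from an R×R confusion matrix (rows = true stage, columns = predicted stage). A minimal implementation, with names of our own choosing:

```python
import numpy as np

def metrics_from_confusion(cm):
    """Acc, per-class F1, Macro F1, and Cohen's Kappa from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    acc = np.trace(cm) / total

    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp            # predicted as class r but wrong
    fn = cm.sum(axis=1) - tp            # true class r but missed
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    macro_f1 = f1.mean()

    p_o = acc                           # observed agreement
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2  # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return acc, f1, macro_f1, kappa
```

A diagonal matrix yields Acc = Macro F1 = Kappa = 1, while a uniform matrix yields Kappa = 0, since the observed agreement then equals chance agreement.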

Comparative Experiment Results
Our SIDA method provides fine-grained distribution alignment to reduce subject-dependence variability in sleep stage classification compared with traditional domain adversarial techniques. We evaluate the classification performance using FeatureNet on raw data from the ISRUC-S1, ISRUC-S3, and SHHS1 datasets (Table 1) and compare the other methods on pre-extracted features with context. The experimental results are presented in Tables 3, 4, and 5.

Across-age Experiment Results
Following Reference [3]

Feature Visualization Analysis
To investigate the impact of the traditional domain adversarial method and our SIDA, we performed feature visualization for several methods by selecting the hidden features before the fully connected classifier and using the t-SNE tool to reduce their dimensions to a two-dimensional plane. As illustrated in Figure 8 (d), (e), and (f) for the ISRUC-S3 dataset, the classification boundary of the GraphSleepNet method is ambiguous, with the features of the same subjects primarily concentrated together, showing significant differences in subject personalization. Moreover, the category boundary of sleep stage N1 is the most unclear and challenging to identify, consistent with our experimental findings. After applying the traditional domain adversarial method, the category boundary becomes distinct and the model's dependency on the subject is weakened, enhancing cross-subject generalization, although misclassifications still occur. With our SIDA method, each category's feature cluster becomes close to circular, with a more apparent boundary and a significantly improved classification effect. Notably, the features of different subjects are evenly distributed, indicating low subject dependency of the model. As the results on the ISRUC-S1 dataset in Figure 8 (a), (b), and (c) show, the size of ISRUC-S1 is 10 times that of the ISRUC-S3 dataset, so classification is difficult for the GraphSleepNet method: the high-dimensional feature representation is hard to separate. After adding the traditional domain adversarial method, the category boundaries of the features become vaguely visible, and each category separates into different clusters. After applying our SIDA method, the features of different categories are further separated. As shown in the SHHS1 dataset results in Figure 8 (h) and (i), the size of the SHHS1 dataset is about 40 times that of the ISRUC-S3 dataset, so classification is especially difficult for sleep stage N1, which has relatively few samples. Hidden features are difficult to assemble into clusters, and many misclassifications occur. After adding the traditional domain adversarial method, the wrong clustering is greatly improved; however, the distance between different categories is relatively close and the interface is unclear. After applying our SIDA method, the features of different categories move further apart, each category has a clear interface, and the distribution of features across subjects becomes uniform.

Confusion Matrix Analysis
As shown in Figures 9, 10, and 11, we analyze the confusion matrices by combining the experimental results of our SIDA method and the comparative methods on the three classical datasets.

Inference Time Analysis
Figure 13 depicts the average inference time per sample for each method during testing on the ISRUC-S1, ISRUC-S3, and SHHS1 datasets, with time measured in milliseconds. To maintain uniform representation, GraphSleepNet+DA in Figure 13 refers to the MSTGCN model. Since our SIDA is an extension of the conventional method, it has more network parameters than the methods with the traditional domain generalization method and the baseline methods without domain generalization. As a result, our model's inference time is longer. As seen in Figure 13(b), our method takes longer to process a single sample than the other methods, but the inference time of a single sample with SIDA increases by less than 25% compared to the methods with the traditional domain generalization method; the time increase is still acceptable. As seen in Figure 13(a) and (c), when incorporating some methods with the SIDA method on the ISRUC-S1 and SHHS1 datasets, the increase in inference time may be insignificant, and the time cost is also within an acceptable range.

CONCLUSION AND FUTURE WORK
Inspired by the structure of sleep stage classification, we propose a plug-and-play method called SIDA. It considers the subject dependency differences between different sleep stage categories and aligns them category-by-category to robustly improve the model's classification accuracy and subject generalization. We integrate mainstream sleep stage classification methods and compare our method against the traditional domain adversarial method and three recent state-of-the-art methods. Experimental results on three classic sleep stage classification datasets show that our SIDA method significantly improves the model's subject generalization and classification accuracy.
Looking to the future, we acknowledge that some challenges remain. Our current soft weighting between the classifier and the category-specific domain discriminators relies on end-to-end neural network learning, which is both convenient and efficient. However, we observe that classifier performance is unsatisfactory in the early stages of training, so the domain discriminator features may be given inappropriate weights by the classifier, introducing training bias. Therefore, our goal is to improve the training process of the domain discriminators by using the training labels to adjust the weight distribution based on the classifier's current prediction accuracy. This approach aims to improve the accuracy and generalization of the model in the early stages of training by correcting the weight distribution according to the classifier's performance.

Fig. 1 .
Fig. 1. The structure of the sleep cycle: Compared to other classification problems for physiological signals in time series, the sleep stage classification problem has a distinctive category-complete structure, along with a specific pattern in the category transitions and cyclic changes of sleep stages. Typically, an entire sleep episode contains five complete sleep cycles, each comprising the five stages ranging from Wake to REM and back.

14:6 S. Ma et al.
pre-extraction of features in sleep stage classification. This pre-extraction of features enhances the rapidity of neural network training while guaranteeing accuracy. Goshtasbi et al. proposed Sleep-FCN. PSG is often employed to record various human body electrical signals during sleep. It contains multi-channel EEG, ECG, EOG, and EMG signals. The PSG signal can be segmented into multi-segment multi-channel signals with 30-second epochs each for sleep stage classification. According to the AASM standard, sleep stages are divided into five stages: Wake, REM, N1, N2, and N3, corresponding to the five categories in sleep stage classification. The sleep stage classification aims to make the model learn the mapping relationship between the input signal and the sleep stage category. The sleep stage classification problem is defined as ŷ_i = G_y(G_f(x_i)), building a sleep stage classification model based on the input sample x_i, where G_f is the feature extractor and G_y is the label classifier. Given the input signal sequence S = (S_(i−d), .

Fig. 2 .
Fig. 2. The SDDC. The varying distributions among subjects are particularly evident for different sleep stage categories. In the figure, different colors represent different subject domains, while different shapes represent different sleep stage categories. The objective of domain generalization is to strengthen model robustness by aligning multiple domains. However, haphazard alignment across subjects can result in the unintended alignment of different categories, which causes considerable inaccuracies in sleep stage classification.

Fig. 3 .
Fig. 3. The overview of the proposed SIDA method. Hidden features from multimodal physiological signals are extracted using CNN, GCN, and so on, and are divided into two streams for sleep stage classification and domain adversarial learning. Category-specific domain discriminators are employed to align subject data in a fine-grained way, promoting positive transfer and preventing negative transfer. The direct connection between the sleep stage classifier and the domain discriminators facilitates the domain adversarial training process.

Fig. 4 .
Fig. 4. This figure illustrates the Direct Dynamic Bridge between the label classifier, G_y, and the category-specific domain discriminator, G^r_d. It shows how G_y's softmax prediction for a particular category (e.g., Wake) is used to dynamically weight the features of the G^r_d corresponding to that category. Specifically, if G_y predicts a value of 0.3 for Wake, then the feature f of the G^r_d corresponding to Wake will be weighted by 0.3. Unlike traditional domain adversarial methods, which lack direct interaction between G_y and G_d, our approach establishes an interpretable and dynamic connection between the two networks at the category level, resulting in improved performance.

ALGORITHM 1: The whole process of the SIDA method
Input: The pre-trained features D_train = {(x_j, y_j, d_j) | j ∈ {1, ..., J}} for training and the pre-trained features {x^te_j | j ∈ {1, ..., J}} for testing
Output: The prediction result of the test data ỹ^te_j
1 Initialize the parameters θ_f, θ_y, and θ_d of G_f, G_y, and G_d, respectively;
2 repeat
3   Extract the feature of x_j with G_f: f_j = G_f(x_j);
4   Calculate the loss L_y of the label classifier with Equation (14);
5   ŷ_j = softmax(f_j), ŷ_j = (ŷ_(j,1), ..., ŷ_(j,R));
6   Calculate the loss L_d of the category-specific domain discriminators with Equation (17);
7   Calculate the total loss L = L_y − L_d;
8   Update θ_f, θ_y, and θ_d by minimizing L;
9 until model convergence;
10 Predict the sleep stage classification results: ỹ^te_j.

-SleepContextNet [57]: The FeatureNet model is utilized for feature pre-extraction, and the SleepContextNet model is exploited to study neural network-based features and classify sleep stages.
-DAN [42]: The FeatureNet model is utilized for feature pre-extraction, and the DAN model is exploited to study neural network-based features and classify sleep stages.
-GraphSleepNet [23]: The FeatureNet model is utilized for feature pre-extraction, and the GraphSleepNet model is exploited to study graph-based features and classify sleep stages.
-MSTGCN [22]: The FeatureNet model is utilized for feature pre-extraction, and the MSTGCN model is exploited to study graph-based features and classify sleep stages. The MSTGCN model combines the traditional domain adversarial method to extract subject-independent information.
-MSTGCN+SIDA: The FeatureNet model is utilized for feature pre-extraction, and the MSTGCN model is exploited to study graph-based features and classify sleep stages. The MSTGCN+SIDA method is integrated with our SIDA to extract subject-independent information.
-JK-STGCN [21]: The FeatureNet model is utilized for feature pre-extraction, and the JK-STGCN method is exploited to study graph-based features and classify sleep stages.
-JK-STGCN+DA: The FeatureNet model is utilized for feature pre-extraction, and the JK-STGCN model is exploited to study graph-based features and classify sleep stages. The JK-STGCN+DA method is integrated with the traditional domain adversarial method to extract subject-independent information.
-JK-STGCN+SIDA: The FeatureNet model is utilized for feature pre-extraction, and the JK-STGCN model is exploited to study graph-based features and classify sleep stages. The JK-STGCN+SIDA method is integrated with our SIDA method to extract subject-independent information.
-MVF-SleepNet [28]: The FeatureNet model is utilized for feature pre-extraction, and the MVF-SleepNet model is exploited to study graph-based features and classify sleep stages.
-MVF-SleepNet+DA: The FeatureNet model is utilized for feature pre-extraction, and the MVF-SleepNet model is exploited to study graph-based features and classify sleep stages.

Fig. 10 .
Fig. 10. The confusion matrix of different methods on the ISRUC-S3 dataset with (a) GraphSleepNet, (b) JK-STGCN, (c) MVF-SleepNet, (d) MSTGCN, (e) JK-STGCN+DA, (f) MVF-SleepNet+DA, (g) MSTGCN+SIDA, (h) JK-STGCN+SIDA, and (i) MVF-SleepNet+SIDA methods. From the confusion matrices of the ISRUC-S1, ISRUC-S3, and SHHS1 datasets, we can see that the classification results are generally good. However, the number of samples of sleep stage N1 is the smallest, and its classification effect is the worst. This phenomenon was also reflected in previous sleep stage classification methods. On the one hand, the number of samples of sleep stage N1 is small. On the other hand, sleep stage N1 is a light sleep between the stages Wake and N2: physiologically, the brain is between a lightly active state and a light sleep state, the signal fluctuation is slight, and the category changes are varied.

Loss and Accuracy Change Analysis
The loss and accuracy changes of the training and validation sets of the ISRUC-S1, ISRUC-S3, and SHHS1 datasets during training are shown in Figure 12. It can be seen from the curves in the figure that the loss and accuracy of the training and validation sets converge well during the training process, and the performance on the three datasets is basically the same. The loss and accuracy of the validation set converge earlier than those of the training set, so there is a certain gap between the effect of the training and validation sets.

Fig. 12 .
Fig. 12. Accuracy and loss changes of the training and validation sets during MSTGCN with SIDA training on the ISRUC-S1 dataset ((a) and (b)), the ISRUC-S3 dataset ((c) and (d)), and the SHHS1 dataset ((e) and (f)).
2, ..., N} denotes features pre-extracted from channel n at epoch i. Sometimes, features are preprocessed by bandpass filters according to the frequency distribution of different signals. However, current sleep stage classification methods generally use full unfiltered features. Suppose we have M subjects; we randomly divide the M subjects into M̄ groups, where M̄ = |M/num| and num is the number of subjects in each group. Group_m = {m_1, ..., m_num}, where {m_1, ..., m_num} is a random group sampled without replacement from the set {1, ..., M}. The data of the M̄ groups constitute M̄ domains (i.e., D_m = {(x_(m,k), y_(m,k)) | k ∈ {1, ..., K}}, where K denotes the number of samples of D_m), and the joint distributions between each pair of domains are different (i.e., P_j1 . The x_t is the input to the memory cell layer at time t. W_ae, W_he, W_ce, W_af, W_hf, W_cf, W_ac, W_hc, W_ao, W_ho, and W_co are weight matrices, and b_e, b_f, b_c, and b_o are bias vectors.

Table 1 .
Data Description

Table 6 .
The Across-age Performance Comparison of the GraphSleepNet method with/without Our SIDA or Traditional Domain Adversarial Method on the ISRUC-S1 Dataset

Table 3 ,
4, and 5 show that we achieved further improvement on three datasets by combining SIDA with the original methods, which also shows superiority compared to several state-of-the-art methods. Results on the ISRUC-S3 dataset demonstrate that MSTGCN+SIDA and MVF-SleepNet+SIDA attained the highest Acc result of 0.7972. Notably, the Acc result of MSTGCN+SIDA increased by over one percentage point compared to the MSTGCN method and about two percentage points compared to GraphSleepNet. The Macro F1 result of MVF-SleepNet+SIDA also reached a high of 0.7882, indicating a significant improvement. Compared to SleepContextNet, the best-performing comparison method, the Acc, Macro F1, and Kappa of MVF-SleepNet+SIDA increased by approximately three percentage points each. On the ISRUC-S1 dataset, MSTGCN+SIDA achieved the highest Acc result of 0.8004 due to the larger number of samples. Notably, the Macro F1 and Kappa results of MSTGCN+SIDA also reached high values of 0.7792 and 0.7411, increasing by about one percentage point each compared to the MSTGCN method. In comparison to the best-performing method MaskSleepNet, the Acc, Macro F1, and Kappa of MSTGCN+SIDA increased by approximately five, four, and six percentage points, respectively, which are significant improvements. On the SHHS1 dataset, JK-STGCN+SIDA attained the highest Acc of 0.8843, Macro F1 of 0.8048, and Kappa of 0.8366. Compared to the JK-STGCN method without SIDA, the results increased by approximately one percentage point. The Acc, Macro F1, and Kappa of JK-STGCN+SIDA also increased by about four, seven, and five percentage points, respectively, compared to the best-performing method DAN. Our experiments visualized the changes in sleep stage categories throughout an entire sleep episode when fusing other methods with and without SIDA. The classification performance is excellent, with most misclassifications occurring during sleep stage transitions, which are typically difficult to detect in medical practice, as demonstrated in Figures 5, 6, and 7 for the ISRUC-S1, ISRUC-S3, and SHHS1 datasets, respectively.
, we divided the ISRUC-S1 dataset into three age groups: 34 pediatric subjects, 32 adults, and 34 older adults, treating each group as a domain. Our experiments utilized the GraphSleepNet, MSTGCN, and MSTGCN+SIDA methods to evaluate performance on the different age groups. Table 6 displays that our MSTGCN+SIDA method outperforms both the MSTGCN and GraphSleepNet methods. Notably, we discovered a noteworthy phenomenon: the model consistently identifies younger groups better than older groups, which warrants further exploration in future research.