ADATIME: A Benchmarking Suite for Domain Adaptation on Time Series Data

Unsupervised domain adaptation methods aim to generalize well on unlabeled test data that may have a different (shifted) distribution from the training data. Such methods are typically developed on image data, and their application to time series data is less explored. Existing works on time series domain adaptation suffer from inconsistencies in evaluation schemes, datasets, and backbone neural network architectures. Moreover, labeled target data are often used for model selection, which violates the fundamental assumption of unsupervised domain adaptation. To address these issues, we develop a benchmarking evaluation suite (AdaTime) to systematically and fairly evaluate different domain adaptation methods on time series data. Specifically, we standardize the backbone neural network architectures and benchmarking datasets, while also exploring more realistic model selection approaches that can work with no labeled data or just a few labeled samples. Our evaluation includes adapting state-of-the-art visual domain adaptation methods to time series data as well as the recent methods specifically developed for time series data. We conduct extensive experiments to evaluate 11 state-of-the-art methods on five representative datasets spanning 50 cross-domain scenarios. Our results suggest that with careful selection of hyper-parameters, visual domain adaptation methods are competitive with methods proposed for time series domain adaptation. In addition, we find that hyper-parameters can be selected based on realistic model selection approaches. Our work unveils practical insights for applying domain adaptation methods on time series data and builds a solid foundation for future work in the field. The code is available at \href{https://github.com/emadeldeen24/AdaTime}{github.com/emadeldeen24/AdaTime}.

Fig. 1. Our benchmarking suite AdaTime consists of three main steps: Data Preparation, Domain Adaptation, and Model Selection. We first prepare the train and test data for both source and target domains (i.e., $X^S_{\text{train}}$, $X^S_{\text{test}}$, $X^T_{\text{train}}$, $X^T_{\text{test}}$). Then the training sets of the source and target domains are passed through the backbone network to extract the corresponding features. The domain alignment algorithm being evaluated is then used to address the distribution shift between the two domains. Last, given a specific risk type, we calculate the risk value for all candidate models and select the hyper-parameters of the one achieving the lowest risk. The selected model is then used to report test results on the target domain test set (best viewed in color).
• We systematically and fairly evaluate existing UDA methods on time series data. To the best of our knowledge, this is the first work to benchmark different UDA methods on time series data.
• We develop a benchmarking evaluation suite (AdaTime) that uses a standardized evaluation scheme and realistic model selection techniques to evaluate different UDA methods on time series data.
• We evaluate 11 state-of-the-art UDA methods on five representative time series datasets spanning 50 cross-domain scenarios, and present comprehensive conclusions and recommendations for the TS-UDA problem. These evaluation results and analyses can provide a systematic guideline for future research on TS-UDA.
The remainder of this paper is organized as follows. In Section 2, we define the unsupervised domain adaptation problem and how adaptation is generally achieved. Section 3 describes the main components of our AdaTime suite, including the benchmarking datasets, unified backbone networks, adapted UDA algorithms, model selection approaches, and unified evaluation schemes. Section 4 presents the evaluation results and discusses the main findings of our experiments. Section 5 presents the main conclusions and recommendations.
2 DOMAIN ADAPTATION

Problem Formulation
We start by defining the unsupervised domain adaptation problem. We assume access to labeled data from a source domain, $X^S = \{(x^S_i, y^S_i)\}_{i=1}^{n_S}$, representing univariate or multivariate time series data, and unlabeled data from a target domain, $X^T = \{x^T_j\}_{j=1}^{n_T}$, where the two domains share the same label space but differ in their marginal distributions, i.e., $P(X^S) \neq P(X^T)$.

General Approach
The mainstream approach of UDA algorithms is to address the domain shift problem by finding a domain-invariant feature representation. Formally, given a feature extractor network $f: \mathcal{X} \to \mathcal{Z}$, which transforms the input space into the feature space, the UDA algorithm mainly optimizes the feature extractor to minimize a domain alignment loss $\mathcal{L}_{\text{align}}$, aiming to mitigate the distribution shift between the source and target domains such that $P(f(X^S)) = P(f(X^T))$.
The domain alignment loss can be estimated either from a statistical distance measure or from an adversarial discriminator network, which can be formalized as
$\mathcal{L}_{\text{align}} = \ell\big(f(X^S), f(X^T)\big),$
where $\ell$ can be a statistical distance or an adversarial loss.
Concurrently, a classifier network $h$ is applied on top of the feature extractor to map the encoded features to the corresponding class probabilities. In particular, given the source domain features generated by the feature extractor, we compute the output probabilities $p = h(f(x^S))$. The source classification loss can then be formalized as
$\mathcal{L}_{\text{cls}} = -\,\mathbb{E}_{(x^S, y^S)} \sum_{k=1}^{K} \mathbb{1}[y^S = k] \log p_k,$
where $\mathbb{1}$ is the indicator function, which is set to 1 when the condition is met and to 0 otherwise.
Both the source classification loss $\mathcal{L}_{\text{cls}}$ and the domain alignment loss $\mathcal{L}_{\text{align}}$ are jointly optimized to mitigate the domain shift while learning the source classification task, which can be expressed as $\min_{f,h}\ \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{align}}$.
We refer to the composition of the feature extractor $f$ and the classifier network $h$ as the model $\phi$, such that $\phi = h(f(\cdot))$.
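As an illustration of this joint objective, the following is a minimal PyTorch sketch of one training step; the trade-off weight `lam`, the network interfaces, and the `alignment_loss` callable are illustrative assumptions rather than AdaTime's actual implementation.

```python
# A minimal sketch of the joint objective min_{f,h} L_cls + L_align.
import torch
import torch.nn.functional as F

def train_step(f, h, alignment_loss, opt, x_src, y_src, x_trg, lam=1.0):
    """One update of the feature extractor f and classifier h (lam is assumed)."""
    z_src, z_trg = f(x_src), f(x_trg)             # shared feature extractor
    cls_loss = F.cross_entropy(h(z_src), y_src)   # source classification loss
    align_loss = alignment_loss(z_src, z_trg)     # statistical distance or adversarial loss
    loss = cls_loss + lam * align_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```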
3 ADATIME: A BENCHMARKING APPROACH FOR TIME SERIES DOMAIN ADAPTATION

Framework Design
The key motivation for our approach is to address the inconsistent evaluation schemes, datasets, and backbone networks.
Such inconsistencies can inflate performance and be misattributed to the proposed UDA method. Therefore, we design our benchmarking framework to address these issues while ensuring fair evaluation across different UDA methods. For example, to remove the effect of different backbone networks, we use the same backbone network when comparing different UDA methods. Additionally, we standardize the benchmarked datasets and their preprocessing schemes when evaluating any UDA method. Table 1 summarizes the existing experimental flaws and our corresponding design decisions.

Framework Overview
In this work, we systematically evaluate different UDA algorithms on time series data, ensuring fair and realistic procedures. Fig. 1 shows the AdaTime workflow, which proceeds as follows. Given a dataset, we first apply our standard data preparation schemes to both domains, including slicing, splitting into train/test portions, and normalization.
Subsequently, the backbone network extracts the source and target features $Z^S$ and $Z^T$ from the source and target training data, respectively. The selected UDA algorithm is then applied to mitigate the distribution shift between the extracted features of the two domains. We broadly categorize the adopted UDA algorithms into discrepancy-based and adversarial-based approaches. Last, to set the hyper-parameters of the UDA algorithm, we consider three practical model selection approaches that require either no target domain labels or only a few labeled samples: the source risk (SRC), the deep embedded validation risk (DEV) [28], and the few-shot target risk (FST). Our evaluation pipeline standardizes experimental procedures, preventing extraneous factors from affecting performance and thus enabling fair comparison between different UDA methods.
The code of AdaTime is publicly available to enable seamless evaluation of different UDA methods on time series data. Integrating a new algorithm or dataset into AdaTime requires adding only a few lines of code.
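To illustrate how such an extension could look, here is a hypothetical registry pattern; the names `ALGORITHMS`, `register`, and `MyNewMethod` are invented for this sketch and are not AdaTime's actual API.

```python
# Hypothetical sketch of a registry-based plugin mechanism for new UDA algorithms.
ALGORITHMS = {}

def register(name):
    """Decorator that makes an algorithm discoverable by name."""
    def deco(cls):
        ALGORITHMS[name] = cls
        return cls
    return deco

@register("MyNewMethod")
class MyNewMethod:
    def __init__(self, backbone, hparams):
        self.backbone, self.hparams = backbone, hparams

    def update(self, src_batch, trg_batch):
        raise NotImplementedError  # alignment + classification losses go here
```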

Benchmarking Datasets
We select the most commonly used time series datasets from real-world applications, including human activity recognition and sleep stage classification. The benchmark datasets span a range of characteristics, including complexity, sensor type, sample size, class distribution, and severity of domain shift, enabling a broader evaluation. For sleep stage classification, we use the Sleep-EDF dataset [32], which contains EEG readings from 20 healthy subjects. Following previous studies [33], we select a single channel (i.e., Fpz-Cz) and include 10 subjects to construct the five cross-domain scenarios.
3.3.5 MFD. The Machine Fault Diagnosis (MFD) dataset [34] was collected by Paderborn University to identify various types of incipient faults using vibration signals. The data were collected under four different operating conditions, and in our experiments, each condition is treated as a separate domain. We use five cross-condition scenarios to evaluate domain adaptation performance. Following previous works [24,35], each sample consists of a single univariate channel with 5,120 data points.

Backbone Networks

The backbone feature extractor is a core component of any UDA method. However, some previous TS-UDA works adopted different backbone architectures when comparing against baseline methods, leading to inaccurate conclusions.
To tackle this problem, we design AdaTime to ensure that the same backbone network is used when comparing different UDA algorithms, promoting fair evaluation protocols. Furthermore, to better select a suitable backbone network for TS-UDA applications, we experiment with three different backbone architectures:
• 1-dimensional convolutional neural network (1D-CNN): consists of three convolutional blocks, where each block comprises a 1D convolutional layer, a BatchNorm layer, a ReLU activation, and a MaxPooling layer [4,9] (a minimal sketch follows this list).
• 1-dimensional residual network (1D-ResNet): a deep residual network that relies on shortcut residual connections between successive convolutional layers [1,36]. We use 1D-ResNet18 in our experiments.
• Temporal convolutional neural network (TCN): uses causal dilated convolutions to prevent information leakage across different convolutional layers and to learn temporal characteristics of time series data [37,38].
These architectures are widely used for time series data analytics and differ in terms of their complexity and the number of trainable parameters.
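For concreteness, a minimal sketch of the 1D-CNN backbone described above might look as follows; the channel widths, kernel sizes, and pooling configuration are assumptions, not the exact AdaTime settings.

```python
# A sketch of one Conv1d -> BatchNorm -> ReLU -> MaxPool block, stacked three times.
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel_size=8):
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size, padding="same"),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=2, stride=2),
    )

backbone = nn.Sequential(      # three blocks, as described in the list above
    conv_block(9, 64),         # e.g., 9 input channels (an assumption, as for UCIHAR)
    conv_block(64, 128),
    conv_block(128, 128),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),              # -> feature vector z
)
```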

Domain Adaptation Algorithms
While numerous UDA approaches have been proposed to address the domain shift problem [39], a comprehensive review of existing UDA methods is beyond our scope. Besides including state-of-the-art methods proposed for time series data, we also include prevalent visual UDA methods that can be adapted to time series. Overall, the algorithms implemented in AdaTime can be broadly classified according to the domain adaptation strategy into discrepancy-based and adversarial-based methods. Discrepancy-based methods aim to minimize a statistical distance between source and target features to mitigate the domain shift [13][14][15], while adversarial-based methods leverage a domain discriminator network that forces the feature extractor to produce domain-invariant features [40,41]. Another way to classify UDA methods is by which distribution is aligned: some algorithms align only the marginal distribution of the feature space [8, 13-15, 40, 42], while others jointly align the marginal and conditional distributions [16,20,41,43], allowing fine-grained class alignment.
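As an example of a statistical distance used by discrepancy-based methods, the following sketch computes a (biased) RBF-kernel maximum mean discrepancy between batches of source and target features; the single fixed bandwidth is an assumption, and methods such as DSAN use more elaborate variants (e.g., LMMD).

```python
# A sketch of a biased RBF-kernel MMD estimate between two feature batches.
import torch

def rbf_mmd(z_src, z_trg, sigma=1.0):
    def k(a, b):
        d = torch.cdist(a, b) ** 2                 # pairwise squared distances
        return torch.exp(-d / (2 * sigma ** 2))    # RBF kernel (bandwidth assumed)
    return (k(z_src, z_src).mean()
            + k(z_trg, z_trg).mean()
            - 2 * k(z_src, z_trg).mean())
```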
The selected UDA algorithms include, for example, DSAN [16], which minimizes the discrepancy between source and target domains via a local maximum mean discrepancy (LMMD) that aligns relevant subdomain distributions. Table 3 summarizes the selected methods, showing the application for which each method was originally proposed, its classification by domain adaptation strategy (i.e., whether it relies on a discrepancy measure or adversarial training), its classification by the category of the aligned distribution (i.e., marginal or joint), the losses used in each method, and the risk each UDA method adopted to tune its hyper-parameters.
It is worth noting that our AdaTime mainly focuses on the time series classification problem, so we excluded methods proposed for time series prediction/forecasting.

Model Selection Approaches
Model selection and hyper-parameter tuning are long-standing, non-trivial problems in UDA due to the absence of target domain labels. Throughout the literature, we find that the experimental setups of existing works leverage target domain labels to select hyper-parameters, which violates the primary assumption of UDA. This is further clarified in Table 3, where five out of the 11 adopted UDA works use the target risk (i.e., target domain labels) to select their hyper-parameters, while another three use fixed hyper-parameters without describing how they were selected. To address this issue, we evaluate multiple realistic model selection approaches that do not require any target domain labels, namely the source risk [19] and the Deep Embedded Validation (DEV) risk [28]. In addition, we design a few-shot target risk that utilizes an affordable few labeled samples from the target domain. The following subsections explain the risk calculation for each model selection approach.
Formally, the best model is selected as
$\phi^* = \arg\min_{\phi}\ \mathcal{R}^*(\phi),$
where $\phi^*$ is the model that achieves the minimum risk value, and $\mathcal{R}^* \in \{\mathcal{R}_{\text{SRC}}, \mathcal{R}_{\text{DEV}}, \mathcal{R}_{\text{FST}}, \mathcal{R}_{\text{TGT}}\}$ can be any of the model selection approaches described below.
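In code, this selection rule amounts to a simple loop over candidate hyper-parameter settings; `train` and `risk_fn` below are placeholders for the UDA method being tuned and the chosen risk estimator.

```python
# A sketch of the selection rule phi* = argmin_phi R*(phi).
def select_model(hparam_grid, train, risk_fn):
    best_model, best_risk = None, float("inf")
    for hp in hparam_grid:
        model = train(hp)     # candidate model for this hyper-parameter setting
        r = risk_fn(model)    # R_SRC, R_DEV, R_FST, or R_TGT
        if r < best_risk:
            best_model, best_risk = model, r
    return best_model
```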
3.6.2 Source Risk (SRC). In this approach, we select the candidate model that achieves the minimum cross-entropy loss on a test set from the source domain. This risk can therefore be applied without any additional labeling effort, as it relies on existing labels from the source domain [19]. Given the source domain test data $(X^S_{\text{test}}, Y^S_{\text{test}})$ and a candidate model $\phi$, we calculate the corresponding source risk as
$\mathcal{R}_{\text{SRC}}(\phi) = \mathbb{E}_{(x,y) \sim (X^S_{\text{test}}, Y^S_{\text{test}})}\, \ell\big(\phi(x), y\big),$
where $\ell$ is the cross-entropy loss. Despite its simplicity, the effectiveness of the source risk is mainly influenced by the sample size of the source data and the severity of the distribution shift. When the distribution shift is large and the source sample size is small, the source risk may be less effective than the target risk. However, the source risk can be estimated using only source labels, whereas the target risk requires labeled data from the target domain.
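A minimal sketch of the source risk, assuming per-batch cross-entropy over a standard PyTorch data loader:

```python
# A sketch of R_SRC: average cross-entropy on the labeled source test set.
import torch
import torch.nn.functional as F

@torch.no_grad()
def source_risk(model, src_test_loader):
    losses = []
    for x, y in src_test_loader:
        losses.append(F.cross_entropy(model(x), y, reduction="none"))
    return torch.cat(losses).mean().item()
```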

3.6.3 DEV Risk.
This approach [28] aims to find an unbiased estimator of the target risk. The key idea is to account for the correlation between the source and target features during the risk calculation. More specifically, the DEV method puts larger weights on the source features that are highly correlated with the target features, while giving lower weights to less correlated features. To do so, an importance-weighting scheme is applied in the feature space. Given the source domain training features $Z^S_{\text{train}}$, the source domain test features $Z^S_{\text{test}}$, and the target domain training features $Z^T_{\text{train}}$, we first train a two-layer logistic regression model $d$ to discriminate between $Z^S_{\text{train}}$ and $Z^T_{\text{train}}$ (labeling features from $Z^S_{\text{train}}$ as 1 and those from $Z^T_{\text{train}}$ as 0). Subsequently, we leverage the trained $d$ to compute the importance weight for each sample of the source test set:
$w(z) = \frac{n_S}{n_T} \cdot \frac{1 - d(z)}{d(z)},$
where $n_S / n_T$ is the sample size ratio of the two domains. It is worth noting that this ratio can be computed without any target labels. Given the importance weight of each source test sample, we compute the corresponding weighted cross-entropy loss $L_i = w(z_i)\, \ell\big(\phi(x_i), y_i\big)$ for the test samples of the source domain, where $\phi$ is one candidate model. Given the weighted source losses $L = \{L_i\}$ and their corresponding importance weights $W = \{w(z_i)\}$, we compute the DEV risk as
$\mathcal{R}_{\text{DEV}}(\phi) = \mathrm{mean}(L) + \eta\, \mathrm{mean}(W) - \eta,$
where $\eta = -\frac{\mathrm{Cov}(L, W)}{\mathrm{Var}(W)}$ is the optimal coefficient. The DEV risk can be more effective than the source risk. However, we observed in our experiments that DEV may be unstable on smaller source and target datasets and adds computational overhead. Nevertheless, the DEV risk is still a more practical solution than the target risk, as it does not require any target labels.
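The following sketch estimates the DEV risk under the conventions above; for simplicity it substitutes scikit-learn's linear logistic regression for the two-layer model, which is an assumption of this sketch.

```python
# A sketch of the DEV risk (source features labeled 1, target features 0).
import numpy as np
from sklearn.linear_model import LogisticRegression

def dev_risk(z_src_tr, z_trg_tr, z_src_te, losses_src_te):
    """losses_src_te: per-sample cross-entropy of the candidate model on source test."""
    # Train a domain discriminator on the extracted features.
    X = np.vstack([z_src_tr, z_trg_tr])
    y = np.r_[np.ones(len(z_src_tr)), np.zeros(len(z_trg_tr))]
    d = LogisticRegression(max_iter=1000).fit(X, y)

    # Importance weights for the source test features.
    p_src = d.predict_proba(z_src_te)[:, 1]   # P(domain = source | z)
    ratio = len(z_src_tr) / len(z_trg_tr)     # n_S / n_T, no target labels needed
    W = ratio * (1.0 - p_src) / p_src

    # Control-variate correction with the optimal coefficient eta.
    L = W * losses_src_te
    eta = -np.cov(L, W, ddof=1)[0, 1] / np.var(W, ddof=1)
    return L.mean() + eta * W.mean() - eta
```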

3.6.4 Target Risk (TGT).
This approach involves leaving out a large subset of target domain samples and their labels as a validation set and using them to select the best candidate model. Using this risk naturally yields the best-performing hyper-parameters on the target domain. This can be seen as the upper bound for the performance of a UDA method.
The target risk is calculated as
$\mathcal{R}_{\text{TGT}}(\phi) = \mathbb{E}_{(x,y) \sim (X^T_{\text{val}}, Y^T_{\text{val}})}\, \ell\big(\phi(x), y\big).$
Even though this approach is impractical in unsupervised settings, it has been used for model selection in many previous UDA papers [41,46].

3.6.5 Few-Shot Target (FST) Risk.
We propose the few-shot target risk as a more practical alternative to the target risk. Our goal in introducing it is to provide a realistic model selection method for unsupervised domain adaptation: labeling a small number of target samples (i.e., few-shot labeling) can be practical and affordable. Formally, we use a small labeled set of target domain samples $(X^T_{\text{few}}, Y^T_{\text{few}})$ as a validation set to select the best candidate model. The few-shot target risk is calculated as
$\mathcal{R}_{\text{FST}}(\phi) = \mathbb{E}_{(x,y) \sim (X^T_{\text{few}}, Y^T_{\text{few}})}\, \ell\big(\phi(x), y\big).$

Evaluation Schemes

Data splitting. Next, we divide the data into training and testing sets. Specifically, we split the data from each subject into stratified 70%/30% splits, ensuring that the test set includes samples from all classes in the dataset. It is worth noting that we do not use a separate validation set for either the source or target domains; instead of tuning hyper-parameters on a validation set, we select the best model with the risk calculation methods of Section 3.6, which rely on the source training data, source test data, target training data, and at most a few labeled target samples.
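As an illustration, the stratified 70%/30% split for a single subject could be implemented with scikit-learn as follows; the array shapes and class count are dummy assumptions.

```python
# A sketch of the stratified 70/30 train/test split for one subject.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 9, 128)       # dummy: 1000 windows, 9 channels, 128 steps
y = np.random.randint(0, 6, size=1000)  # dummy activity labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42  # stratified 70/30
)
```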
Normalization. Normalization is a crucial step in training deep learning models, as it can accelerate convergence and improve performance. In this work, we apply Z-score normalization to both the training and testing splits of the data, using
$\hat{x} = \frac{x - \mu}{\sigma},$
where $\mu$ and $\sigma$ are the mean and standard deviation of the data.

Evaluation metric. Since several of the adopted datasets are class-imbalanced (see Fig. S1 in the supplementary materials), the accuracy metric may not be representative of the performance of the UDA methods. Therefore, we report macro F1-scores instead, which account for how the data is distributed and penalize false negative predictions.
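A minimal sketch of the Z-score normalization step; computing the statistics per channel on the training split and reusing them for the test split is an assumption of this sketch.

```python
# A sketch of per-channel Z-score normalization, x_hat = (x - mu) / sigma.
import numpy as np

def zscore_fit(train):  # train: array of shape (N, C, T)
    mu = train.mean(axis=(0, 2), keepdims=True)
    sigma = train.std(axis=(0, 2), keepdims=True) + 1e-8  # avoid division by zero
    return mu, sigma

def zscore_apply(x, mu, sigma):
    return (x - mu) / sigma
```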

RESULTS AND DISCUSSIONS
In this section, we first investigate the contributions of different backbone networks to the performance of UDA algorithms. Subsequently, we study the performance of different model selection techniques on the benchmark datasets.
Last, we discuss the main findings of our experiments.

Evaluation of Backbone Networks
To investigate the impact of the backbone network on model performance, we evaluate all the UDA algorithms under three different backbone networks: 1D-CNN, 1D-ResNet, and TCN (described in Section 3.4). To better assess the different backbones, we experiment on datasets with different scales, i.e., the small-scale UCIHAR dataset and the large-scale HHAR dataset, and report the average performance over all cross-domain scenarios, as shown in Fig. 2. We also conducted additional experiments using a Long Short-Term Memory (LSTM) backbone, comparing its performance to the CNN-based backbones across ten different unsupervised domain adaptation methods on the UCIHAR dataset.
Our results show that LSTM performed significantly worse than all the other CNN-based approaches. This may be due to the lower capacity of LSTM at modeling local patterns and producing class-discriminative features [37], as well as its difficulty in handling long-term dependencies [47], which are common in many time series applications. Detailed results of the LSTM experiment are provided in the supplementary materials.

Evaluation of Model Selection Strategies
In this experiment, we evaluate the performance of the various model selection approaches, i.e., SRC, DEV, FST, and TGT (described in Section 3.6), across the UDA methods. We first fix the backbone network to 1D-CNN due to its stable performance and computational efficiency. Then, for each UDA algorithm, we choose the best model according to each model selection strategy and test its performance on the target domain data. To summarize, when obtaining target labels is cost-prohibitive, both the source and DEV risks offer viable solutions, as neither requires target labels. The appropriate choice between the two depends on the characteristics of the dataset and the available computational resources: while the DEV risk is more robust on class-imbalanced datasets, the source risk is computationally cheaper. On the other hand, if a small amount of labeled target data is available, the few-shot risk can be the best choice, as it achieves performance competitive with the target risk using only a small number of labeled samples, provided the target dataset has balanced classes.

Discussions
AdaTime provides a unified framework to evaluate different UDA methods on time series data. To explore the advantage of one UDA method over the others, we fixed the backbone network, the evaluation schemes, and the model selection strategy. We unveil the following insights.
Domain gap of different datasets. We conducted experiments on two small-scale and two large-scale datasets. Regardless of the dataset size, all the adopted datasets exhibit a considerable domain gap, as shown in Table 5.

Visual UDA methods achieve comparable performance to TS-UDA methods on time series data. Exploring Table 4 further, we find that, surprisingly, the performance of visual UDA methods is competitive with, or even better than, TS-UDA methods. This finding is consistent across all model selection strategies and benchmark datasets. For example, under the TGT risk, methods proposed for visual applications, such as DIRT-T and DSAN, perform better than CoDATS and AdvSKM on the four datasets. A possible explanation is that all the selected UDA algorithms operate on the vectorized feature space produced by the backbone network, which is independent of the input data modality. This finding suggests that, with a standard backbone network, visual UDA algorithms can be strong baselines for TS-UDA.
Methods with joint distribution alignment tend to perform consistently better.

Accuracy should not be used to measure performance on imbalanced data. It is well known that accuracy is not a reliable metric for evaluating classifiers on imbalanced datasets. Despite this, many existing TS-UDA methods use accuracy in their evaluations [3,8,23]. Our experiments reveal that using accuracy or F1-score alone can lead to inconsistent conclusions on imbalanced datasets such as WISDM. This highlights the need to consider the imbalanced nature of most time series data when evaluating classifier performance.
To illustrate this point, we present the results of our experiments in terms of both accuracy and F1-score on four datasets: WISDM, SSC, UCIHAR, and HHAR. WISDM and SSC are imbalanced, while UCIHAR and HHAR are mostly balanced. As shown in Fig. 3, on the imbalanced WISDM dataset (Fig. 3(b)), CDAN achieves higher accuracy than methods such as DDC, MMDA, and DSAN, yet has one of the worst F1-scores. In contrast, the results on the balanced UCIHAR dataset (Fig. 3(a)) show that accuracy can still be a representative performance measure, closely tracking the F1-score. Therefore, we recommend using the F1-score as the performance measure in all TS-UDA experiments.
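A small worked example of why accuracy can mislead on imbalanced data, using scikit-learn's metrics:

```python
# Accuracy vs. macro F1 on a degenerate classifier over imbalanced labels.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10   # imbalanced: 90% class 0
y_pred = [0] * 100             # classifier that only ever predicts class 0

print(accuracy_score(y_true, y_pred))             # 0.90 -- looks strong
print(f1_score(y_true, y_pred, average="macro"))  # ~0.47 -- reveals the failure
```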
Effect of labeling budget on few-shot risk performance. To investigate the influence of the number of labeled samples on the few-shot risk, we conducted additional experiments using 10 and 15 labeled samples per class on the UCIHAR dataset. The results in Table 6 show that the overall performance remains relatively consistent regardless of the number of few-shot labeled samples, indicating the robustness of the few-shot risk to this hyper-parameter.

Limitations and future works. One limitation of our study is that it focuses only on time series classification. In future work, we plan to extend the scope of our benchmarking to include time series regression and forecasting tasks.
Additionally, we have only considered the closed-set domain adaptation scenario, where the source and target domains share the same classes. In future work, we aim to consider partial and open-set domain adaptation scenarios, which are common in time series applications and involve differing class sets between the source and target domains.

CONCLUSIONS AND RECOMMENDATIONS
In this work, we presented AdaTime, a systematic suite for evaluating existing domain adaptation methods on time series data. To ensure fair and realistic evaluation, we standardized the benchmarking datasets, evaluation schemes, and backbone networks across domain adaptation methods. Moreover, we explored more realistic model selection approaches that work without target domain labels or with only a few labeled samples. Based on our systematic study, we provide the following recommendations. First, visual UDA methods can be applied to time series data and are strong candidate baselines. Second, we can rely on more realistic model selection strategies that do not require target domain labels, such as the source risk and DEV risk, to achieve reliable performance. Third, we recommend conducting experiments on large-scale datasets and fixing the backbone network across UDA baselines to obtain reliable results. We also suggest adopting the F1-score instead of accuracy as the performance measure to avoid misleading results on imbalanced datasets. Lastly, we believe that incorporating time series-specific domain knowledge into the design of UDA methods is a promising direction moving forward.

SI CLASS DISTRIBUTION OF DIFFERENT SUBJECTS
This section visualizes the class distribution for each selected subject across all datasets. Specifically, Figure S1(a) depicts the class distribution of subjects in the UCIHAR dataset, where it is noted that all subjects possess data for every class.
Conversely, as shown in Figure S1(b), certain subjects within the WISDM dataset lack data for select classes.


SII DETAILED PARAMETER RANGES FOR THE HYPER-PARAMETER SEARCH
We provide the detailed ranges for each parameter of all selected domain adaptation methods in Table S1. We tuned the learning rate over the same range for all UDA algorithms, while we chose different ranges for each method-specific loss in its respective UDA method.

SIV PERFORMANCE OF LSTM-BASED BACKBONE NETWORK
Recurrent Neural Networks (RNNs) are widely used for time series forecasting and regression due to their ability to learn the temporal dynamics of time series signals. However, there are several limitations to using RNNs for time series classification tasks. First, compared to CNN-based approaches, RNNs are less effective at modeling local patterns and producing class-discriminative features, which can negatively impact their classification performance. Second, RNNs struggle to handle long-term dependencies common in many time series applications. For example, the sleep stage classification (SSC) dataset has a sequence length of 3000 timesteps per sample, which can be challenging for RNNs to process. Lastly, while CNN-based backbones can be efficiently trained using parallel computations, RNNs require sequential computation, leading to longer training times. Given these limitations, we decided to focus on CNN-based backbones in our experiments. However, we also conducted experiments using an LSTM backbone network.
We compared its performance to other CNN-based backbones on ten different UDA methods on the UCIHAR dataset, as shown in Table S2. The results of our experiments show that the LSTM backbone performs significantly worse than all the other CNN-based backbones, demonstrating the deficiency of RNNs on time series classification tasks.

SV EFFECT OF MODEL SELECTION STRATEGY ON METHOD RANKINGS

To demonstrate the impact of model selection on performance, we evaluated the various unsupervised domain adaptation (UDA) methods on the SSC dataset using multiple model selection strategies. The results in Table S3 reveal that the ranking of the UDA methods varies according to the model selection strategy employed. This demonstrates the importance of carefully choosing an appropriate model selection strategy for the domain adaptation task.

SVI DETAILED RESULTS OF OUR FRAMEWORK
This subsection provides detailed results of the 10 scenarios for each dataset. Specifically, Tables S4, S5, S6, S7, and S8 show the mean and standard deviation for each cross-domain scenario in the UCIHAR, WISDM, SSC, HHAR, and MFD datasets, respectively.

Table S7. Detailed results of 10 scenarios on the HHAR dataset in terms of MF1 score.
Table S8. Detailed results of 10 scenarios on the MFD dataset in terms of MF1 score.