Generalizable Low-Resource Activity Recognition with Diverse and Discriminative Representation Learning

Human activity recognition (HAR) is a time series classification task that focuses on identifying motion patterns from human sensor readings. Adequate data is essential for training a generalizable HAR model, which assists the customization and optimization of online web applications, yet obtaining it is a major bottleneck. In practice, collecting large-scale labeled data is costly in both time and money, i.e., the low-resource challenge. Meanwhile, data collected from different persons exhibit distribution shifts due to different living habits, body shapes, age groups, etc. The low-resource and distribution shift challenges are detrimental to HAR when applying the trained model to new unseen subjects. In this paper, we propose a novel approach called Diverse and Discriminative representation Learning (DDLearn) for generalizable low-resource HAR. DDLearn simultaneously considers diversity and discrimination learning. With the constructed self-supervised learning task, DDLearn enlarges the data diversity and explores the latent activity properties. Then, we propose a diversity preservation module to preserve the diversity of learned features by enlarging the distribution divergence between the original and augmented domains. Meanwhile, DDLearn also enhances semantic discrimination by learning discriminative representations with supervised contrastive learning. Extensive experiments on three public HAR datasets demonstrate that our method significantly outperforms state-of-the-art methods by an average accuracy improvement of 9.5% under the low-resource distribution shift scenarios, while being a generic, explainable, and flexible framework. Code is available at: https://github.com/microsoft/robustlearn.


INTRODUCTION
Human Activity Recognition (HAR) plays an indispensable role in ubiquitous computing. The goal of HAR is to build models that recognize users' activities by leveraging data such as the acceleration and rotation angles recorded by inertial measurement units or other sensor devices. With the development of machine learning and deep learning techniques, wearable sensor-based HAR has been applied to many real-life scenarios such as fatigue detection during driving [4], fall detection for the elderly [30], healthcare in daily life, and diagnosis of Parkinson's disease [7,13]. These applications show a strong need for generalizable and accurate HAR algorithms in real-life scenarios.
Despite the great progress, current HAR still faces two critical challenges that prevent us from building a generalizable model that performs well on unseen data. First, the low-resource problem. While massive labeled data is indispensable for training powerful deep learning models, it is difficult to collect and annotate sufficient user data in reality. As shown in Figure 1(a), the limited training data (the solid line) cannot represent the diverse activity patterns that may change over time (the dashed lines). Therefore, the collected low-resource data fails to represent the rich and diverse patterns needed to learn a generalized model. Second, the distribution shift problem. As shown in Figure 1(b), the data collected from different people have distribution discrepancies due to their different living habits, body shapes, age groups, etc. Building models without considering the non-IID (i.e., not independent and identically distributed) situation may result in great performance degradation, since traditional machine learning often assumes that the training and test data are IID. For example, a model trained on the data from existing patients may fail if applied to new patients with totally different physical conditions.

[Figure 1: (a) Low resource. (b) Distribution shift.]

How to solve the generalizable low-resource HAR problem? Domain generalization (DG) [45] is an emerging paradigm for solving the distribution shift problem between the source and target data. Different from domain adaptation, where the target data can be accessed in the training procedure, DG cannot access the target data, which is more challenging. Unfortunately, existing DG algorithms cannot be used directly for our low-resource distribution shift problem, since limited training data might fail to capture the diverse patterns, which undermines the generalization ability.
In this paper, we propose a novel approach named Diverse and Discriminative representation Learning (DDLearn) for generalizable low-resource HAR. Our goal is to learn diverse and discriminative representations that can generalize to unseen target data with limited training samples. We design a self-supervision task with specially designed sensor data transformations to solve the low-resource problem. To further tackle the distribution shift problem, the learned representations should be more diverse and discriminative to ensure stronger generalization capability. The main modules of DDLearn are illustrated in Figure 2: diversity generation, diversity preservation, and discrimination enhancement. With data augmentation and the self-supervised auxiliary task, we expand the diversity of the data space within the semantic scope. With diversity preservation, we preserve the diversity between the learned representations of the generated and original data, avoiding the degradation of diversity during the feature extraction procedure. To achieve accurate activity recognition, semantic discrimination capability is also essential. Thus, we further enhance the discriminating capability of representation learning.
Our main contributions are summarized as follows:
We aim to tackle a more challenging and realistic problem: generalizable low-resource HAR. This brings two critical challenges: extremely limited training data and distribution shift.
We propose a novel DDLearn framework to solve this generalizable low-resource HAR problem by considering the diversity and discrimination in learning processes.
To solve the low-resource challenge, we propose to use the self-supervision technique to design an auxiliary task to learn the latent motion properties and expand the data space with diversity. We then propose a diversity preservation module.
We propose a discrimination enhancement module for the semantic consistency and discrimination of activities by pulling together the intra-class representations and pushing away the inter-class pairs with supervised contrastive learning.
We make comprehensive evaluations on three public sensor-based datasets, showing that DDLearn effectively improves the HAR performance under low-resource domain generalization scenarios while remaining generic, explainable, and flexible.

RELATED WORK
Human Activity Recognition (HAR). HAR focuses on classifying the different activities that happen in daily living. According to the type of data, HAR can mainly be divided into two categories: sensor-based and vision-based [8]. Vision-based HAR collects data with optical devices [51] but may encounter privacy problems; for example, sensitive information such as faces and bystanders may be accidentally captured on camera. Sensor-based HAR captures the activity data through environment-deployed sensors or wearable sensors. The wide usage of IMU-equipped wearable sensors makes it more convenient and practical to record activity data in people's daily life [17]. We therefore mainly focus on the wearable sensor-based HAR problem.
To solve sensor-based HAR, many machine learning-based methods [1,35] have been proposed, and with the fast development of deep learning techniques, the performance of HAR has been significantly improved [14,22]. Despite the great progress that has been achieved, most of these methods assume that the training data are sufficient and that the training and testing data are independent and identically distributed. However, low-resource data and distribution shift are two realistic and long-standing problems that hinder generalization to new unseen data in HAR.

Domain Generalization (DG). DG [45] is an emerging technique within the scope of transfer learning [49] and has become popular in recent years. DG aims to learn a robust and generalizable model from one or several source domains that have different probability distributions, so as to minimize the error on an unknown target domain. Different from domain adaptation [25,46,48], which assumes the availability of the target/test data in training, DG focuses on the scenario where the target data is absent during training, and the model is directly applied to the target data without re-training or fine-tuning. It is worth noting that although the testing set also cannot participate in training in traditional machine learning, that learning process assumes the data is IID, whereas DG considers the out-of-distribution (OOD) problem. This is more challenging but closer to real-life applications. In the computer vision community, DG is a popular topic and many effective domain generalization methods are emerging. Existing DG methods can be categorized into three branches [45]: data manipulation [27,34,52], representation learning [18,26,28,56], and learning strategy [5,23,29,55]. Although DG is thriving in the computer vision community, very few works have been proposed for human activity recognition tasks [36]. Recently, what is, to the best of our knowledge, the first work to solve the DG problem for HAR was proposed [32]; it disentangles domain-agnostic and domain-specific features. However, few existing methods try to solve the generalizable low-resource human activity recognition problem.

[Figure 2: The DDLearn framework. Recovered labels: Training, Inference, Feature Extractor, Activity Classifier, The Trained Model, Classification.]

Self-Supervised Learning (SSL). SSL [15,24,53] is a popular technique that helps alleviate the dependence on annotated data by learning representations under the supervision of self-defined pseudo labels. A popular and effective branch of SSL designs a different but related auxiliary task to pre-train the feature extractor for latent representations, which can improve the learning of the downstream HAR tasks [39]. Recently, contrastive self-supervised learning has been applied to HAR to improve performance. Based on the contrastive learning framework SimCLR [6], [19] combines it with a transformer-based encoder to solve sensor-based HAR, and [41] modifies SimCLR with time-series transformation functions. [33] explores several important components of existing contrastive learning algorithms when applying them to HAR tasks, from both algorithmic-level and task-level aspects.
The major difference between our approach and existing self-supervised learning-based HAR is that the latter rely on pre-training and fine-tuning, or on supervised training with the labeled target activity data, and do not consider OOD problems. Most of them assume the training and testing data are IID, and do not consider the more realistic situation in which the test data has distribution shifts relative to the training data and is unavailable during training. As aforementioned, directly applying them to unseen test data may lead to sharp performance degradation. Thus, this is an essential and urgent challenge to tackle, and a generalization-oriented method is important for learning more general representations for generalizable low-resource HAR.

PROPOSED APPROACH
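A domain can be defined as a joint probability distribution $P_{X,Y}$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ denote the instance space and label space, respectively [9,32]. We are given a labeled source domain $\mathcal{D}^s = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n_s}$ for training, where $n_s$ is small in the low-resource setting, and the goal is to generalize to an unseen target domain $\mathcal{D}^t = \{\mathbf{x}_j\}_{j=1}^{n_t}$ which can only be accessed in inference. All domains share the same feature and label spaces while having different probability distributions (joint distribution shift), i.e., $\mathcal{X}^s = \mathcal{X}^t$, $\mathcal{Y}^s = \mathcal{Y}^t$, $P^s(\mathbf{x}, y) \neq P^t(\mathbf{x}, y)$. Note that the generalizable low-resource setting is more challenging than the conventional setting due to the small training data size and distribution shifts.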

Motivation and Overall Framework
Two main issues, i.e., low-resource data and distribution shift, seriously impede the capability of machine learning for HAR. However, it is difficult and impractical to collect a large amount of data. For instance, massive data collection is not feasible for the elderly who have difficulty moving or for children who have difficulty following instructions. Meanwhile, sensor readings are easily affected by individual differences, leading to different data distributions between users even if they perform the same activities with the same kind of sensors. Enough data is important for training a good deep learning model [24], especially when the test data has a different distribution from the training data. Thus, given limited training data, it is essential and intuitive to utilize data augmentation to explore the invariant properties of data by expanding and completing the data space within the semantic scope [42]. In addition to data space expansion, it is also important to obtain diverse representations that have strong generalization capability while keeping the semantic discrimination capability needed for activity recognition.
In this paper, we propose a novel approach called Diverse and Discriminative representation Learning (DDLearn) for generalizable low-resource HAR. The core of DDLearn is to learn a model that has strong generalization capability to OOD data with limited training instances. As shown in Figure 2, DDLearn learns diversity in both data and feature space to overcome the limitation of low-resource training data. Specifically, DDLearn generates diversity through data augmentation and explores the invariant properties in the latent space by injecting the self-defined prior knowledge of augmentations into a self-supervised auxiliary task. To learn generalizable representations in feature space, we propose a diversity preservation module to preserve the representation diversity and a discrimination enhancement module for accurate recognition.
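The overall learning objective of DDLearn can be formulated as:
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{dg} + \lambda_2 \mathcal{L}_{dp} + \lambda_3 \mathcal{L}_{de},$$
where $\mathcal{L}_{cls}$ is the activity classification loss, $\mathcal{L}_{dg}$ is the loss of diversity generation, $\mathcal{L}_{dp}$ is the loss for diversity preservation, and $\mathcal{L}_{de}$ is the loss for discrimination enhancement learning. $\lambda_1$, $\lambda_2$ and $\lambda_3$ are trade-off hyper-parameters. In the following sections, we will elaborate on these learning modules.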
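As a minimal sketch of how the four terms described above could be combined, the hypothetical helper below forms a weighted sum of the classification, diversity generation, diversity preservation, and discrimination enhancement losses. The function name, argument names, and default weights are illustrative assumptions, not values reported in this paper.

```python
def total_loss(l_cls, l_gen, l_pres, l_disc, w_gen=1.0, w_pres=1.0, w_disc=1.0):
    """Weighted sum of the four loss terms (a sketch, not the released code).

    l_cls  : activity classification loss
    l_gen  : diversity generation (self-supervised auxiliary) loss
    l_pres : diversity preservation loss
    l_disc : discrimination enhancement loss
    w_*    : trade-off hyper-parameters (default values are assumptions)
    """
    return l_cls + w_gen * l_gen + w_pres * l_pres + w_disc * l_disc
```

Setting one of the weights to zero switches the corresponding module off, which is one simple way such an objective can support module-wise ablations.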

Diversity Generation
In the low-resource scenario, it is hard to learn a generalized model due to the limited representation capability of limited data. The distribution shift between the test and training data further worsens the performance. Therefore, we propose to generate diversity in both the data space and the representation space by expanding the data space with data augmentation and exploring the generalized and invariant properties with a self-supervised auxiliary task.
Wearable sensor data differs from images because of its temporal property and motion patterns, so the augmentation techniques should be well designed. Specifically, we employ seven data transformation techniques [42] to get the augmented data. These data transformations are: rotation, permutation, time-warping, scaling, magnitude warping, jittering, and random sampling. [38] verified that these transformations are feasible and practical for HAR. Details of these transformations are as follows.
Rotation: Rotate the data by an arbitrary angle that is sampled randomly from a uniform distribution. This transformation can simulate the various orientations of the sensors placed on different body locations when performing the same activity (i.e., different sensor orientations with the same label).
Permutation: Slice a window of data into N segments and randomly permute these segments to form a new window. This transformation may help to explore the permutation-invariant properties in learning.
Time-Warping: Perturb the location in the temporal dimension by smoothly and locally distorting the time intervals between samples. This may broaden the local diversity in the temporal dimension.
Scaling: Scale the magnitude of a window of data by multiplying it with a random scalar. This can enlarge the magnitude diversity of the entire samples with constant noise.
Magnitude Warping: Different from scaling, it warps the magnitude of a window of data with a smooth scalar around 1, which adds smoothly-varying noise to samples.
Jittering: This applies different noise to samples. It may also enlarge the diversity of data magnitude and push the model to be more robust against multiplicative and additive noise.
Random Sampling: This transformation is similar to the time-warping transformation, but it only uses subsamples for interpolation.
Motivated by existing work [38] that utilized sensor transformations to design a multi-task self-supervised network, we design a self-supervised auxiliary task with the original and seven types of transformed data. Differently, we design a single multi-class classification task to reduce the complexity of the model. Thus, we have two tasks: the original activity recognition task (i.e., $\mathcal{L}_{cls}$) and the self-supervised auxiliary task.
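To make some of the transformations above concrete, the following is a minimal NumPy sketch of four of them (jittering, scaling, permutation, and rotation) applied to a single window of shape (time steps, 3 axes). The function names, noise levels, and segment count are illustrative assumptions and are not taken from the paper or its released code.

```python
import numpy as np

def jitter(x, sigma=0.05, rng=None):
    """Add independent Gaussian noise to every reading (sigma is an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1, rng=None):
    """Multiply the whole window by one random scalar drawn around 1."""
    rng = np.random.default_rng() if rng is None else rng
    return x * rng.normal(1.0, sigma)

def permute(x, n_segments=4, rng=None):
    """Slice the window into segments along time and shuffle their order."""
    rng = np.random.default_rng() if rng is None else rng
    segments = np.array_split(x, n_segments, axis=0)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order], axis=0)

def rotate(x, rng=None):
    """Rotate triaxial readings about a random axis by a uniformly sampled angle."""
    rng = np.random.default_rng() if rng is None else rng
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(-np.pi, np.pi)
    # Cross-product matrix of the unit axis, used by Rodrigues' rotation formula.
    k = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    r = np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)
    return x @ r.T
```

Rotation changes only the orientation of each reading, so the per-sample magnitude is preserved, while permutation keeps every reading but changes its position in time.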
To explore the underlying properties of the generated sensor signals, the learning objective of the self-supervised auxiliary task is to classify which kind of transformation the input data belongs to (the original data is also regarded as a category here, i.e., 8 categories). The label space of the auxiliary task is denoted as $\mathcal{Y}_{aug}$. We adopt the standard cross-entropy loss for the self-supervision task to learn the self-supervision model $f_{aug}: \mathcal{X}_{all} \to \mathcal{Y}_{aug}$, where $\mathcal{X}_{all} = \mathcal{X}_{ori} \cup \mathcal{X}_{aug}$, formulated as:
$$\mathcal{L}_{dg} = -\sum_{k=1}^{K} y_k \log \sigma_k(f_{aug}(\mathbf{x})),$$
where $K$ is the number of classes in the self-supervision task, $y_k$ is the ground-truth class label of the self-supervision task, $\sigma_k(f_{aug}(\mathbf{x}))$ is the predicted probability and $\sigma$ is the softmax function.
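The auxiliary task can be sketched as follows, assuming the pseudo label 0 denotes the original data and labels 1 to 7 denote the seven transformations. Both helper names are hypothetical, and the loss is the standard softmax cross-entropy averaged over a batch.

```python
import numpy as np

def build_auxiliary_batch(x, transforms):
    """Stack original windows with their transformed copies.

    x          : array of shape (n, time, channels)
    transforms : list of functions mapping one window to one window
    Returns the stacked windows and pseudo labels (0 = original, k = k-th transform).
    """
    batches = [x] + [np.stack([t(w) for w in x]) for t in transforms]
    labels = np.concatenate([np.full(len(x), k) for k in range(len(batches))])
    return np.concatenate(batches, axis=0), labels

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With seven transformations the label space has eight classes, so a classifier that outputs uniform probabilities incurs a loss of log 8.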

Diversity Preservation
After diversity generation in the data space, we extract general features of the augmented and original data with a CNN and learn the task-oriented features with fully-connected layers. During the feature learning procedure, the representation space of the same category may be narrowed, and features gather closer under the effect of the classification objective. This may reduce the diversity gained from data augmentation. Hence, it is important to preserve the diversity of the learned features, i.e., to keep the augmented features distinguishable from the original ones such that the feature space is expanded. To this end, we focus on enlarging the distance between the original and the augmented feature spaces. Note that, with the regularization of the semantic discrimination enhancement (details will be introduced in the next section) and the activity classification loss, the distance can be held in a moderate range. For easy optimization, maximizing the domain distance is identical to minimizing the following:
$$\mathcal{L}_{dp} = -d(\mathcal{D}_{ori}, \mathcal{D}_{aug}),$$
where $d(\cdot, \cdot)$ is the distribution distance between the original domain $\mathcal{D}_{ori}$ and the augmented domain $\mathcal{D}_{aug}$. According to [3,10], the distribution divergence of two domains $\mathcal{D}_{ori}$ and $\mathcal{D}_{aug}$ can be defined as in Definition 3.1.
Definition 3.1 (Distribution distance). For data $\mathbf{x}$ and domains $\mathcal{D}_{ori}$ and $\mathcal{D}_{aug}$ over $\mathcal{X}$, the $\mathcal{H}$-divergence of the binary classifier set $\mathcal{H} = \{h\}$ between these two domains is
$$d_{\mathcal{H}}(\mathcal{D}_{ori}, \mathcal{D}_{aug}) = 2 \sup_{h \in \mathcal{H}} \left| \Pr_{\mathbf{x} \sim \mathcal{D}_{ori}}[h(\mathbf{x}) = 1] - \Pr_{\mathbf{x} \sim \mathcal{D}_{aug}}[h(\mathbf{x}) = 1] \right|,$$
where $\Pr$ denotes the classification prediction probability. Thus, the distribution distance between two domains can be approximated by the Proxy $\mathcal{A}$-distance [3], which is defined through the error of a linear binary classifier built to discriminate the two domains.
Denote the domain discrimination error as $\epsilon$ with a classifier $h$; then, the $\mathcal{A}$-distance can be defined as:
$$\hat{d}_{\mathcal{A}} = 2(1 - 2\epsilon).$$
We aim to maximize the divergence between the original and augmented data, which is identical to minimizing the error of the binary classifier. Thus, we propose to utilize a domain discriminator with a binary classifier to discriminate the original and augmented domains. By minimizing the domain discriminator error, the distance between the two domains is enlarged and the two domains can be discriminated. The diversity preservation objective is therefore to minimize the classification loss of the domain discriminator. We find that this discriminator does not need to be well trained; even under near-random conditions it has a good effect.
This helps prevent the features of the two domains from overlapping so much that diversity is lost. Note that DDLearn is not limited to a specific metric such as the loss of the domain discriminator. We can easily utilize other distance metrics and maximize them as alternatives to enlarge the distribution distance between the original and augmented spaces, such as Maximum Mean Discrepancy (MMD) [11] and Kullback-Leibler (KL) divergence [21] (refer to Sec. 4.7).
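The relation between the discriminator error and the domain distance can be illustrated with a short NumPy sketch. It assumes the standard Proxy A-distance form 2(1 - 2 * error) from the domain adaptation literature and a binary cross-entropy loss on discriminator logits, with label 0 for original and 1 for augmented features; the function names are hypothetical.

```python
import numpy as np

def domain_discriminator_loss(logits, is_augmented):
    """Binary cross-entropy on logits: 0 = original domain, 1 = augmented domain."""
    z = np.asarray(logits, dtype=float)
    y = np.asarray(is_augmented, dtype=float)
    # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))], s = sigmoid.
    return np.mean(np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z))))

def proxy_a_distance(error):
    """Proxy A-distance estimated from the domain discrimination error."""
    return 2.0 * (1.0 - 2.0 * error)
```

A discriminator at chance level (error 0.5) corresponds to a distance of 0, i.e., fully overlapping domains, while a perfect discriminator (error 0) corresponds to the maximum distance of 2. Lowering the discriminator loss therefore pushes the two feature distributions apart.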

Discrimination Enhancement
To achieve accurate activity recognition, it is important to enhance the semantic discrimination of representations. We aim to enlarge the inter-class distance (i.e., the distance between samples from different classes) and reduce the intra-class distance (i.e., the distance between samples from the same class) to achieve semantic discrimination enhancement.
We apply the supervised contrastive loss [20,47] to the original and augmented features to enhance the activity semantic discrimination, and regard every sample, not only the augmented ones, as a possible anchor. Supervised contrastive learning [20] first creates two random augmentations of the data, also called views (e.g., rotation, permutation). It randomly regards an augmented sample as the anchor; the positive samples are then the other samples that belong to the same class as the anchor, and the remaining samples are regarded as negative samples. Due to the presence of labels, supervised contrastive learning can help achieve the aforementioned intra-class pulling and inter-class pushing. Therefore, the enhancement loss is computed as:
$$\mathcal{L}_{de} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathbf{z}_i \cdot \mathbf{z}_p / \tau)}{\sum_{a \in A(i)} \exp(\mathbf{z}_i \cdot \mathbf{z}_a / \tau)},$$
where $I$ is the index set of the original and the augmented representations and index $i$ is the anchor. $A(i) \equiv I \setminus \{i\}$, $P(i) \equiv \{p \in A(i) : \tilde{y}_p = \tilde{y}_i\}$ is the set of indexes of the representations that are in the same activity class as the $i$-th representation, $\mathbf{z}_p$ is the positive representation, and $\tau \in \mathbb{R}^+$ is the scalar temperature. By applying the contrastive loss to original and augmented representations, representations of the same class are pulled closer and representations of different classes are pushed further away from each other. Thus, the semantic discrimination is enhanced to achieve accurate activity recognition.
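A minimal NumPy sketch of a supervised contrastive loss of the kind described above is given below. It assumes L2-normalized features, averages over anchors instead of summing (which only rescales the loss), and skips anchors that have no positive in the batch; the function name and the temperature default are illustrative assumptions.

```python
import numpy as np

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of representations.

    features : array of shape (n, d), original and augmented representations
    labels   : n activity labels; samples sharing a label are positives
    """
    labels = np.asarray(labels)
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    not_self = ~np.eye(len(labels), dtype=bool)
    # Exclude each anchor from its own denominator (the set A(i)).
    sim = np.where(not_self, sim, -np.inf)
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    # Positives P(i): other samples with the same label as the anchor.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1)
    return (per_anchor[has_pos] / n_pos[has_pos]).mean()
```

When same-class representations are already close and different classes are far apart, the loss is small; assigning the same features to mismatched labels makes it large, which is the pulling and pushing behavior the text describes.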

DDLearn for Low-Resource HAR
We introduce the training and inference of DDLearn for low-resource generalizable HAR. As shown in Algo. 1, we first conduct data augmentation with the aforementioned seven transformations to get the augmented data. The original and augmented data are concatenated into mini-batches with a 1:1 ratio as input and then fed into the network. Then, representations for both the original and the augmented data are learned by the feature extractor. With the diversity preservation module, the distance between the original and augmented spaces is enlarged to avoid their fusion. With discrimination enhancement, intra-class features are pulled closer and inter-class features are pushed away, so the semantic discrimination is enhanced. Subsequently, all features are fed to the main activity classifier and the auxiliary augmentation classifier.
As for inference, we directly apply the trained model to the unseen test data. Without data augmentation or fine-tuning, only the original test data are fed into the network to extract features, and only the activity classifier is applied to them for activity recognition.
Inference procedure (excerpt): 2: Get the predicted label $\hat{y} = \mathcal{M}(\mathbf{x})$; 3: end for; 4: Calculate the classification accuracy; 5: return classification results on the target HAR data.

EXPERIMENTS
We evaluate DDLearn via extensive experiments on the low-resource generalizable activity recognition problems to investigate the following research questions (RQs):

Setup
Datasets. We adopt three public activity recognition datasets. (1) DSADS. The UCI Daily and Sports Data Set [2] collects around 1.14M samples from 8 subjects. Three kinds of body-worn sensor units, including a triaxial accelerometer, a triaxial gyroscope, and a triaxial magnetometer, are worn on 5 body positions of each subject: torso, right arm, left arm, right leg, and left leg. It consists of 19 activities. The total signal duration is 5 minutes for each activity of each subject. (2) PAMAP2. The PAMAP2 Physical Activity Monitoring dataset [37] consists of around 2.84M samples collected from 9 subjects wearing 3 inertial measurement units (IMUs) on the hand, chest, and ankle, and a heart rate monitor (HRM). We utilize the data collected with the IMUs, including triaxial accelerometer (with the scale of ±16g, as officially recommended), gyroscope, and magnetometer data. We utilize 8 common activities from 8 subjects for evaluation: lying, sitting, standing, walking, ascending stairs, descending stairs, vacuum cleaning, and ironing. (3) USC-HAD. The USC Human Activity Dataset [54] collects around 2.81M samples from 14 subjects with a MotionNode device packed into a mobile phone pouch and attached to the front right hip of each subject. During data collection, triaxial accelerometer and triaxial gyroscope sensor data are captured. Each subject performs 12 activities in their own style. Detailed dataset information is in Appendix A.1.

Implementation Details
4.2.1 Data Pre-processing. To construct the domain-generalized activity recognition scenario, we divide subjects into several groups for leave-one-out validation, i.e., we regard one group of subjects' data as the target domain and the remaining groups as source domain data; each such split can be regarded as a task. We divide the 8 subjects into 4 groups for DSADS and PAMAP2, and divide the 14 subjects of USC-HAD into 5 groups, where each of groups 0-3 consists of 3 subjects and the last consists of 2 subjects. We further split each group's data into training, validation, and test data with a ratio of 6:2:2 and select the best model on the source validation set. This split also allows a comparison with the ideal situation of training directly on the target, which is less practical for real-life applications.
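The grouping described above can be sketched as follows. The helper names are hypothetical, and assigning consecutive subject indices to each group is an assumption; the text only specifies the group sizes.

```python
def make_groups(n_subjects, n_groups):
    """Split subject ids 0..n_subjects-1 into consecutive groups.

    Earlier groups receive the extra subjects when the split is uneven.
    """
    base, extra = divmod(n_subjects, n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

def leave_one_group_out(groups):
    """Return one (source_subjects, target_subjects) task per held-out group."""
    tasks = []
    for t, target in enumerate(groups):
        source = [s for g, grp in enumerate(groups) if g != t for s in grp]
        tasks.append((source, target))
    return tasks
```

For example, `make_groups(14, 5)` yields group sizes 3, 3, 3, 3, 2, matching the USC-HAD setting, and `leave_one_group_out(make_groups(8, 4))` yields the four tasks used for DSADS and PAMAP2.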
We randomly sample 20%-100% of the training data, in steps of 20%, as the training set to evaluate the influence of the training data size and to simulate the low-resource setting. For testing, we evaluate the trained model on the test set of the target domain. Details of the sliding-window pre-processing of each sample are in Appendix A.1.
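A minimal sketch of the training-set subsampling is shown below, assuming uniform sampling without replacement over sample indices. The function name and the seed handling are illustrative assumptions; whether the original sampling is stratified by class is not stated here.

```python
import numpy as np

def subsample_indices(n_samples, fraction, seed=0):
    """Pick a random subset of training indices of the given fraction."""
    rng = np.random.default_rng(seed)
    k = int(round(n_samples * fraction))
    return np.sort(rng.choice(n_samples, size=k, replace=False))

# Training-set sizes for the five proportions used in the experiments.
sizes = {f: len(subsample_indices(1000, f)) for f in (0.2, 0.4, 0.6, 0.8, 1.0)}
```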

Network Architecture and Training. We directly follow [33] to reproduce SimCLR, since [33] has applied SimCLR to domain generalization HAR tasks with sensor data transformations. The other methods are popular domain generalization methods, and we apply them to HAR with the same backbone network (also a CNN architecture) for a fair comparison, following DomainBed [12]. The feature extractor consists of conv2d layers (two conv2d layers with a kernel size of (1, 9) for DSADS and PAMAP2, and three conv2d layers with a kernel size of (1, 6) for USC-HAD), each followed by a ReLU and a maxpool2d operation, and then a fully connected layer for higher-level feature extraction. The output feature dimension is 64 for DSADS and PAMAP2 and 128 for USC-HAD. We utilize a fully connected layer as the classifier, which takes the features as input and outputs logits whose dimension equals the number of classes. By applying a softmax operation, we get the prediction probabilities of the classes, which sum to 1. We set the learning rate to 0.0008 and use a batch size of 64 for the original data and 64 for the augmented data in our method. The Adam optimizer is used for training. DDLearn is implemented with PyTorch and trained on an RTX 3090 GPU. We repeat each experiment 3 times with 3 different random seeds and report the mean and standard deviation.

Overall Performance (RQ1)
Following the aforementioned division of subjects' data, each group is regarded as the testing target domain in turn and indexed with {0, 1, ...}, and the remaining groups together are regarded as the source domain. We evaluate the low-resource regime in which only 20% of the training data is used. The classification accuracy on three public datasets is shown in Table 1. From the results, we observe that DDLearn significantly improves the classification accuracy by 11.93% on the DSADS dataset, 9.38% on the PAMAP2 dataset, and 7.19% on the USC-HAD dataset, i.e., an average improvement of 9.5% on three datasets. This indicates that our method achieves accurate activity recognition and generalizes well to unseen new data without extra fine-tuning, and thus can address the generalizable low-resource HAR challenges. Besides, we observe that on more challenging tasks where other methods suffer degraded accuracy, such as the first task of the PAMAP2 dataset, our method still significantly improves the performance by 12.27%. This indicates that our method is more robust on difficult tasks. Compared with directly applying the contrastive self-supervised learning method SimCLR to generalizable HAR, the proposed method performs better, which indicates that diverse and discriminative representation learning is effective for generalization. The standard deviation of DDLearn over three trials is relatively small compared with the others, implying that our method is stable. We also conduct an experiment that pre-trains a model on the source training data and then fine-tunes it on the target data. The accuracy improves significantly after fine-tuning.
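As a quick check, the reported average follows directly from the three per-dataset improvements stated above:

```latex
\[
\frac{11.93 + 9.38 + 7.19}{3} = \frac{28.50}{3} = 9.50 .
\]
```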
Table 1: Classification accuracy (%) (± standard deviation) on three public datasets in the low-resource setting with only 20% of the training data. The best and the second-best results are marked with bold and underline, respectively.
Columns: Target, ERM [44], Mixup [50], MLDG [23], RSC [16], AND-Mask [31], SimCLR [6], Fish [40], Ours. Rows: target tasks (T0, ...) for each dataset, starting with DSADS.

As the proportion of training data decreases, the accuracy of all methods tends to decline. Specifically, our DDLearn suffers performance drops of 4.07%, 3.36%, and 5.15% on the three datasets when the training data proportion drops from 100% to 20%, whereas the drops are 8.29%, 7.98%, and 14.05% for the second-best baselines. This indicates the superiority of our approach. Besides, Figure 3 implies that, as the data is reduced, the improvement becomes more obvious compared with both the baseline ERM and the second-best baselines. In particular, the improvement reaches 26.5% and 11.93% compared with ERM and the best comparison method on DSADS, respectively. This is because DSADS has the smallest number of samples among the three datasets. These observations illustrate that our method is superior and more robust when faced with low-resource generalizable HAR problems.

Ablation Study and Interpretation (RQ3)
We conduct quantitative and qualitative analysis to investigate the contribution of each module in DDLearn and more importantly, to interpret why DDLearn is effective for low-resource generalizable HAR problems.We compare the complete version DDLearn with five variants, i.e., ERM; ERM+Augmentation (ERM+A) which uses   the original and augmentation data to train a deep model with ERM; ERM+Diversity Generation (ERM+DG); ERM+Diversity Genera-tion+Diversity Preservation (ERM+DG+DP); and ERM+Diversity Generalization+Discrimination Enhancement (ERM+DG+DE).
Quantitative ablation results.Results are shown in Figure 4. Comparing ERM and ERM+A, we see that data augmentation can help improve the activity recognition accuracy, especially on the DSADS dataset where there is a significant improvement of 9.21%.This is because DSADS has the smallest number of samples among the three datasets, indicating that the augmentation is more effective for the low-resource scenario.With the diversity generation module, the performance of the model is highly improved by 3.21% to 7.17%.This indicates the self-supervised auxiliary task can help a lot in exploring latent diversified properties.By adding diversity preservation and discrimination enhancement, the classification accuracy is further improved, which illustrates that both modules make contributions to the HAR task.Combining all modules, the proposed DDLearn can achieve the best performance for the lowresource generalizable HAR tasks.
Qualitative ablation results. For a more interpretable analysis, we further visualize the feature embeddings to show the effect of each component in DDLearn using t-distributed stochastic neighbor embedding (t-SNE) [43] on the PAMAP2 dataset with 100% data. From Figure 5, we have the following observations: (1) Comparing ERM+A with ERM, the data space is largely expanded by data augmentation, and the gaps between intra- and inter-class samples are filled with diversified augmented data. This indicates that augmentation can expand the representation space with more diversity. (2) As shown in Figure 5(c), when adding the self-supervised auxiliary task, the class discrimination is enhanced. This might be due to the exploration of some latent characteristics of activities. (3) Figure 5(d) shows that by introducing the domain discrimination between the original and augmented data, these two domains are pulled apart so that they do not overlap, which preserves the diversity. (4) From Figure 5(e), by using contrastive learning, the margins between different classes become larger and the representations are more discriminative. (5) Combining all components, DDLearn attains representations that are more diverse and more semantically discriminative, which leads to accurate activity recognition.
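Observation (4) reflects the supervised contrastive objective used for discrimination enhancement. The following is a minimal sketch of the widely used supervised contrastive (SupCon) loss on a toy batch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value, and the toy features are hypothetical, and the exact form of Eq. (6) may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """SupCon-style loss: pull same-class samples together, push others apart.

    Assumption: temperature=0.1 is an illustrative value, not taken from the paper.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without any positive is skipped
        # Scaled cosine similarity between the anchor and every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D features (made up): two tight clusters.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_separated = supervised_contrastive_loss(toy_features, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(toy_features, [0, 1, 0, 1])
print(f"labels match clusters: {loss_separated:.4f}")
print(f"labels cut across clusters: {loss_mixed:.4f}")
```

When the labels agree with the clusters, the loss is close to zero; when they cut across the clusters, it is large. Minimizing such a loss therefore encourages the larger inter-class margins seen in the visualization.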

Case Study by Class-Wise Analysis (RQ4)
We further evaluate the classification performance on each activity as a case study. We use the confusion matrix and the class-wise precision, recall, and F1-score for a fine-grained analysis. We compare our approach with the baseline method ERM [44] and a state-of-the-art method, Mixup [50], on the first task of the PAMAP2 dataset with 100% training data.
The confusion matrices are shown in Figure 6, and Figures 7(a)-7(c) present the F1-score, precision, and recall of each class. From the confusion matrices, we observe that directly minimizing the classification error on the training data with ERM may lead to degraded performance on the unseen test data. Mixup generalizes better thanks to data augmentation but performs less satisfactorily on certain activities such as sitting and standing. This may be because it only enlarges the data diversity and lacks semantic discrimination capability. DDLearn improves the poorly performing classes by considering both diversity and discrimination learning. Thus, as shown in Figures 7(a)-7(c), our approach achieves the best F1-score, precision, and recall on most activities. Specifically, it significantly improves the performance on the difficult category standing, which is often misclassified by the other approaches. This case study demonstrates the effectiveness of DDLearn.
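For readers who want to reproduce this kind of class-wise analysis, the sketch below derives precision, recall, and F1-score for each class from a confusion matrix. The function name and the toy counts are hypothetical and are not results from the paper; rows are assumed to index the true class and columns the predicted class.

```python
def classwise_metrics(confusion):
    """Return a list of (precision, recall, f1) tuples, one per class.

    confusion[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])  # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Toy 3-class example with made-up counts.
toy_labels = ["sitting", "standing", "walking"]
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
toy_metrics = classwise_metrics(toy_confusion)
for label, (p, r, f1) in zip(toy_labels, toy_metrics):
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A class such as standing that is frequently confused with its neighbors shows up as a row with large off-diagonal counts, which lowers its recall and, in turn, its F1-score.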

Compatibility (RQ5)
What about the compatibility of DDLearn? As mentioned in Section 3.3, DDLearn is a general framework and can be feasibly extended with other distance metrics for diversity preservation. In this section, we replace the domain discriminator with two typical distance metrics, namely Maximum Mean Discrepancy (MMD) [11] and Kullback-Leibler (KL) divergence [21]. MMD calculates the discrepancy between embeddings in a Reproducing Kernel Hilbert Space (RKHS) and can be regarded as a technique to measure the distribution distance between two domains. KL divergence is also a common technique to measure domain similarity.
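To make the two metrics concrete, the sketch below computes a biased estimate of the squared MMD with an RBF kernel and the KL divergence between two discrete distributions. The kernel choice, the bandwidth `gamma`, the function names, and the toy data are assumptions made for illustration; the paper's implementation may use different settings.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel value exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two sample sets."""

    def mean_kernel(a, b):
        return sum(rbf_kernel(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


# Toy embeddings (made up): an "original" domain and two "augmented" domains.
original = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]]
augmented_near = [[0.05, 0.05], [0.15, 0.2], [0.2, 0.15]]
augmented_far = [[3.0, 3.0], [3.1, 3.2], [3.2, 3.1]]

mmd_near = mmd_squared(original, augmented_near)
mmd_far = mmd_squared(original, augmented_far)
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_diff = kl_divergence([0.5, 0.5], [0.9, 0.1])
print(f"MMD^2 near={mmd_near:.4f} far={mmd_far:.4f}")
print(f"KL same={kl_same:.4f} different={kl_diff:.4f}")
```

Both quantities grow as the two domains move apart, which is why either one can stand in for the domain discriminator when the goal is to keep the original and augmented feature distributions separated.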
As shown in Figure 7(d), DDLearn obtains similar results when minimizing the domain discriminator loss is replaced with maximizing MMD or KL divergence, and these results are clearly better than those of the previous comparison methods. This indicates that DDLearn can be feasibly extended with different distance metrics. Since it achieves the best average accuracy on the three datasets with the domain discriminator, we mainly use the discriminator in our method. Additionally, we provide a parameter sensitivity analysis. We focus on the three key trade-off hyper-parameters, one for each learning module: the hyper-parameter of the diversity generation module, chosen from {0.01, 0.1, 1, 10}; the hyper-parameter of the diversity preservation module, chosen from {0.01, 0.1, 1, 10}; and the hyper-parameter of the discrimination enhancement module, chosen from {0.1, 0.5, 1, 5, 10}. As shown in Figure 8, DDLearn performs robustly over a wide range of hyper-parameter values on the three public datasets.
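A sensitivity analysis of this kind is commonly run as a one-at-a-time sweep: one hyper-parameter is varied over its candidate set while the others stay at a reference value. The sketch below enumerates such configurations. The candidate sets come from the text, whereas the parameter names, the reference value of 1, and the one-at-a-time protocol itself are assumptions, since the exact procedure is not described here.

```python
# Candidate values as listed in the text; the key names are hypothetical.
CANDIDATES = {
    "w_diversity_generation": [0.01, 0.1, 1, 10],
    "w_diversity_preservation": [0.01, 0.1, 1, 10],
    "w_discrimination_enhancement": [0.1, 0.5, 1, 5, 10],
}
# Assumed reference setting: every trade-off hyper-parameter equals 1.
DEFAULTS = {name: 1 for name in CANDIDATES}


def one_at_a_time_configs(candidates, defaults):
    """Vary one hyper-parameter at a time, keeping the others at their defaults."""
    configs = []
    for name, values in candidates.items():
        for value in values:
            config = dict(defaults)
            config[name] = value
            configs.append(config)
    return configs


sweep = one_at_a_time_configs(CANDIDATES, DEFAULTS)
for config in sweep:
    print(config)
```

Each configuration would then be passed to a training run, and the resulting accuracies plotted against the varied value, one curve per hyper-parameter.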

CONCLUSIONS AND FUTURE WORK
In this paper, we proposed DDLearn for low-resource generalizable HAR. DDLearn generates diversity in the data space and explores the latent activity properties. Feature diversity is then preserved by enlarging the distribution divergence between the original and the augmented domains. By utilizing supervised contrastive learning, DDLearn also enhances the semantic discrimination of the features. In comprehensive experiments, DDLearn significantly outperformed SOTA methods while being a generic, explainable, and flexible framework.
In the future, we will use DDLearn to assist in mining deep correlations between sensor-based data and motion-related diseases, as well as in optimizing wearable healthcare applications. Additionally, we will further equip DDLearn with federated learning to address privacy issues in safety-critical and privacy-related applications.

Figure 1: Low-resource and distribution shift problems in HAR. Sensor readings are from the walking activity in the DSADS dataset. (a) Low-resource: different sensor readings are collected from the same subject at different times. (b) Distribution shift: different sensor readings are collected from different subjects.

Figure 2: The training and inference procedures of DDLearn.

Algorithm 1: DDLearn for low-resource activity recognition

Training:
Input: The training domain data and the trade-off hyper-parameters.
Output: The trained model M.
1: Randomly initialize the model parameters;
2: Conduct data augmentation with data transformation techniques and get the augmented data;
3: while not converged do
4:   Sample a mini-batch B from the original and augmented data and concatenate the two parts;
5:   Extract features with the feature extractor;
6:   Learn diversity preservation with Eq. (3) and get the corresponding loss;
7:   Learn discrimination enhancement via Eq. (6) and get the corresponding loss;
8:   Learn the auxiliary task classifier and calculate the self-supervised task loss according to Eq. (2);
9:   Get the classification loss for the main activity classifier;
10:  Calculate the total loss of DDLearn according to Eq. (1);
11:  Update the model parameters using the Adam optimizer;
12: end while

Inference:
Input: The trained model M and the target domain data.
Output: Classification results on the test domain.
1: for each sample in the target domain data do
2: ...

RQ1 (Accuracy): What is the effectiveness of DDLearn on public datasets?
RQ2 (Robustness): Is DDLearn robust in different levels of low-resource settings, i.e., different sizes of training data?
RQ3 (Interpretability): What are the contributions of each component in DDLearn and how to interpret their importance?
RQ4 (Case study): How does DDLearn improve the performance on each activity?
RQ5 (Compatibility): What about the compatibility of DDLearn, or can it be a flexible framework?

Figure 3: Improvement over ERM (baseline) and over the second-best result with different percentages of training data. (a) Improvement over ERM. (b) Improvement over the second-best baseline.

Figure 5: Visualization of the t-SNE embeddings of the PAMAP2 dataset. We randomly select the same amounts of original and augmented data. Each class is denoted by a color. The original and augmented domains are denoted by dot and plus shapes, respectively. The classes are denoted by numbers: lying, sitting, standing, walking, ascending stairs, descending stairs, vacuum cleaning, and ironing. Best viewed in color and zoomed in.

Figure 8: Parameter sensitivity analysis.

=1 with   instances as the training set. Note that the number of training instances is much smaller than the sample size of a normal training set and is not enough to train a robust model. The goal is to learn a model X → Y on the existing subjects' data that generalizes well to an unseen test domain (i.e., new subjects' data).

Table 2: Classification accuracy (%) on three public datasets with different percentages (%) of training data.

The results are shown in Table 2. Results demonstrate that with different percentages of training data, the proposed DDLearn can attain the best results with significant accuracy improvement over other baselines. As the amount of data continues

Table 3: Statistical information summary of three public activity recognition datasets.

Table 4: Summary of data pre-processing settings.