Unleashing the Power of Shared Label Structures for Human Activity Recognition

Current human activity recognition (HAR) techniques regard activity labels as integer class IDs without explicitly modeling the semantics of class labels. We observe that different activity names often have shared structures. For example, "open door" and "open fridge" both have "open" as the action; "kicking soccer ball" and "playing tennis ball" both have "ball" as the object. Such shared structures in label names translate to similarity in the sensory data, and modeling common structures helps uncover knowledge across different activities, especially for activities with limited samples. In this paper, we propose SHARE, a HAR framework that takes into account shared structures of label names for different activities. To exploit the shared structures, SHARE comprises an encoder for extracting features from input sensory time series and a decoder for generating label names as a token sequence. We also propose three label augmentation techniques to help the model more effectively capture semantic structures across activities, including a basic token-level augmentation, and two enhanced embedding-level and sequence-level augmentations utilizing the capabilities of pre-trained models. SHARE outperforms state-of-the-art HAR models in extensive experiments on seven HAR benchmark datasets. We also evaluate SHARE in few-shot learning and label imbalance settings and observe an even more significant performance gap.


INTRODUCTION
Sensor-based human activity recognition (HAR) identifies human activities using sensor readings from wearable devices. HAR has a variety of applications including healthcare, motion tracking, smart home automation, and human-computer interaction [5,8,18,25,30,50]. For example, acceleration sensors attached to legs record subjects walking around and performing daily activities for gait analysis of Parkinson's disease patients [2]; accelerometers and gyroscopes can monitor user postures to detect falls for elderly people [47].
While tremendously valuable, HAR data remain difficult to collect due to security or privacy concerns, as human subjects involved in the collection process may not consent to data sharing or data transmission over the network. This often leads to local training at the edge using limited samples from just a few human subjects. Additionally, certain types of human activities happen less frequently by nature, further complicating data collection.
We note that existing HAR methods treat labels simply as integer class IDs and learn their semantics purely from annotated sensor data. This is less effective, especially when labeled data are limited. To achieve better recognition performance, prior research has mostly concentrated on designing better feature extraction modules [14,24,33,36] while largely overlooking the advantages of modeling label structures. Since sensory readings measuring human activities are time-series data, existing time-series classification models are also applicable to HAR. These methods, however, are also primarily focused on enhancing feature extraction [9,12,52]. Notably, both HAR and time-series classification methods in the literature miss the modeling of label name structures.
We argue that a more effective approach to learning activity semantics is through label name modeling, as activity names in HAR datasets often share structures that reflect the similarity between different activities. For example, both "open door" and "open fridge" (sharing the action "open") involve pulling a (fridge) door around a hinge (while "open door" first rotates the knob to release the lock and "open fridge" directly pulls the handle); both "stairs up" and "stairs down" (sharing the object "stairs") need to bend the knees and extend the legs. Figure 2a illustrates more examples of activity label names in typical HAR datasets (e.g., "eat pasta" and "eat sandwich", "elevator up" and "elevator down"). The common actions or objects in these examples translate to similarities in the IMU data space.
Figure 1: Existing HAR framework vs. SHARE. SHARE exploits shared structures in label names and generates activity name sequences as predictions, rather than predicting integer class IDs. We also design three label augmentations at different levels to better capture shared structures.
As shown in Figure 2b, we apply t-SNE visualization to sensor readings from the Opportunity dataset [38]. We color different activities by common actions or objects. Activities of the same color (sharing the same action or object in label names) appear closer in the embedding space, indicating stronger similarity in the original sensory measurements. Such mapping between input features and label names motivates us to design a more effective learning framework that extracts knowledge from label structures.
To this end, we propose SHARE, shown in Figure 1, which models both input sensory data features and label name structures. SHARE comprises an encoder for extracting features from sensory input and a decoder for predicting label names. Unlike existing HAR models that output integer class IDs as prediction results, SHARE outputs label name sequences, thus preserving structures among various activities and providing a global view of activity relationships. During training, we optimize the model by minimizing the differences between predicted label names and ground-truth label names. During inference, we exploit a constrained decoding method to produce only valid labels.
We also design three label augmentation methods at different levels to better capture shared structures across activities. The basic token-level augmentation randomly replaces the original label sequences by their meaningful tokens (e.g., all actions of "eat X" are treated as a class of "eat"). This happens only during training and helps the model consolidate semantics of shared structures across different activities. We further develop two embedding- and sequence-level augmentations leveraging pre-trained models. At the embedding level, we integrate pre-trained word embeddings to capture shared semantic meanings not obvious in label names (e.g., the similarity between "walk" and "run"). At the label sequence level, for HAR datasets that do not have shared structures in their original labels, we offer an automated label generation method that leverages large language models to generate new labels with shared tokens while preserving the same semantic meanings. Specifically, we use OpenAI's GPT-4 to extend atomic, non-overlapping label names into sequences of meaningful tokens. To the best of our knowledge, SHARE is the first solution to HAR classification via decoding label sequences. We evaluate SHARE on seven HAR benchmark datasets and observe new state-of-the-art performance. We summarize our main contributions as follows:
• We find that shared structures in label names map to similarity in the input data, leading to a more effective HAR framework, SHARE, that models label structures. SHARE captures knowledge across activities by uncovering information from label structures.
• We propose three label augmentation methods, each targeting a different level, to more effectively identify shared structures across activities. These include a basic token-level augmentation and two pre-trained model-enhanced augmentations at the embedding level and at the label sequence level.
• We evaluate SHARE on seven HAR benchmark datasets and observe new state-of-the-art performance. We also conduct experiments under few-shot and label imbalance settings and observe even more significant performance improvements.

RELATED WORK

Human Activity Recognition
Existing HAR approaches can be categorized into statistical methods and deep learning-based methods [6,56]. Traditional methods are based on data dimensionality reduction, spectral feature transformation (e.g., Fourier transformation), kernel embeddings [34], first-order logic [51], or handcrafted statistical feature extraction (e.g., mean, variance, maximum, minimum) [15]. These features are then used as input to shallow machine learning methods like SVM and Random Forest. In recent years, deep learning methods have advanced automatic feature extraction and have begun to substitute hand-crafted feature engineering in HAR [14,16,50], including convolutional neural networks, recurrent neural networks, attention mechanisms, and their combinations. DeepConvLSTM [33] is composed of convolutional layers as feature extractors and recurrent layers for capturing temporal dynamics of the feature representations. MA-CNN [36] designs a modality-specific architecture to first learn sensor-specific information and then unify different representations for activity recognition. SenseHAR [20] proposes a sensor fusion model that maps raw sensory readings to a virtual activity sensor, which is a shared low-dimensional latent space. AttnSense [29] further integrates an attention mechanism into convolutional neural network and gated recurrent unit networks. THAT [24] proposes a two-stream convolution-augmented Transformer model for capturing range-based patterns. We note that these models focus on designing more effective feature extractors for better performance but neglect the semantic information in label names, which is the focus of this work.

Time-Series Classification
HAR data are time-stamped sensory series, enabling the use of time-series classification methods. Existing time-series classification models fall into two categories: statistical methods and deep learning methods. Statistical methods are based on nearest neighbors [3,41], dictionary classifiers [39], ensemble classifiers [27,40], etc. These statistical methods are more robust to data scarcity but do not scale well when the number of features in high-dimensional space becomes huge. On the other hand, deep learning methods can extract features from high-dimensional data but require abundant data points to train an effective model. Convolutional networks (FCN and ResNet) [19,45] and recurrent neural networks [21,22] show better performance compared with statistical methods. TapNet [58] is an attentional prototype network that calculates the distance to class prototypes to learn feature representations. ShapeNet [26] performs shapelet selection by embedding shapelet candidates into a unified space and trains the network with a cluster-wise triplet loss. SimTSC [53] formulates time-series classification as a graph node classification problem and uses a graph neural network to model similarity information. Recently, Rocket [12] applies a large number of random convolution kernels for data transformation and attains state-of-the-art accuracy.
MiniRocket [13] maintains the accuracy and improves the processing time of Rocket. TST [52] and TARNet [10] incorporate unsupervised representation learning, which offers benefits over fully supervised methods on downstream classification tasks. Similar to existing HAR methods, time-series classification models focus on designing more advanced feature extraction or unsupervised representation learning methods without taking into account the label semantics, whereas SHARE models the shared structures in the label set for more effective representation learning.

Label Semantics Modeling
Given label name semantics as prior knowledge, classification tasks could benefit from modeling such semantics through knowledge graphs [43] or textual information [23,35,54,59]. Tong et al. [42] exploit knowledge from video action recognition models to construct an informative semantic space that relates seen and unseen activity classes. Recent works designed specifically for zero-shot learning in human activity recognition also combine semantic embeddings [31,44,49]. However, these works mostly calculate mean embeddings for labels with multiple words, which misses label structures and is suboptimal. Unlike these works, SHARE preserves label structures and enables knowledge sharing through decoding label names for generic HAR.

PRELIMINARY
We focus on human activity recognition, such as walking and sitting, captured by sensory time-series data in a given time period. We formulate the HAR settings of conventional methods and SHARE.

Conventional HAR. We denote the HAR dataset in conventional methods as D' = {(x_i, c_i)}_{i=1}^{N}, x_i ∼ X, c_i ∼ C, where X and C denote the input space and the label space. Each time-series input sample is denoted as x_i ∈ R^{T×V}, where T is the length of the time series and V is the number of measured variables.

SHARE. We denote the HAR dataset in SHARE as D = {(x_i, y_i)}_{i=1}^{N}, where the data space X is the same as in conventional HAR methods, and Y denotes the label space in SHARE. We denote y_i = [y_i^1, y_i^2, ..., y_i^{L_i}] as a sample human activity label sequence, where L_i is the length of the label sequence y_i. For example, the label "walk upstairs" contains a word sequence of length two, ["walk", "upstairs"]. The label space Y contains C classes and K tokens. Instead of presenting labels as independent integer IDs, there exist shared structures across different labels in the label space Y. For example, "walk upstairs" and "walk downstairs" both have "walk" in their label names. Formally, there exist labels y_i, y_j, i ≠ j, that share the same word, y_i^m = y_j^n, for some positions m and n.
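To make the problem setting concrete, the sketch below (plain Python; the label names and stop-token list are illustrative, not from any specific dataset) represents labels as token sequences rather than integer IDs and enumerates the label pairs that share structure:

```python
# Labels are token sequences, not integer IDs. Label names and the
# stop-token list below are illustrative.
STOP_TOKENS = {"1", "2", "3"}

def tokenize(label):
    """Split a label name into its meaningful tokens (drop numbers/stop words)."""
    return [t for t in label.split() if t not in STOP_TOKENS]

labels = ["walk upstairs", "walk downstairs", "open door 1", "open fridge"]
sequences = {y: tokenize(y) for y in labels}

# Shared structure: pairs of distinct labels sharing at least one token.
shared = {(a, b) for a in labels for b in labels
          if a < b and set(sequences[a]) & set(sequences[b])}
print(sorted(shared))
```

Here "walk upstairs"/"walk downstairs" share the action "walk" and "open door 1"/"open fridge" share the action "open", which is exactly the structure SHARE exploits.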

Figure 3: Overview of SHARE with token-level, embedding-level, and sequence-level augmentations. We encode the time-series features and decode the label sequences as predictions. We further design three augmentation methods at different levels to better capture the shared semantic structures.

METHODOLOGY
We design a label structure decoding architecture for HAR, called SHARE, that exploits label structures and promotes knowledge sharing across activities. SHARE consists of two modules: a Time-Series Encoder and a Label Structure-Constrained Decoder. We pass multivariate sensory readings as input to the encoder and use the extracted feature vector to initialize the hidden states of the decoder. The decoder generates an activity name sequence (e.g., "climb stairs") as the prediction label. By binding sensory features with label structures, the structures in label names help the model better learn the similarity in the sensory data. We further propose three augmentation methods, including one basic token-level augmentation (randomly selecting from "climb", "stairs", "climb stairs") and two pre-trained model-enhanced augmentations at the embedding level (using pre-trained embeddings to initialize the "climb" and "stairs" word embeddings) and the label sequence level (rephrasing "climb stairs" as "leg up", which shares more tokens with other label names), to better capture shared structures across different activities. We summarize the pipeline of SHARE in Figure 3 and Algorithm 1.

Time-Series Encoder
We use f_θ : X → Z ⊂ R^d, parameterized by θ, to denote the time-series encoder. This part appears in both conventional HAR and SHARE. The encoder maps data from the input space X to the d-dimensional hidden space Z. For conventional HAR, the final predictions are obtained from the hidden representations after a fully connected layer fc. Denote ĉ_i = fc(f_θ(x_i)) ∈ R^C as the distribution of the predicted label. Optimization is based on the cross-entropy loss between the prediction ĉ_i and the ground truth c_i: L = -Σ_{k=1}^{C} 1[c_i = k] log ĉ_i[k]. In SHARE, the encoded representations z_i = f_θ(x_i) are used to initialize hidden states of the decoder, instead of being directly used for classification. This transfers learned representations from the encoder to inform the structured decoding process. To instantiate the time-series encoder, we keep both efficacy and efficiency in mind, given that HAR models usually run on edge devices with limited compute. We therefore use one-dimensional Convolutional Neural Networks (CNNs), as they are relatively lightweight with superior capability in extracting time-series features [11,36,45,57].
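As an illustration of this design, here is a minimal NumPy sketch of a two-layer 1D-CNN encoder with global average pooling. The kernel size of 3 matches the setup described in the experiments, but the channel sizes, random initialization, and pooling choice are simplifying assumptions rather than the paper's exact architecture:

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1D convolution over time with ReLU.
    x: (T, V) window; w: (k, V, d) kernels; b: (d,) bias."""
    k = w.shape[0]
    out = np.stack([np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
                    for t in range(x.shape[0] - k + 1)])
    return np.maximum(out, 0.0)

def encode(x, layers):
    """Stacked conv layers, then global average pooling -> z in R^d,
    which SHARE uses to initialize the decoder's hidden states."""
    h = x
    for w, b in layers:
        h = conv1d(h, w, b)
    return h.mean(axis=0)

rng = np.random.default_rng(0)
T, V, d = 32, 6, 8  # window length, sensor channels, hidden size (illustrative)
layers = [(rng.normal(size=(3, V, d)) * 0.1, np.zeros(d)),
          (rng.normal(size=(3, d, d)) * 0.1, np.zeros(d))]
z = encode(rng.normal(size=(T, V)), layers)
print(z.shape)  # (8,)
```

In SHARE, this z would then be passed through two linear layers to initialize the LSTM decoder's hidden and cell states.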

Label Structure-Constrained Decoder
We use g_φ : Z → Y, parameterized by φ, to denote the label structure-constrained decoder in SHARE. The decoder generates word sequences in the label space Y given the encoded representations as initialization of the decoder hidden states. Following our notation in Section 3 (Problem Setting), we further require that each label name sequence starts with a start token ⟨sos⟩ and ends with an end token ⟨eos⟩. Specifically, y_i = [y_i^0, y_i^1, ..., y_i^{L_i}, y_i^{L_i+1}], where y_i^0 = ⟨sos⟩ and y_i^{L_i+1} = ⟨eos⟩. Decoding the token ⟨eos⟩ means that we reach the end of the label sequence. At each decoding step, we estimate the conditional probability of decoding label y_i from x_i, given the encoded representations z_i from the encoder, as: p(y_i | x_i) = Π_{t=1}^{L_i+1} p(y_i^t | y_i^0, ..., y_i^{t-1}, z_i).

Training. During the training of SHARE, we adopt the teacher forcing strategy [48], where the ground-truth label token y_i^t at each decoding step t is used as input to be conditioned on for predictions at decoding step t+1. Teacher forcing improves convergence speed and stability during training. We optimize SHARE based on the cross-entropy loss between the predicted label sequence ŷ_i and the ground-truth label sequence y_i: L = -Σ_{t=1}^{L_i+1} log ŷ_i^t[y_i^t], where ŷ_i^t ∈ R^K denotes the distribution of the t-th predicted token of ŷ_i over the K tokens.
Inference with Constrained Decoding. During inference, the predicted label token ŷ_i^t from the current decoding step t is used as input to be conditioned on for predicting tokens at step t+1. In typical natural language processing tasks, e.g., machine translation, it is common to decode the sequence using beam search during inference. However, beam search would not work properly here, as it only tracks a pre-defined number of best partial solutions as candidates during decoding, and the final predictions may not belong to our label space. To guarantee that all generated labels are valid, we adopt a constrained decoding method. We start from the start token and iterate over all valid label sequences in the label set. We then calculate the probability of decoding each valid label sequence and choose the one with the highest probability as the final predicted label. The decoding is constrained in that we only keep track of valid partial sequences during decoding. In HAR datasets, the size of the label set is relatively small, and constrained decoding consumes only a small constant amount of memory (proportional to the size of the label set). At step t, we calculate the probability for all valid partial sequences of length t and pass them into the decoder for generating tokens at step t+1. The final prediction is the sequence that maximizes the overall sequence probability: ŷ_i = argmax_{y ∈ Y} Π_{t=1}^{L+1} p(y^t | y^0, ..., y^{t-1}, z_i). We use a Long Short-Term Memory (LSTM) network as an example label structure-constrained decoder, given its effectiveness in modeling sequential dependencies [33]. We transform the CNN-extracted features z_i through two separate linear layers to initialize the hidden state and cell state of the LSTM.
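The constrained decoding procedure can be sketched as follows. The toy conditional-probability table here stands in for the trained LSTM decoder (which would condition on the encoder output z_i), and the label set is illustrative:

```python
import math

# Toy stand-in for the decoder's p(token | prefix). A real model would be
# an LSTM conditioned on the encoder output z_i; here we use a fixed table.
COND = {
    ("<s>",): {"open": 0.6, "close": 0.4},
    ("<s>", "open"): {"door": 0.7, "fridge": 0.3},
    ("<s>", "close"): {"door": 0.9, "fridge": 0.1},
}

# The valid label set: only these sequences are ever scored.
VALID_LABELS = [["open", "door"], ["open", "fridge"],
                ["close", "door"], ["close", "fridge"]]

def sequence_log_prob(tokens):
    """Sum log p(y^t | y^<t) along one valid label sequence."""
    prefix, logp = ("<s>",), 0.0
    for tok in tokens:
        logp += math.log(COND[prefix][tok])
        prefix = prefix + (tok,)
    return logp

def constrained_decode():
    """Score only the valid label sequences and return the most probable one."""
    return max(VALID_LABELS, key=sequence_log_prob)

print(constrained_decode())  # ['open', 'door']
```

Because scoring is restricted to VALID_LABELS, the prediction is guaranteed to be a valid label, unlike unconstrained beam search over free-form token sequences.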

Basic Token-Level Label Augmentation
To better learn the semantics of each token in the label sequence, we apply a token-level label augmentation strategy, as illustrated in Figure 4. During training, with a pre-defined probability, we randomly choose meaningful single words from the original label sequence as the new labels. For example, an original label sequence "ascending stairs" contains the single words "ascending" and "stairs", so we randomly select from "ascending", "stairs", and "ascending stairs" as the new label during training. Following the notation in Section 3 (Problem Setting), the original label is augmented into a set of new labels {y_i, t_1, t_2, ..., t_K} containing the label sequence y_i and its meaningful tokens. For each iteration, with a pre-defined probability, we randomly select the new label y'_i from this set as the actual label. Optimization with token-level label augmentation can be formulated as: L = -Σ_{t=1}^{L'_i+1} log ŷ_i^t[y'_i^t], where L'_i is the length of the new label y'_i, y'_i^t is the t-th token of y'_i, and ŷ_i^t is the distribution of the t-th predicted token. Since the goal of label augmentation is to help the model better capture the semantics of different activities, we only choose meaningful single tokens in the original label sequences (e.g., actions and objects) as new labels. Other single tokens, like stop words or numbers (e.g., "1" in "open door 1"), do not count as new labels. Note that token-level augmentation is only applied during training. During evaluation, the ground-truth label stays the same as the original label. Because we adopt a constrained decoding method during inference, it is guaranteed that all generated label sequences are valid sequences in the original label set.
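A minimal sketch of this augmentation, assuming a simple whitespace tokenizer and an illustrative stop-token list (a real implementation would filter stop words and dataset-specific numbering more carefully):

```python
import random

STOP_TOKENS = {"1", "2", "3"}  # illustrative: numbers/stop words never become labels

def augment_label(label, p=0.5, rng=random):
    """With probability p, replace the original label sequence by one of its
    meaningful single tokens; otherwise keep the original sequence.
    Applied only during training; evaluation uses the original label."""
    original = label.split()
    meaningful = [t for t in original if t not in STOP_TOKENS]
    if meaningful and rng.random() < p:
        return [rng.choice(meaningful)]
    return original

rng = random.Random(0)
samples = {tuple(augment_label("open door 1", rng=rng)) for _ in range(200)}
print(sorted(samples))
```

Over many draws, "open door 1" is trained sometimes as the full sequence and sometimes as just "open" or "door", so the semantics of each shared token are consolidated across activities.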

Enhanced Embedding-Level and Sequence-Level Augmentations
Apart from the basic token-level augmentation, we also develop two enhanced augmentation techniques that better capture label structures at the embedding and sequence levels by leveraging the power of pre-trained models.
Embedding-Level Augmentation. Our label structure decoding architecture can capture label structures explicitly presented as shared label names. Yet, apart from these explicit shared names, there may also exist semantic structures that implicitly span different activities. For example, "walk" and "run" are similar activities involving the movement of legs, but they do not directly share tokens in their label names. We have observed that such semantic structures can be captured by word embeddings from pre-trained models. We thus propose to use word embeddings from pre-trained models to initialize our decoder's word embedding layer, replacing the original random initialization. Specifically, we utilize word embeddings from ImageBind, a multimodal pre-trained model that learns a joint embedding space across six modalities. As shown in Figure 5, we apply t-SNE visualization to both the ImageBind word embeddings and the input sensor readings of some example activities in the PAMAP2 dataset [37]. For activity names comprising multiple tokens, we calculate the average embedding of the constituent tokens. The t-SNE visualizations show similar clusters between the ImageBind word embeddings and the original data embeddings. As a result, incorporating pre-trained word embeddings helps SHARE better capture semantic structures.
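The initialization scheme can be sketched as follows; the random vectors here stand in for real pre-trained (e.g., ImageBind) embeddings, which would be loaded from the model rather than generated:

```python
import numpy as np

# Illustrative stand-ins for pre-trained word embeddings (e.g., from ImageBind);
# a real implementation would load the model's actual embedding vectors.
rng = np.random.default_rng(0)
pretrained = {w: rng.normal(size=16) for w in ["walk", "run", "stairs", "up"]}

def init_embedding(vocab, dim=16):
    """Initialize the decoder's word embedding layer: use the pre-trained
    vector when available, fall back to random initialization otherwise."""
    return np.stack([pretrained[w] if w in pretrained else rng.normal(size=dim)
                     for w in vocab])

def name_embedding(label):
    """Average the token embeddings of a multi-token activity name (as in Fig. 5)."""
    return np.mean([pretrained[t] for t in label.split()], axis=0)

emb = init_embedding(["walk", "stairs", "jump"])  # "jump" falls back to random
print(emb.shape)  # (3, 16)
```

The averaged name_embedding is used only for visualization; the decoder itself keeps per-token embeddings, which is what preserves the label structure.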
Sequence-Level Augmentation. Most HAR datasets have sufficient overlapping structures in label names. However, there also exist datasets that have few or no shared tokens in their original label names. For these datasets, we can use large language models to automatically generate label names with shared tokens. Specifically, we employ GPT-4 with the following prompt: Describe the following activities one by one with information of 1. body part used, 2. action or adverb, 3. object (if involved). Please maximize the number of shared tokens across different activities and make the description as short as possible.
As human activities naturally have shared actions and objects, the prompt helps find common tokens across activities. With the aid of the pre-trained language model, this process is performed with minimal human expert effort. Based on the structured information provided by the pre-trained model, we can summarize the label names with shared tokens. We apply sequence-level augmentation mostly to datasets without original shared tokens. If the target HAR dataset already has sufficient overlapping tokens, we directly use the original label names provided by human experts.

EVALUATION

Datasets, Baselines, and Metrics
We use six HAR benchmark datasets for evaluation, summarized in Table 1 with examples of shared label names. We split the data and choose window sizes following previous works [10,20]. The training and testing split is based on different participating subjects.
Opportunity [38] collects readings from 4 users with 6 runs per user. Sensors include body-worn, object, and ambient sensors. The full dataset includes annotations on multiple levels, and we use mid-level gesture annotations, which contain shared label structures. PAMAP2 [37] comprises readings collected from 9 subjects wearing 3 IMUs sampled at 100 Hz and a heart rate monitor sampled at 9 Hz. The three IMUs are positioned over the wrist of the dominant arm, on the chest, and on the dominant side's ankle. UCI-HAR [1] is collected from a group of 30 volunteers, each wearing a Samsung Galaxy S II smartphone attached at the waist. Feature vectors were further extracted from each sliding window of the collected data in the time and frequency domains.
USC-HAD [55] involves 14 subjects performing 12 low-level activities, collected with MotionNode (a 6-DOF IMU designed for human motion sensing applications). WISDM [46] is collected from accelerometer and gyroscope sensors in a smartphone and smartwatch at a rate of 20 Hz; 51 subjects perform 18 activities for 3 minutes each.
We evaluate the performance of SHARE and the baselines using accuracy and macro-F1. Macro-F1 is defined as macro-F1 = (1/C) Σ_{i=1}^{C} (2 × Prec_i × Rec_i) / (Prec_i + Rec_i), where Prec_i and Rec_i represent the precision and recall for category i, and C is the total number of categories.
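For reference, a direct pure-Python implementation of this macro-F1 definition (per-class F1 averaged with equal weight per class):

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: unweighted mean of per-class F1, matching the formula above."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Unlike accuracy, macro-F1 weights every class equally, which is why it is the more informative metric under the label imbalance settings studied later.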

Experimental Setup
We use a two-layer convolutional neural network as the encoder for extracting features. The kernel sizes of both layers are set to 3, and each layer is followed by batch normalization. We adopt an LSTM with a hidden dimension of 128 as the decoder, based on a grid search over {64, 128, 256}. We use the Adam optimizer with learning rate 1e-4, based on a grid search over {1e-5, 1e-4, 1e-3, 1e-2}, and batch size 16. For all datasets, we further randomly split the training set into 80% for training and 20% for validation. We conduct the experiments in PyTorch with an NVIDIA RTX A6000 GPU (48GB memory), an AMD EPYC 7452 32-core processor, and Ubuntu 18.04.5 LTS. We tune the hyper-parameters of both SHARE and the baselines on the validation set and then combine the training and validation sets to re-train the models after hyper-parameter tuning.

Results
We repeat 5 runs and report the average accuracy, macro-F1 score, and standard deviations of SHARE and the baselines in Table 2. We see that SHARE consistently outperforms both statistical and deep learning baselines. SHARE handles label names with pairwise overlap (e.g., "open drawer" and "close drawer"), which form a graph rather than a tree structure, without the cost of manual labeling from experts. TST and TARNet leverage unsupervised representation learning to boost classification performance. However, they do not explicitly take account of label structures to model relations across different activities. Other top-performing HAR or time-series classification methods, such as Rocket and THAT, propose better feature extractors to improve recognition performance, but they also neglect label name structures. SHARE is capable of leveraging the inherent shared structures in label names, leading to the highest accuracy and macro-F1 score.
To assess the statistical significance of the performance differences between SHARE and the baselines, we applied the Wilcoxon signed-rank test with Holm's correction (5%), following the procedures described in ShapeNet [17,26]. The test indicates that the improvement of SHARE over all the baselines is statistically significant, with p-values far below 0.05 (e.g., p = 5e-4 for the best-performing baseline THAT).

Model Variants
We also compare SHARE with some of its variants to examine the source of the performance gain. For all variants, we use the same encoder for feature extraction as SHARE.
• VanillaHAR: We use the same encoder as SHARE to extract features embedded in the data and directly append a linear layer for classification, without label name modeling.
• VanillaHAR + ImageBind embeddings: We also try directly incorporating ImageBind embeddings into VanillaHAR. This variant has two separate linear branches at the end: one branch classifies the labels, and the other predicts embeddings for the label names. During training, apart from the classification cross-entropy loss, we maximize the cosine similarity between the predicted embeddings and the pre-trained ImageBind embeddings. If a label name has multiple words, we use the average ImageBind embedding of the words as the embedding for the entire label name sequence.
• multi-label classification: We also try two separate classifiers subsequent to the encoder. The first classifier predicts the original labels, and the second operates as a multi-label classifier that estimates individual tokens within the label sequences. For example, to predict the class "walk forward", the second classifier labels "walk" and "forward" as positive and other tokens as negative. Classification of shared tokens helps learn dependencies across activities; during testing, we only compare scores from the first classifier for the original activity classes.
• no aug: We keep the label structure decoding architecture but remove all three label augmentations.
• no token aug: We keep the label structure decoding architecture but remove token-level augmentation during training.
• no embed aug: We randomly initialize the decoder word embedding layer instead of using ImageBind word embeddings.
• no seq aug: The Mhealth dataset [4] (publicly available at the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/mhealth+dataset) rarely has shared tokens in its original label names. We compare the performance of SHARE on its original non-overlapping label names and on pre-trained model-augmented shared label names.
As shown in Table 3, we observe significant
improvement from only applying a feature encoder to the proposed label structure architecture that decodes label names. Regressing label name embeddings by optimizing a cosine similarity loss with ImageBind embeddings only slightly improves performance. This demonstrates that directly incorporating word embeddings does not explicitly take into account the shared label name structures and loses information when aggregating multiple words into a single label embedding. By contrast, SHARE generates label sequences, which preserves the label structures and encourages knowledge sharing across activities. Compared with multi-label classification, our label structure decoding approach preserves the word order (especially for multi-grams) and word correlation in label sequences. Moreover, the performance degrades after removing either token-level or embedding-level augmentation (or both), which validates their importance in capturing shared word semantics. For sequence-level augmentation, we summarize the original label names and those generated by the pre-trained model (GPT-4) in Table 6. We compare SHARE using generated label names against both the baselines (Table 4) and our model variants (Table 5) on the Mhealth dataset. With the help of the automated label generation method, SHARE demonstrates state-of-the-art performance on HAR datasets without original shared label names. Moreover, we observe that sequence-level and embedding-level augmentations serve as complementary strategies that synergistically enhance performance.

Few-Shot Settings
We further evaluate SHARE under various few-shot settings.

Reduced Training Samples. We randomly reduce the number of samples in the training sets of two HAR datasets (Opportunity and UCI-HAR) to 20%, 40%, 60%, and 80%, and evaluate macro-F1 on the same original test set. We conduct the experiments for 5 runs and report both the average macro-F1 and the standard deviation. Figure 6a illustrates the performance trend of SHARE, VanillaHAR, and the best-performing baselines as we vary the size of the training set. As the figure shows, macro-F1 generally increases as the number of available training samples increases. Moreover, the performance gap between SHARE and the other methods becomes larger when less training data is available, showing that decoding label names helps learn the common structures shared across different classes.

Label Imbalance. The above experiment reduces training samples for all classes. Many HAR datasets also naturally have a long-tail distribution where some activities have fewer samples, as they are more difficult to collect. We also experiment under such label imbalance scenarios, as shown in Figure 7. We compare SHARE and the vanilla classification model VanillaHAR by visualizing example activities with shared tokens. The activity names are sorted in decreasing order of label percentage in the dataset. Performance grows significantly when adopting the label structure decoding architecture, as decoding label names helps transfer the shared word semantics to classes with fewer available samples. For example, for the tail classes "open drawer 1", "close drawer 1", and "open drawer 2", VanillaHAR shows a low F1 score (even zero for "close drawer 1"), while SHARE substantially improves performance on these classes, as it is able to leverage label structures to learn from other classes.

Reduced Window Size. We also reduce the sampling frequency (window size) on both the training and test sets by
a factor of 2,4,8 and report the performance of SHARE, VanillaHAR as well as the best-performing baselines in Figure 6b.SHARE also stays robust with respect to down-sampling factors, as it encourages knowledge transfer via modeling label name structures.We observe that our proposed SHARE consistently outperforms VanillaHAR and baselines under different down-sampling factors.
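The two reduction protocols above can be sketched as follows. This is a minimal illustration with hypothetical array shapes, not the authors' released code; `subsample_training_set` and `downsample_windows` are names we introduce here.

```python
import numpy as np

def subsample_training_set(X, y, fraction, seed=0):
    """Randomly keep `fraction` of the training samples (one few-shot run)."""
    rng = np.random.default_rng(seed)
    n_keep = int(len(X) * fraction)
    idx = rng.choice(len(X), size=n_keep, replace=False)
    return X[idx], y[idx]

def downsample_windows(X, factor):
    """Reduce the window size by keeping every `factor`-th time step."""
    return X[:, ::factor, :]

# Toy data: 100 windows, 128 time steps, 6 sensor channels.
X = np.zeros((100, 128, 6))
y = np.zeros(100, dtype=int)

X_small, y_small = subsample_training_set(X, y, fraction=0.2)
X_ds = downsample_windows(X, factor=4)
print(X_small.shape, X_ds.shape)  # (20, 128, 6) (100, 32, 6)
```

Repeating the subsampling with different seeds and averaging macro-F1 over the runs reproduces the 5-run protocol described above.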

Case Study
In this section, we further explore the benefits of modeling label structures through case studies.

Confusion Matrix. Using the Opportunity dataset as an example, the confusion matrices in Figure 8 show that SHARE discriminates activities better than VanillaHAR, especially for activities with fewer samples. In Figure 8, the value at the i-th row and j-th column is the number of instances with ground truth label i that are predicted as label j. The number of "open drawer 1" instances mispredicted as "close drawer 1" drops from 2 to 0, and the number of correctly predicted instances increases from 1 to 4.

Feature Embedding. We apply t-SNE visualization to the feature spaces of VanillaHAR and SHARE on the WISDM dataset, plotting the average feature of each activity, as illustrated in Figure 9. VanillaHAR loses the semantic information in the feature space; for example, "eating soup" is positioned at a large distance from the other "eating"-related activities. By contrast, SHARE preserves the label structures in the feature space, indicating a more coherent and precise mapping of related activities.
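A confusion matrix with this row/column convention can be computed directly from predictions. The sketch below uses scikit-learn with hypothetical label arrays for three of the drawer classes; the counts are illustrative only, not the values in Figure 8.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_names = ["open drawer 1", "close drawer 1", "open drawer 2"]
y_true = np.array([0, 0, 0, 1, 1, 2])  # hypothetical ground truth IDs
y_pred = np.array([0, 1, 0, 1, 1, 2])  # hypothetical predicted IDs

# cm[i, j] = number of instances with ground truth i predicted as j.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)  # [[2 1 0] [0 2 0] [0 0 1]]
```

Here one "open drawer 1" instance is mispredicted as "close drawer 1", which appears as an off-diagonal entry in row 0, column 1.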

Complexity Analysis
We compare the model complexity of SHARE with the best-performing deep models TST, TARNet, and THAT on the PAMAP2 dataset. Specifically, we report the number of parameters, the model size (the number of bytes required to store the parameters), and the average running time for a batch of 16 samples (averaged over 10,000 runs). We conducted the complexity analysis on a single NVIDIA RTX A6000 48GB GPU. For TST, we only compare the complexity of the supervised fine-tuning phase. As shown in Table 7, SHARE has the smallest number of parameters, model size, and average running time, while outperforming the more complex deep models.
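The three complexity metrics can be measured as sketched below. A tiny weight dictionary stands in for an actual network so the sketch stays framework-agnostic; the parameter shapes and run count are illustrative, not those of SHARE or the baselines.

```python
import time
import numpy as np

# Stand-in "model": one float32 weight matrix and bias vector.
params = {
    "weight": np.zeros((18, 64), dtype=np.float32),
    "bias": np.zeros(64, dtype=np.float32),
}

# Metric 1: number of parameters. Metric 2: model size in bytes.
n_params = sum(p.size for p in params.values())
size_bytes = sum(p.nbytes for p in params.values())

def forward(batch):
    return batch @ params["weight"] + params["bias"]

# Metric 3: average running time for a batch of 16 samples.
batch = np.zeros((16, 18), dtype=np.float32)
runs = 1000
start = time.perf_counter()
for _ in range(runs):
    forward(batch)
avg_ms = (time.perf_counter() - start) / runs * 1e3

print(n_params, size_bytes)  # 1216 1216*4 = 4864 bytes for float32
```

In a deep learning framework the same quantities come from iterating over the model's parameter tensors (size and element byte width), with timing done after GPU synchronization.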

CONCLUSION
We proposed SHARE, a novel HAR approach that explicitly models the semantic structure of class labels and classifies activities by decoding label name sequences. Compared with conventional methods that treat labels simply as integer IDs, SHARE enables knowledge sharing across different activity types via label name modeling and alleviates the shortage of annotated data in HAR. We also designed three label augmentation techniques, at the token, embedding, and sequence levels, to help the model better capture semantic structures across activities. We evaluated SHARE on seven HAR benchmark datasets, and the results demonstrate that our model outperforms state-of-the-art methods.
In the future, we plan to adapt our design to more complex backbone models, as well as to image-based and video-based human activity recognition. We also plan to experiment with other types of datasets that have shared label name structures, e.g., medical datasets with shared disease names. In this work, we assumed that shared label name structures very likely imply similarity in activity types. However, this assumption may not hold when the problem scope extends to handling multiple datasets simultaneously, where the same label names may correspond to slightly different data collection settings. We believe further investigation to lift this assumption will offer meaningful insights.

ACKNOWLEDGEMENTS
Our work is supported in part by ACE, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. Our work is also supported by a Qualcomm Innovation Fellowship and is sponsored by NSF CAREER Award 2239440, the NIH Bridge2AI Center Program under award 1U54HG012510-01, as well as generous gifts from Google, Adobe, and Teradata. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and should not be interpreted as necessarily representing the views, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright annotation hereon.
T-SNE results of sensory data with the colors denoting different shared label structures.

Figure 2: (a) Labels in HAR datasets typically share common structures. (b) T-SNE visualization of sensory data in the Opportunity dataset [38]. Activities with the same actions or objects (marked by the same colors) are closer. Each point represents one data sample, and each type of marker represents a different type of activity. The two figures have the same set of data points and markers and differ only in colors: the same color represents common actions (left figure) or common objects (right figure).

Figure 3: Framework of SHARE. We encode the time-series features and decode the label sequences as predictions. We further design three augmentation methods at different levels to better capture the shared semantic structures.

Figure 4: Illustration of basic token-level augmentation. We augment the original label name sequence by randomly choosing its meaningful tokens or the sequence itself.

Figure 5: T-SNE visualizations show analogous clusters between input data and ImageBind word embeddings. Activities in the same color represent clusters of similar activities.

Figure 6: Macro-F1 of SHARE, VanillaHAR, and best-performing baselines with reduced training samples and window size.

Figure 7: Macro-F1 of example activities with shared label names for SHARE and VanillaHAR on the Opportunity dataset with long-tail label distribution.

Figure 8: Confusion matrices of VanillaHAR and SHARE on the Opportunity dataset. SHARE better discriminates different activities, exemplified by the classes marked with red squares.

Figure 9: T-SNE visualization of the feature space. SHARE better preserves the semantics in the feature space.

Table 1: Dataset statistics and an example subset of shared label names.

Table 2: Accuracy and Macro-F1 for SHARE and baselines. We bold the best score and underline the second best.

Table 4: Accuracy and Macro-F1 on the Mhealth dataset. We bold the best score and underline the second best.

Table 5: Different model variants on the Mhealth dataset. We bold the best score and underline the second best.

Table 6: Original/generated label names for the Mhealth dataset.