CroSSL: Cross-modal Self-Supervised Learning for Time-series through Latent Masking

Limited availability of labeled data for machine learning on multimodal time-series significantly hampers progress in the field. Self-supervised learning (SSL) is a promising approach to learn data representations without relying on labels. However, existing SSL methods require expensive computations of negative pairs and are typically designed for single modalities, which limits their versatility. We introduce CroSSL (Cross-modal SSL), which puts forward two novel concepts: masking intermediate embeddings produced by modality-specific encoders, and their aggregation into a global embedding through a cross-modal aggregator. CroSSL allows for handling missing modalities and end-to-end cross-modal learning without requiring prior data preprocessing for handling missing inputs or negative-pair sampling for contrastive learning. We evaluate our method on a wide range of data, including motion sensors such as accelerometers or gyroscopes and biosignals (heart rate, electroencephalograms, electromyograms, electrooculograms, and electrodermal activity). Overall, CroSSL outperforms previous SSL and supervised benchmarks using minimal labeled data, and also sheds light on how latent masking can improve cross-modal learning.


INTRODUCTION
Sensory signals captured through multiple modalities of smart devices, such as accelerometers, gyroscopes, and electroencephalography (EEG), facilitate various applications, ranging from human activity recognition (HAR) [33] to sleep tracking via brain-activity monitoring [18]. Many such emerging applications are based on machine learning (ML) techniques, particularly deep neural networks (DNNs). However, the reliance on labeled data for training DNNs has hindered their ability to scale effectively [38]. Due to the high cost and time demands of gathering, annotating, and managing extensive labeled datasets, self-supervised learning (SSL), which learns from unlabeled data, has been investigated [27]. SSL defines an artificial task, known as a pretext task, in which the supervisory signal is automatically generated from unlabeled data, enabling the training of an encoder model to learn a latent representation of the input data [38]. SSL has demonstrated practicality in HAR [33] by leveraging large amounts of unlabeled data and fine-tuning on downstream tasks with limited labeled data.
However, existing SSL methods are mostly designed for unimodal data [11,27,33,36] and struggle to handle multimodal data. Particularly in numerous health monitoring and physiological applications, data is often acquired from heterogeneous sensors of various modalities with different characteristics (e.g., sampling rates and resolution). Current SSL methods show inadequate performance when it comes to aggregating and compressing a time-window of various sensors into a coherent global embedding that can properly serve a downstream task [7]. Creating embeddings that incorporate multiple modalities becomes even more challenging due to the dynamic nature of real-world situations, in which the granularity of sensor data and the availability of modalities can differ from one user to another or from time to time.
By collecting and analyzing data from diverse sources, we can inform and improve our understanding of human behavior and physiology. For example, in the case of elderly health monitoring, a 360-degree health-monitoring system that combines images captured by smart glasses, audio signals captured by smart earbuds, and sensor time series captured by smartwatches can provide a wealth of information about the user's physical and cognitive state [20]. Here, multi-modal SSL can distill the combined data into a unified inference engine that can facilitate several downstream tasks, such as the early detection of falls, the prediction of cognitive decline, and the monitoring of sleep quality and physical activity levels. Moreover, multi-modal SSL enables the discovery of complex interconnections and correlations among multiple data sources, which can provide a more comprehensive understanding of human sensing. For instance, as humans, we naturally learn to identify objects in our surroundings through observations made using our multiple senses. In the absence of input from some senses, we can still recognize the object through our remaining senses, demonstrating the power of multi-modal integration.
Notably, to design multimodal SSL in real-world settings, we face two major challenges. (1) Heterogeneous Sensors: different sensors require different preprocessing due to their data characteristics and different sampling rates. Direct integration of heterogeneous sensor data leads to inconsistencies in the global embeddings and unsatisfactory performance [31,37]. (2) Missing Sensors: multimodal SSL must be robust to missing modalities. During SSL training, a model might learn representations that rely on correlations between different modalities; however, there is no guarantee that all the modalities are available at inference time. These two challenges are partially addressed by a prior work, COCOA [7], via the concept of modality-specific encoders and a customized loss function to align latent embeddings across different modalities. However, COCOA does not perform aggregation on the modality-specific embeddings and does not consider the challenge of missing modalities. Thus, as we show, COCOA struggles to produce useful global embeddings for downstream tasks when some modalities are missing.

RELATED WORK
There are two common pre-training approaches: (1) utilizing models pre-trained on labeled data from a different task, or (2) leveraging unlabeled data from the same task [38]. The former is not applicable to healthcare applications due to limited labeled data and heterogeneity of sensors, but the latter has shown effectiveness in learning general and transferable features [4,21]. Supervised learning works explore fusion strategies and adapted DNNs for HAR [24]. Unsupervised or self-supervised learning approaches are also applied to sensor signals, focusing on HAR, including multi-tasking, contrastive learning, and predictive coding [12,27,33,35]. None of these works are specifically designed for multimodal learning, where we need representations that capture both sensor-specific temporal dependencies and global spatial dependencies across sensors. ColloSSL [17] and COCOA [7] address some aspects of multimodal learning but have limitations such as non-trivial negative pair mining and the inability to handle missing modalities. Missing modalities are challenging as they result in incomplete or biased data, and learning representations that generalize across modalities is difficult due to different distributions and feature spaces. Incorporating various modalities is even more complex due to the absence of shared information and the challenges associated with alignment and integration.
Masking techniques have been proposed mostly for vision and text data [14?]. MultiMAE [1] uses input-level masked autoencoders, which, however, are designed for image data and lack support for multimodal time series. While masking is straightforward in the input data space, it is not commonplace in the latent space. In recent speech models such as TERA and Wav2Vec [2,19], a temporal mask is first randomly applied in the latent space, where 50% of the projected latent feature vectors are dropped. While our approach is conceptually similar to these, we extend this idea to a multimodal architecture that learns from different views of two masks. This is crucial, as not all modalities may be available during training or inference. Addressing these challenges, we incorporate latent masking and modality-specific encoders to learn joint representations of multimodal healthcare data, enabling efficient integration and robustness to missing modalities.

[Figure 1 panel title: Supervised Fine-tuning. Figure 2 caption: SSL objective function.]

Let G denote a downstream classifier that takes the global embedding as the input and makes the final prediction. For the purpose of brevity, we drop the index unless it is required.
Goal. A naive solution combines data from all modalities into a single encoder, but it does not work when the data types and sizes vary across modalities. An alternative is to use a separate encoder and classifier for each modality and aggregate their predictions. This approach enhances robustness but is expensive and does not leverage information from multiple modalities. Our goal is to propose a solution that facilitates the usage of multiple heterogeneous modalities without requiring multiple classifiers.
Solution Overview. We build upon the SSL paradigm with the aim of capturing both intra-modality (i.e., temporal) and inter-modality (i.e., spatial) dependencies. The objective is to build a unified global embedding from available data sources. Ideally, we want to compress information captured by each modality so that the aggregated information can be efficiently and accurately used for any downstream task (e.g., classification). To this end, we assume that data captured by different modalities at the same time can be interpreted as natural transformations of each other, where all modalities are sensing the same phenomena. Thus, such multimodal data, even in the absence of any annotation, can be leveraged by SSL methods due to the phenomena that are shared among all of them. In our setting, in contrast to traditional SSL approaches, the supervisory signal does not come from only "self" [12] or "other" [7] sources of data but from the aggregation of (any subset of) sources. In sum, our hypothesis is built upon the fact that (i) although individual sensors provide complementary views of the same event, they may not record the same information, or (ii) some sensors might not be relevant in some situations. However, (iii) the aggregated information offered by the global embeddings can represent the current state of the shared event.
CroSSL: Cross-modal Self-Supervised Learning. Figure 1 illustrates the overall architecture, which consists of self-supervised pretraining (left panel) and fine-tuning (right panel). First, our goal in CroSSL is to utilize unlabelled and asynchronous data collected from different modalities and devices, and learn modality-specific encoders $E_i$, $i \in \{1, 2, \dots, M\}$, followed by a cross-modal aggregator $A$. Each $E_i$ is trained to generate an informative intermediate embedding for sensor $i$, in a way that the aggregation of all (or part of) the intermediate embeddings can represent the state of the shared phenomena. To this end, the aggregator is trained to learn both spatial and temporal dependencies across input data sources. Secondly, we aim to use all pre-trained encoders and the aggregator to obtain descriptive embeddings for a small amount of labeled data and subsequently train a classifier $G$, which maps these latent joint embeddings to the corresponding class labels. Our self-supervised pre-training is depicted in Figure 1.
Finally, in Step 2, we fine-tune the pre-trained encoders and aggregator module along with the downstream task. We use the pretrained encoders and the aggregator, together with a labeled dataset of the given task, to train the classifier G (or any other downstream task) in a supervised fashion. The evaluations (Section 4) revealed that this latent embedding masking recipe for pre-training offers greater robustness compared to other baselines in dealing with lower-quality data that contains missing information during either or both of the fine-tuning and inference steps.
Self-supervised Objective function. The dominant SSL techniques used in computer vision and natural language processing are based on either contrastive or reconstruction-based objectives. In contrastive learning models, the objective is to minimize the distance between positive pairs (e.g., two augmented views of the same image) while pushing apart the representations of negative pairs (e.g., augmented views of different images). The primary vulnerabilities of contrastive learning models lie in the quality of the positive and negative pairs utilized during training. In applications such as human activity recognition or emotion recognition, with limited variety in the number of classes and tasks compared to vision tasks, the probability of comparing samples with fake negatives during contrastive training is high [6]. Fake negative pairs are essentially positive pairs that are incorrectly labeled as negative pairs due to non-ideal sampling techniques, resulting in the model mistakenly attempting to push them apart. Although several negative mining techniques have been proposed to remove fake negatives or avoid the biases caused by their existence in the final contrastive objective function [5,26], they have been shown to be ineffective for wearable sensor data [6].
Recently, there has been an emergence of regularization-based SSL techniques that do not require negative sampling, such as BYOL [10], Barlow-Twins [39], and, most recently, VICReg, which has been shown to be more effective in representation learning [3]. As we utilize a variant of the VICReg (Variance-Invariance-Covariance Regularization) loss function, we briefly introduce this loss function and how we integrated it into our multi-sensor setup. For sample $x_n$, we generate two global embeddings $z^{1}_{n}, z^{2}_{n} \in \mathbb{R}^{D}$, where $n$ is the index of the corresponding original sample in the batch and $D$ is the dimension of the embeddings coming out of the two branches of the aggregator. We write $Z^{1}, Z^{2} \in \mathbb{R}^{N \times D}$ for the two batches of embeddings. The optimization function is based on three parts:
• Invariance: Minimize the distance (or maximize similarity) between the two global embeddings of each sample $x_n$:
$$s(Z^{1}, Z^{2}) = \frac{1}{N} \sum_{n=1}^{N} \lVert z^{1}_{n} - z^{2}_{n} \rVert_2^2 .$$
• Variance: Maintain the variance of each variable of the embeddings, denoted by $z^{1|2}_{\cdot,d}$, $d = 1 \cdots D$, above a threshold. The variance regularization term is defined as:
$$v(Z) = \frac{1}{D} \sum_{d=1}^{D} \max\big(0,\ \gamma - S(z_{\cdot,d}, \epsilon)\big),$$
where $\gamma > 0$ is a constant threshold for the regularized standard deviation term $S(x, \epsilon) = \sqrt{\mathrm{Var}(x) + \epsilon}$ and $\epsilon$ is a small scalar to avoid numerical instabilities. Here, $z_{\cdot,d}$ indicates that the standard deviation is computed for the $d$-th variable of the global embedding across the batch of size $N$. This term prevents collapsed representations by encouraging the standard deviation of each variable of the global embedding across the batch to reach $\gamma$.
• Covariance: Minimize correlation between variables of the same embedding by minimizing the covariance as below:
$$c(Z) = \frac{1}{D} \sum_{d \neq d'} [C(Z)]_{d,d'}^{2},$$
where $C(Z)$ is the covariance matrix of $Z$, and the covariance regularization term $c(Z)$ is the sum of the squared off-diagonal coefficients of the covariance matrix with a factor of $1/D$. This forces the off-diagonal elements towards zero to decorrelate the embedding variables and spread information across variables.
Figure 2 illustrates our objective function. The overall loss function is a weighted average of the above terms:
$$\mathcal{L}(Z^{1}, Z^{2}) = \lambda\, s(Z^{1}, Z^{2}) + \mu\, [v(Z^{1}) + v(Z^{2})] + \nu\, [c(Z^{1}) + c(Z^{2})].$$
All the constant values are set as suggested by [3]. The hyper-parameters $\lambda$, $\mu$, and $\nu$ have been set to 10, 10, and 100, respectively, through grid-search hyper-parameter tuning.
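As an illustration of the invariance, variance, and covariance terms described in this section, the following minimal sketch computes a VICReg-style loss on two batches of global embeddings in plain Python. It follows the formulation of [3] as summarized here. The function names, the threshold `gamma=1.0`, and the stabilizer `eps=1e-4` are assumptions made for this example, and the default weights mirror the coefficients of 10, 10, and 100 reported in this paper. It is a sketch only and not the authors' TensorFlow implementation.

```python
import math


def _covariance_matrix(z):
    """Unbiased D x D covariance matrix of a batch z (list of N rows of length D)."""
    n = len(z)
    cols = [list(c) for c in zip(*z)]
    centered = [[v - sum(c) / n for v in c] for c in cols]
    d = len(cols)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def invariance(z1, z2):
    """Squared Euclidean distance between paired rows, averaged over the batch."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(r1, r2)) for r1, r2 in zip(z1, z2)
    ) / len(z1)


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge penalty on each embedding variable whose std falls below gamma."""
    cov = _covariance_matrix(z)
    d = len(cov)
    return sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d


def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, scaled by 1/D."""
    cov = _covariance_matrix(z)
    d = len(cov)
    return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d


def vicreg_loss(z1, z2, inv_w=10.0, var_w=10.0, cov_w=100.0):
    """Weighted combination of the three terms for two embedding batches."""
    return (
        inv_w * invariance(z1, z2)
        + var_w * (variance_term(z1) + variance_term(z2))
        + cov_w * (covariance_term(z1) + covariance_term(z2))
    )
```

With these definitions, a fully collapsed batch (all rows identical) is penalized only through the variance term, whereas a batch whose variables are spread out and uncorrelated incurs no penalty when compared with itself.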

Theoretical Motivations
In CroSSL, we consider that different modalities sense common phenomena and share common high-level semantics, despite the fact that they might not necessarily capture the same amount of information. However, not all modalities are universally available, relevant, or helpful in real-world applications, and their usefulness varies depending on the situation and the specific task.
CroSSL assumes that there are two types of information available in each intermediate embedding $Q_i$: (1) cross-modal information that is about the shared phenomena captured by all modalities, and (2) modality-specific information that is not necessarily relevant to the shared phenomena. CroSSL aims to produce a global embedding $Z$ that carries cross-modal information between each sensor data $X_i$ and at least one other sensor $X_j$. Translating the above intuition into an information-theoretic formulation, we have $I(X_i; X_j) \geq \delta > 0$ for any $i \in [M]$ and a $j \in [M] \setminus \{i\}$, where $I(\cdot; \cdot)$ denotes the mutual information between two random variables (e.g., two randomly-sampled time-windows of sensor data). In many downstream tasks, the data $X_i$ is seen to be generated through a latent-variable generative process [8,34]. Basically, $X_i = G_i(Y)$, where $Y$ is the latent variable: a common source of variation that affects the data generated by all modalities (e.g., the user's activity or wellness is the unknown-but-common source of variation for all modalities that are sensing the user). Making inferences about such a common object or subject is usually the ultimate target of the downstream task when processing the global embedding.
With this assumption, we are interested in learning an aggregator $A$ to generate a global embedding $Z$ that satisfies:
$$I(Z; X_i \mid X_j) = I(Z; X_j \mid X_i) = 0.$$
This means that we aim to generate $Z$ such that it ideally captures only the information shared between both $X_i$ and $X_j$, and not information specific to only one of them. In other words, bringing the conditional mutual information $I(Z; X_i \mid X_j)$ to zero implies that, given $X_j$, we can recover the information that is needed to generate $Z$ without needing $X_i$. To understand this, we note that the chain rule for mutual information allows the following:
$$I(Z; X_i, X_j) = I(Z; X_j) + I(Z; X_i \mid X_j) = I(Z; X_i) + I(Z; X_j \mid X_i).$$
Therefore, our assumption can also be interpreted as generating a global embedding $Z$ that satisfies the following:
$$I(Z; X_i, X_j) = I(Z; X_i) = I(Z; X_j).$$
In this way, the generated global embedding $Z$ will ideally stay the same even if we miss one of our modalities. We emphasize that in CroSSL, we do not want to preserve all the information present in all modalities. In other words, we do not want trivial solutions where $Z$ is just a compressed version of the data provided by all the modalities, and to prevent this, we provide the aggregator with two distinct masked versions of the intermediate embeddings. Our approach is closer to the Information Bottleneck principle [29], which sets an objective to build $Z$ such that it includes only a small amount of the information carried by each $X_i$, namely the part that is relevant to $Y$. However, the challenge, in practice, is that most encoders $E$ are deterministic and can preserve an infinite amount of information. Notice that $I(Q_i; X_i) = H(Q_i) - H(Q_i \mid X_i)$, and for a deterministic $E$ we have $H(Q_i \mid X_i) = 0$.
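To make the information-theoretic condition above concrete, the following toy check evaluates it on a small discrete example. Everything in it is an assumption made for illustration: the latent variable is a fair bit, each of two modalities observes that bit plus one independent noise bit, and the two candidate aggregators (`shared`, which keeps only the common bit, and `copy_all`, which copies both inputs) are hypothetical. It is not part of the CroSSL training procedure.

```python
import math
from collections import defaultdict
from itertools import product


def entropy(joint, keep):
    """Entropy in bits of the marginal over the tuple positions listed in `keep`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[k] for k in keep)] += p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)


def mutual_info(joint, a, b):
    """I(A; B) = H(A) + H(B) - H(A, B)."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)


def cond_mutual_info(joint, a, b, c):
    """I(A; B | C) = H(A, C) + H(B, C) - H(A, B, C) - H(C)."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))


def build_joint(aggregate):
    """Joint distribution over (X_i, X_j, Z) with X = (shared bit, private noise bit)."""
    joint = defaultdict(float)
    for y, n_i, n_j in product((0, 1), repeat=3):
        x_i, x_j = (y, n_i), (y, n_j)
        joint[(x_i, x_j, aggregate(x_i, x_j))] += 1 / 8
    return joint


X_I, X_J, Z = (0,), (1,), (2,)  # positions of each variable in an outcome tuple
shared = build_joint(lambda x_i, x_j: x_i[0])        # Z keeps only the shared bit
copy_all = build_joint(lambda x_i, x_j: (x_i, x_j))  # Z copies both modalities

print(cond_mutual_info(shared, Z, X_I, X_J))    # conditional term for the shared-only Z
print(cond_mutual_info(copy_all, Z, X_I, X_J))  # conditional term for the copying Z
```

In this toy setting, the aggregator that keeps only the shared bit drives the conditional mutual information to zero and satisfies the chain-rule identity, while the aggregator that copies everything retains one bit of modality-specific noise.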
To retain cross-modal information in $Z$ and disregard modality-specific information, we employ randomization in generating $Z$ during training. This randomization is introduced through masking strategies, where certain modalities are masked, forcing the aggregator to rely on the remaining modalities for extracting cross-modal information. This ensures that the most informative modalities are used to generate the global embedding $Z$ for downstream tasks. Note that without randomization, the aggregator would produce identical embeddings by copying information from Q to Z, without distinguishing cross-modal from modality-specific information.
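The masking step described above can be sketched as follows. This is a minimal illustration in plain Python, under the assumption that masking is implemented by zero-filling: spatial masking hides whole modality embeddings, whereas random masking hides individual entries. The function names, the zero-fill choice, and the `two_views` helper are hypothetical and are not taken from the authors' code.

```python
import random


def spatial_mask(embeddings, n_masked, rng):
    """Zero out `n_masked` whole modality embeddings chosen at random."""
    hidden = set(rng.sample(range(len(embeddings)), n_masked))
    return [[0.0] * len(q) if m in hidden else list(q)
            for m, q in enumerate(embeddings)]


def random_mask(embeddings, rate, rng):
    """Zero out each embedding entry independently with probability `rate`."""
    return [[0.0 if rng.random() < rate else v for v in q] for q in embeddings]


def two_views(embeddings, mask_fn, rng, **kwargs):
    """Draw two independently masked views of the same intermediate embeddings."""
    return (mask_fn(embeddings, rng=rng, **kwargs),
            mask_fn(embeddings, rng=rng, **kwargs))


rng = random.Random(0)
# Three modalities, each with a 3-dimensional intermediate embedding (toy values).
q = [[0.5, 1.0, -0.3], [0.2, 0.1, 0.9], [1.5, -0.7, 0.4]]
view_a, view_b = two_views(q, spatial_mask, rng, n_masked=1)
```

The two views would then be passed through the aggregator to produce the two global embeddings that the self-supervised objective compares.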

EVALUATION

Datasets
To evaluate our approach, we study three different datasets with various types of sensors and applications: PAMAP2 [25], a human activity recognition dataset based on motion and heart rate sensors mounted on different parts of the body; the PhysioNet Sleep-EDF dataset [9,18] for sleep stage detection based on biosensor data; and WESAD [30], a dataset for stress and affect recognition using different types of biosensors. Table 2 provides details on the number of sensors, subjects, classes, and data size.

Experiments Setup
Encoders can be chosen appropriately depending on the type of sensors. In this work, we use three layers of a 1D convolution network as sensor-specific encoders and fully connected layers for the aggregator module. However, the proposed framework is encoder-agnostic, which means the backbone encoder can be replaced with any other encoder model appropriately chosen according to the type of each sensor. Following the evaluation framework in [17,34], a linear classifier was used to evaluate the quality of the pre-trained model and the extracted embeddings. The pipeline for the downstream task training is depicted in Figure 1. Our training setup is implemented in TensorFlow 2.0. We used the TF HParams API for hyperparameter tuning and arrived at the following training hyper-parameters: {ssl learning rate: 1e-4, classification learning-rate: 1e-3,  = 0.05}. We also explored a range of coefficients for the VICReg objective function, varying between {1, 10, 100} for each of the variance, invariance, and covariance coefficients; after hyper-parameter tuning, we noticed that the combination of {variance coefficient: 10, invariance coefficient: 10, covariance coefficient: 100} provides the most stable results.
We use the Adam optimizer and early stopping to stop the training after five epochs without progress. All models are trained for 100 epochs during self-supervised learning and 50 epochs during supervised classification. When fine-tuning the model, we must prevent the encoders from forgetting their learned parameters and being overwritten by the classifier loss function. To achieve this, we freeze the encoders and the aggregator for the first 20 epochs of fine-tuning, which allows the classifier to learn based on the representations extracted by the pre-trained model. After this initial period, we unfreeze the encoders and aggregator and fine-tune them alongside the classifier. Following prior works [7,12,17], we use the macro F1-score (unweighted mean of F1-scores over all classes) as the evaluation metric, as suggested for imbalanced datasets [23].
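The training schedule and metric described above can be sketched in a framework-independent way. The helper names below are hypothetical, and two details are assumptions made for this example: early stopping is taken to monitor a validation loss, and epochs are counted from zero. The macro F1-score follows the definition given in the text (unweighted mean of per-class F1-scores).

```python
def is_backbone_trainable(epoch, freeze_epochs=20):
    """Encoders and aggregator stay frozen for the first `freeze_epochs` epochs (0-indexed)."""
    return epoch >= freeze_epochs


def should_stop(val_losses, patience=5):
    """True once the last `patience` epochs failed to beat the best earlier loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores over all classes that appear."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally to the mean, a rare class that is poorly predicted lowers the macro F1-score as much as a frequent one would, which is why this metric suits imbalanced datasets.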

Baselines
To compare the effectiveness and capability in learning informative representations with no or limited labelled data, we investigate the performance of other baselines in different setups:
Fully supervised: We evaluate the performance of CroSSL against two fully supervised baselines: DeepConvLSTM [22] and "Supervised". DeepConvLSTM is a widely used model architecture in HAR, and "Supervised" is the supervised equivalent of CroSSL, in that it uses the same underlying architecture as CroSSL but is trained in a fully supervised manner. Both supervised baselines were trained using an end-to-end approach. For the DeepConvLSTM method, we utilize the implementation from a pre-existing source [16].
Self-supervised baselines: To evaluate the performance of CroSSL against other state-of-the-art self-supervised learning (SSL) models, we use COCOA [7] as a SOTA cross-modal SSL model. Other SSL baselines proposed for wearable sensor data, as discussed in Section 2, were not considered due to their lack of cross-modality or inferior performance when compared to COCOA, which has been shown to outperform numerous baselines.
Fixed and fine-tuned SSL encoders: We examine the performance of the representations in two different setups: (1) fixed and (2) fine-tuned encoders, following the evaluation procedure described in [7,15]. During pre-training, all encoders were trained in a self-supervised manner. In the Fixed setup, during the classifier training step, these encoders are frozen in order to evaluate the quality of the learned representations, while in the Fine-tuned setup, the encoders are re-trained (fine-tuned) along with the classifier based on the downstream task. We evaluate the effectiveness of the learned representations using a linear classifier and a Softmax layer, following the evaluation framework outlined in [34].

Comparison at One Glance
Table 3 presents the average F1-scores accompanied by their standard deviations. The optimal batch size (ranging from 8 to 64) for each baseline is reported. The results demonstrate that fine-tuned CroSSL surpasses the current state-of-the-art SSL methods and outperforms fully supervised baselines.
Spatial vs. Random masking. In terms of the masking strategy, spatial masking is clearly beneficial compared to the random masking strategy, providing 2.06%, 2.04%, and 5.3% higher F1-score in Fine-tuned CroSSL and 15.0%, 2.5%, and 5.3% in Fixed CroSSL across the SLEEPEDF, PAMAP2, and WESAD datasets, respectively. As we mentioned, the intuition behind spatial masking is that one or more sources of data are not available; hence, the whole data for those sensors will be masked. For example, some users may not have some of the sensing devices, or a device may be temporarily switched off. In random masking, by contrast, the sensors are available but not all the time; hence, the data can be sparse. For example, data is not available or useful at a particular moment due to energy issues or noise. We hypothesize that the inferior performance of random masking in comparison to spatial masking may be attributed to the model's need for continuity in the input data to derive meaningful information. Due to the random nature of the masking process, there may be several short segments of unmasked data, which may not provide sufficient information for the model to learn effectively.
Fixed vs. Fine-tuned SSL. In the case of Fixed encoders, CroSSL with the random masking strategy provides competitive results with COCOA but still cannot catch up with the supervised baselines. On the other hand, once CroSSL is trained with spatial masking, it improves the F1-score and can outperform the supervised baseline on the SLEEPEDF dataset. Once we fine-tune the self-supervised pre-trained COCOA, as well as CroSSL (random masking) and CroSSL (spatial masking), they provide much higher performance compared to their corresponding Fixed setups. Based on Table 3, this fine-tuning of COCOA, CroSSL (random), and CroSSL (spatial) yields an increase in F1-score of up to 7.9%, 15.5%, and 2.6% on the SLEEPEDF dataset, 5.12%, 8.6%, and 8.5% on the PAMAP2 dataset, and 36.5%, 39.2%, and 38.9% on the WESAD dataset, respectively.
Overall, the fine-tuned CroSSL with spatial masking outperforms the fully supervised baseline by 3.3%, 1.5%, and 4.6% across the SLEEPEDF, PAMAP2, and WESAD datasets, respectively. Moreover, CroSSL improves on the state-of-the-art cross-modal SSL model, COCOA, by 9.3%, 11.2%, and 1.3% across the SLEEPEDF, PAMAP2, and WESAD datasets, respectively. WESAD shows the lowest delta in performance, which can be attributed to the number of modalities, as it includes only three sensors. On the other hand, the highest performance gain is with the PAMAP2 dataset, which includes the highest number of modalities (seven sensors). This result further validates the superiority of CroSSL in datasets with a higher number of modalities and hence a higher chance of missing data.

Robustness Against Missing Data
Even though missing data is one of the most critical challenges in wearable and ubiquitous computing, it is less studied in the existing literature. In this section, we evaluate the impact of missing data at fine-tuning and inference times. The main goal is to assess the robustness of CroSSL against the permanent or temporary absence of data sources. Based on the evaluation outcome, we can decide the best approach for training downstream tasks where missing data is unavoidable. Results for the other baseline, COCOA, are not included in this section because COCOA and other SOTA multimodal SSL models are unable to handle missing data. To investigate this, we design three sets of experiments:
(1) No missing data: The training and test sets are perfectly well-curated, with no missing data. We provide this as the upper-bound performance.
(2) Missing data only at inference time: In these experiments, we assume that during training the model has access to well-curated data with all data points and sensors available. However, given that there is less control over capturing the data at inference time, some devices or sensors may be absent (switched off or not present).
(3) Missing data at training and inference time: We assume both training and test sets may contain missing data.
To make a fair comparison, we pretrain the encoders using both random and spatial masking strategies. We apply the pretrained model to the downstream task, assessing different missing data scenarios. Table 4 reports the average F1-score across five repeated runs with randomly missing data. As shown in Table 4, Fine-tuned CroSSL with spatial masking pre-training and no missing data at the fine-tuning stage achieves the best performance compared to the other setups and datasets, including the fully-supervised model. In the case of missing data at inference, the fine-tuned CroSSL can reach the same level of performance on both PAMAP2 and WESAD, with a small drop on SleepEDF. The results validate its robustness to missing data at inference time. In other words, once the model is trained with all available data, it can handle missing data at inference time much better than in the scenario where the (un)availability of sensors in the fine-tuning training set introduces instability.
In the case of missing data at both the fine-tuning and inference stages, we observe a considerable drop in performance compared to the fully supervised model: a 0.51, 0.41, and 0.58 decrease in F1-score on SleepEDF, PAMAP2, and WESAD, respectively. Although the lower performance is inevitable, CroSSL with spatial masking remains particularly effective in both its fine-tuned and fixed variants. The performance of CroSSL with Fixed encoders is not affected much by missing data at the fine-tuning step, and it outperforms both the supervised and fine-tuned models, highlighting the role of a well-trained encoder for high-quality representations.

Optimal Masking Strategies
As latent masking is central to CroSSL, we investigate the impact of the ratio of masked (missing) data. As an alternative to entirely random masking, we also apply spatial masking in order to assess whether hiding entire modalities increases the robustness of the model. Figure 3 shows that fine-tuned models outperform fixed ones across all settings.
Random Masking. We observe a slight negative correlation between the masking rate and the quality of learnt representations across all datasets. Due to the lack of any fine-tuning in Fixed SSL, the power of the pre-trained encoders in learning useful representations comes into play. However, in the PAMAP2 dataset, this correlation is weaker, which can be attributed to the type of sensors and the distribution of data. Given that PAMAP2 contains data from 3 accelerometers, 3 gyroscopes, and a heart rate sensor, there is more shared information among the input channels. Thus, even a subset of sensors can still retain the essential information about the activity. This finding further confirms our motivations in the theory section above (Section 3.2). An inverse trend is observed in SleepEDF, where higher masking corresponds to lower performance. Notably, the highest performance is achieved with a 50% masking rate, which hints at a U-shaped relationship for the fine-tuned variant. We do not observe such trends in the fixed variant.
Spatial Masking. As discussed earlier, spatial masking achieves the best performance in our experiments. We do not observe significant differences for most datasets when varying the number of spatial modalities. In particular, the mean performance peaks with 2, 3, and 1 sensors in the SLEEPEDF, PAMAP2, and WESAD datasets, respectively (however, the overlapping standard deviations do not allow for conclusive results). WESAD presents the largest differences between 1 and 2 sensors, with a decrease in performance after applying Fine-tuning but an increase after applying a Fixed encoder. In all datasets, there is a slight decrease in performance with the most sensors available.
These results validate the value of latent masking, especially for bigger datasets. We note that the SleepEDF dataset is five times larger than PAMAP2, and we could attribute its performance results with regard to masking to this size difference. In other words, SSL requires enough samples for pre-training, and in light of these results, we hypothesize that masking makes this task even harder by hiding information. Therefore, we expect the effect of masking to be stronger with larger pre-training datasets.

Label-efficiency
We investigate the efficacy of CroSSL in low-labelled data scenarios compared to its self-supervised and fully supervised counterparts. Figure 4 presents the average F1-score of CroSSL with fixed pretrained and also fine-tuned encoders, along with the fully supervised baseline, where the size of labelled data varies between 1% and 100% of the available labels for SleepEDF and PAMAP2, and between 10% and 100% for the WESAD dataset. We investigated the impact of the masking strategy and report the best one (spatial masking in all cases). For this experiment, we train the classifier over self-supervised encoders by using only a subset of labelled data.
Fixed CroSSL is extremely label-efficient. Using spatial masking, Fixed CroSSL achieves its optimal performance with less than half of the labels in each dataset (i.e., 10%, 20%, and 60% of the available labels in SleepEDF, PAMAP2, and WESAD). This showcases the capability of our pre-trained model, which can be combined with a simple MLP classifier to attain peak performance with limited labeled data. This feature is encouraging for deployment on edge devices that have limited resources and limited ability to train and maintain large machine-learning models. Compared to COCOA, CroSSL with the fixed encoder setup shows lower performance in lower labeled data regimes. We believe this happens because the applied masking leaves less data available for training in CroSSL.
Fine-tuned CroSSL (with both spatial and random masking) shows great improvement over the fully supervised model in the low-labeled data regime, specifically for SleepEDF and WESAD. Given only 1% of the labels, the fine-tuned model with spatial masking almost achieves its highest F1-score, while the supervised model needs at least 20% and 40% of the labels to reach the same score in SleepEDF and WESAD, respectively. PAMAP2 exhibits a higher standard deviation than the other datasets due to its smaller size and the variation among users in the respective test sets. This is because, for each evaluation fold, we included only one user in the test set, resulting in higher variation. In contrast, the SleepEDF dataset has multiple users dedicated to each test set, leading to lower variation.
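The label-fraction protocol described above can be sketched as follows. This is an illustration only: the text states that a subset of the labelled data is used, but does not specify how it is drawn, so the class-stratified sampling, the rule of keeping at least one example per class, and the helper name `subsample_labels` are all assumptions.

```python
import random
from collections import defaultdict

def subsample_labels(labels, fraction, rng):
    """Return sorted indices of a class-stratified subset covering `fraction` of the labels."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    selected = []
    for indices in by_class.values():
        # Assumption: keep at least one labelled example per class.
        keep = max(1, round(fraction * len(indices)))
        selected.extend(rng.sample(indices, keep))
    return sorted(selected)

# Hypothetical label vector for a small training set.
labels = ["wake"] * 50 + ["rem"] * 30 + ["deep"] * 20
for fraction in (0.01, 0.1, 0.2, 1.0):
    subset = subsample_labels(labels, fraction, random.Random(0))
    # A classifier would be trained on encoder outputs for `subset` only,
    # then scored (e.g. with the F1-score) on the held-out test users.
    print(f"{fraction:.0%} of labels -> {len(subset)} training examples")
```

With a fixed encoder only the classifier sees these labels, whereas with fine-tuning the same subset also updates the encoder weights.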

DISCUSSION AND LIMITATIONS
CroSSL puts forward a general ML framework for sensor time series that achieves SOTA performance in multimodal benchmark tasks spanning from activity recognition to sleep stage classification. Notably, CroSSL is data- and label-efficient, requiring only a small fraction of labeled samples to achieve performance on par with or better than supervised models. Most importantly, we proposed latent masking, which ensures that the model learns robust representations in an end-to-end manner, without requiring any data pre-processing. We showed that latent masking is effective and makes models more accurate and transferable.
In the current implementation of CroSSL, we used two masking strategies: spatial and random. Our results suggest that spatial masking is preferable. Future work could investigate temporal or spatiotemporal masking; however, it is not straightforward to design such masks due to the loss of sequence information in the latent space. Further, the design of custom masking strategies could leverage domain knowledge of the signals, such as existing interactions between the modalities [32]. Another exciting property of SSL methods is their ability to transfer across various tasks. Prior work [13] has evaluated this within the SSL paradigm, focusing on a single modality while changing the sensor positions, activities, and sampling rates.

CONCLUSION
Given the pervasiveness of sensor-enabled devices, there is increasing interest in designing applications that leverage multiple modalities of data. An important first step toward this is to design learning algorithms that can extract high-quality embeddings from multimodal data, even in the absence of labels. To this end, we presented CroSSL, a novel self-supervised learning technique to train global embeddings from multimodal sensor streams by leveraging the spatiotemporal correlations in them. To address the challenge of heterogeneous ubiquitous sensor computing applications, CroSSL employs sensor-specific encoders that can take different sample sizes. This makes the model invariant to the type of input modalities while remaining able to capture sensor-specific information. Our key findings are that CroSSL outperforms fully supervised and state-of-the-art self-supervised approaches on three challenging datasets while remaining robust to missing modalities. Using a masking-based technique, CroSSL forces the model to learn representations that do not depend on the presence of all modalities. Moreover, CroSSL is highly label-efficient and hence can be deployed in applications where data labeling is expensive.

Figure 1: The overview of the proposed architecture.

Figure 2: Overview of the regularization-based objective function within the CroSSL architecture. The panel annotations describe three terms: (i) one creates diverse global embeddings by maintaining the variance of each variable across the batch; (ii) one extracts the information that is shared between modalities by maximizing the similarity between the two embeddings produced for each sample; (iii) one forces disentanglement of the variables of the global embedding by bringing the covariance of the vector to zero.
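The three roles annotated in Figure 2 (maintaining per-variable variance across the batch, maximizing the similarity between the two embeddings of each sample, and driving the covariance between variables to zero) can be illustrated with a small sketch. It assumes a VICReg-style formulation with a hinge on the standard deviation, a mean squared distance as the similarity measure, and squared off-diagonal covariances; the exact form, the term weights, and all function names here are placeholders, not the authors' objective.

```python
import math

def column_means(batch):
    """Mean of each embedding variable across the batch."""
    n = len(batch)
    return [sum(row[j] for row in batch) / n for j in range(len(batch[0]))]

def variance_term(batch, target_std=1.0, eps=1e-4):
    """Hinge penalty when a variable's std across the batch drops below `target_std`."""
    n, means = len(batch), column_means(batch)
    penalty = 0.0
    for j, mean in enumerate(means):
        var = sum((row[j] - mean) ** 2 for row in batch) / (n - 1)
        penalty += max(0.0, target_std - math.sqrt(var + eps))
    return penalty / len(means)

def invariance_term(batch_a, batch_b):
    """Mean (over samples) squared distance between the two embeddings of each sample."""
    total = sum(
        (a - b) ** 2
        for row_a, row_b in zip(batch_a, batch_b)
        for a, b in zip(row_a, row_b)
    )
    return total / len(batch_a)

def covariance_term(batch):
    """Sum of squared off-diagonal covariances, scaled by the embedding size."""
    n, means = len(batch), column_means(batch)
    d = len(means)
    penalty = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((r[i] - means[i]) * (r[j] - means[j]) for r in batch) / (n - 1)
                penalty += cov ** 2
    return penalty / d

def regularized_loss(batch_a, batch_b, w_inv=1.0, w_var=1.0, w_cov=1.0):
    """Weighted sum of the three terms; the weights are placeholder assumptions."""
    return (
        w_inv * invariance_term(batch_a, batch_b)
        + w_var * (variance_term(batch_a) + variance_term(batch_b))
        + w_cov * (covariance_term(batch_a) + covariance_term(batch_b))
    )

# Two hypothetical batches of 2-dimensional global embeddings.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
collapsed = [[0.0, 0.0]] * 4
loss_good = regularized_loss(decorrelated, decorrelated)
loss_collapsed = regularized_loss(collapsed, collapsed)
```

Because none of the three terms compares a sample against other samples as negatives, an objective of this shape needs no negative-pair sampling; the variance term alone is what penalizes the collapsed batch in the example.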

Figure 3: Comparing random (a,b,c) and spatial (d,e,f) latent masking across the three datasets for different masking rates.

Figure 4: Comparing the efficiency of fixed and fine-tuned CroSSL setups against fully supervised model in low-labeled data

Table 1: Existing SSL methods in ubiquitous sensing.

Table 3: Performance comparison across three different setups: fully supervised, fixed SSL and fine-tuned SSL encoders.

Table 4: Evaluating across various missing modality scenarios.