Deep Learning for Time Series Classification and Extrinsic Regression: A Current Survey

Time Series Classification and Extrinsic Regression are important and challenging machine learning tasks. Deep learning has revolutionized natural language processing and computer vision and holds great promise in other fields, such as time series analysis, where the relevant features must often be abstracted from the raw data and are not known a priori. This paper surveys the current state of the art in the fast-moving field of deep learning for time series classification and extrinsic regression. We review the different network architectures and training methods used for these tasks and discuss the challenges and opportunities of applying deep learning to time series data. We also summarize two critical applications of time series classification and extrinsic regression: human activity recognition and satellite earth observation.


INTRODUCTION
Time series analysis has been identified as one of the ten most challenging research issues in the field of data mining in the 21st century [1]. Time series classification (TSC) is a key time series analysis task [2]. TSC builds a machine learning model to predict categorical class labels for data consisting of ordered sets of real-valued attributes. The many applications of time series analysis include human activity recognition [3-5], diagnosis based on electronic health records [6, 7], and systems monitoring problems [8]. The wide variety of dataset types in the University of California, Riverside (UCR) [9] and University of East Anglia (UEA) [8] benchmark archives further illustrates the breadth of TSC applications.
In the context of deep learning, a supervised learning model is a neural network that maps the input time series $x$ to a target variable through a composition of layers $f_L(\theta_L, f_{L-1}(\theta_{L-1}, \ldots, f_1(\theta_1, x)))$, where $f_i$ represents the non-linear function and $\theta_i$ denotes the parameters at layer $i$. For TSC, the neural network model is trained to map a time series dataset $X$ to a set of class labels $Y$ with $K$ class labels. After training, the neural network outputs a vector of $K$ values that estimates the probability of a series $x$ belonging to each class. This is typically achieved using the softmax activation function in the final layer of the neural network. The softmax function estimates probabilities for all of the classes such that they always sum to 1 across all classes. The cross-entropy loss is commonly used for training neural networks with softmax outputs, i.e., classification neural networks.
TSER, on the other hand, trains the neural network model to map a time series dataset $X$ to a set of numeric values $Y \subset \mathbb{R}$. Instead of outputting probabilities, a regression neural network outputs a numerical value for the time series, typically using a linear activation function in the final layer of the neural network. However, any non-linear function with a single-value output, such as sigmoid or ReLU, can also be used. A regression neural network typically trains using the mean square error or mean absolute error loss function; however, depending on the distribution of the target variable and the choice of final activation function, other loss functions can be used.

Fig. 1. Taxonomy of Deep Learning (DL) for TSC/TSER from the perspectives of network configuration and application domains.
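To make the two output configurations concrete, the following sketch (PyTorch, with illustrative sizes; the feature extractor producing `features` is assumed) contrasts a classification head trained with cross-entropy against a regression head trained with mean square error:

```python
import torch
import torch.nn as nn

# Assumed: some backbone has already embedded a batch of 32 series into 128 features.
features = torch.randn(32, 128)

# TSC head: K logits; softmax turns them into class probabilities that sum to 1.
num_classes = 5
clf_head = nn.Linear(128, num_classes)
logits = clf_head(features)
probs = torch.softmax(logits, dim=1)              # rows sum to 1 across the K classes
# nn.CrossEntropyLoss applies log-softmax internally, so it takes raw logits.
clf_loss = nn.CrossEntropyLoss()(logits, torch.randint(0, num_classes, (32,)))

# TSER head: a single numeric output with a linear activation, trained with MSE (or MAE).
reg_head = nn.Linear(128, 1)
y_hat = reg_head(features).squeeze(1)
reg_loss = nn.MSELoss()(y_hat, torch.randn(32))
```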

TSC and TSER
TSC is a fast-growing field, with hundreds of papers published every year [8, 9, 15, 25, 26]. The majority of work in TSC is non-deep learning based. In this survey, we focus on deep learning approaches and refer interested readers to Appendix A and benchmark papers [11, 25, 26] for more details on non-deep learning approaches. Most deep learning approaches to TSC have real-valued outputs that are mapped to a class label. TSER [10, 27] is a less widely studied task in which the predicted values are numeric rather than categorical. While the majority of the architectures covered in this survey were designed for TSC, it is important to note that most of them are trivial to adapt for TSER.
Deep learning-based TSC methods can be classified into two main types: generative and discriminative [28]. In the TSC community, generative methods are often considered model-based [25], aiming to understand and model the joint probability distribution of input series $X$ and output labels $Y$, denoted as $p(X, Y)$. Discriminative models, on the other hand, focus on modeling the conditional probability of output labels $Y$ given input series $X$, expressed as $p(Y|X)$.
Among generative models, Stacked Denoising Auto-encoders (SDAE) were proposed by Bengio et al. [29] to identify the salient structure of input data distributions, and Hu et al. [30] used the same model as a pre-training phase before training a classifier for time series tasks. A universal neural network encoder has been developed to convert variable-length time series to a fixed-length representation [31]. A Deep Belief Network (DBN) combined with a transfer learning method has also been used in an unsupervised manner to model the latent features of time series [32].
An Echo State Network (ESN) has been used to learn an appropriate time series representation by reconstructing the original raw time series prior to training the classifier [33]. Generative Adversarial Networks (GANs), which learn to generate new examples by training a generator against a discriminator that distinguishes real from synthetic examples, are another popular family of generative models. Various GANs have been developed for time series and are reviewed in a recent survey [34]. Implementing generative methods is often more complex due to the additional training step, and they are typically less efficient than discriminative methods, which directly map raw time series to class probability distributions. Due to these barriers, researchers tend to focus on discriminative methods, and this survey therefore mainly covers end-to-end discriminative approaches.

Taxonomy of Deep Learning in TSC and TSER
To provide an organized summary of the existing deep learning models for TSC, we propose a taxonomy that categorizes these models based on deep learning methods and application domains. This taxonomy is illustrated in Fig. 1. In Section 3, we review various network architectures used for TSC, including multilayer perceptrons, convolutional neural networks, recurrent neural networks, graph neural networks, and attention-based models. We also discuss refinements made to these models to improve their performance on time series tasks. Additionally, various types of self-supervised learning pretexts, such as contrastive learning and self-prediction, are explored in Section 4. We also review useful data augmentation and transfer learning strategies for time series data in Sections 5 and 6. In addition to methods, we summarize key applications of TSC and TSER in Section 7. These applications include human activity recognition and satellite earth observation, which are important and challenging tasks that can benefit from the use of deep learning models. Overall, our proposed taxonomy and the discussions in these sections provide a comprehensive overview of the current state of the art in deep learning for time series analysis and outline future research directions.

SUPERVISED MODELS
This section reviews the deep learning-based models for TSC and discusses their architectures, highlighting their strengths as well as their limitations. More details on deep model architectures and their adaptations to time series data are available in Appendix B.

Multi-Layer Perceptron (MLP)
The most straightforward neural network architecture is a fully connected network (FC), also called a multilayer perceptron (MLP). The numbers of layers and neurons are defined as hyperparameters in MLP models. However, studies such as auto-adaptive MLP [35] have attempted to determine the number of neurons in the hidden layers automatically, based on the nature of the training time series data. This allows the network to adapt to the characteristics of the training data and optimize its performance on the task at hand.
One of the main limitations of using multilayer perceptrons (MLPs) for time series data is that they are not well-suited to capturing the temporal dependencies in this type of data. MLPs are feedforward networks that process input data in a fixed and predetermined order, without considering the temporal relationships between the input values. Various studies have used MLPs alongside other feature extractors, such as Dynamic Time Warping (DTW), to address this problem [36, 37].
DTW-NN is a feedforward neural network that exploits DTW's elastic matching ability to dynamically align a layer's inputs to the weights, instead of using a fixed and predetermined input-to-weight mapping. This weight alignment replaces the standard dot product within a neuron with DTW. In this way, DTW-NN is able to tackle difficulties in time series recognition, such as temporal distortions and variable pattern lengths, within a feedforward architecture [37].
Similarly, Symbolic Aggregate Approximation (SAX) has been used to transform time series into a symbolic representation and produce sequences of words based on that representation [38]. The resulting word sequences are then used as input for training a two-layer MLP for classification.
Although the models mentioned above attempt to address the inability of MLPs to capture temporal dependencies, they remain limited in capturing time-invariant features [16]. Additionally, MLP models cannot process input data in a hierarchical or multi-scale manner. Time series data often exhibit patterns and structures at different scales, such as long-term trends and short-term fluctuations. MLP models fail to capture these patterns, as they are only able to process input data in a single, fixed-length representation. MLPs may also encounter difficulties with irregularly sampled time series, where observations are not uniformly recorded in time. Many other deep learning models are better suited to time series data, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers, which are specifically designed to capture the temporal dependencies and patterns in time series data.
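For reference, a minimal MLP classifier of the kind discussed above can be sketched as follows (a plain fully connected network on a flattened univariate series; the layer sizes are illustrative rather than taken from any particular paper):

```python
import torch.nn as nn

class MLPClassifier(nn.Module):
    """A plain MLP treats the series as one fixed-length vector, ignoring temporal order."""
    def __init__(self, series_length: int, num_classes: int, hidden: int = 500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                       # (batch, 1, length) -> (batch, length)
            nn.Linear(series_length, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, num_classes),     # logits; softmax applied by the loss
        )

    def forward(self, x):                       # x: (batch, 1, series_length)
        return self.net(x)
```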

CNN based models
Several improvements have been made to CNNs since the success of AlexNet in 2012 [39], such as using deeper networks, applying smaller and more efficient convolutional filters, adding pooling layers to reduce the dimensionality of the feature maps, and utilizing batch normalization to improve the stability of training [40]. CNNs have been demonstrated to be very successful in many domains, such as computer vision, speech recognition, and natural language processing [40, 41]. As a result of this success, researchers have also started adopting them for TSC. See Table 1 for a list of the CNN models reviewed in this paper.

3.2.1 Adapted CNNs for TSC and TSER. This section presents the first category, which we refer to as adapted CNNs for TSC and TSER. The papers discussed here are mostly direct adaptations without any particular preprocessing or mathematical characteristics, such as transforming the series to an image or using multi-scale convolution, and therefore do not fit into one of the other categories.
The first CNN for TSC was the Multi-Channel Deep Convolutional Neural Network (MC-DCNN) [42]. It handles multivariate data by applying convolutions independently to each input channel. Each input dimension undergoes two convolutional stages with ReLU activation, followed by max pooling. The outputs from all dimensions are concatenated and passed to a fully connected layer, which is then fed to a final softmax classifier. Similar to MC-DCNN, a three-layer convolutional neural network was proposed for human activity recognition (MC-CNN) [43]. Unlike MC-DCNN, this model applies 1D convolutions to all input channels simultaneously to capture the temporal and spatial relationships in the early stages. A two-stage version of the MC-CNN architecture was used by Zhao et al. [44] on the earliest version of the UCR Time Series Data Mining Archive. The authors also conducted an ablation study to evaluate the performance of the CNN models with differing numbers of convolution filters and pooling types.
Fully Convolutional Networks (FCN) [45] and Residual Networks (ResNet) [46] are two deep neural networks commonly used for image and video recognition that have been adapted for end-to-end TSC [16]. FCNs are a variant of CNNs designed to operate on inputs of arbitrary size rather than being constrained to fixed-size inputs like traditional CNNs. This is achieved by replacing the fully connected layers of a traditional CNN with a Global Average Pooling (GAP) layer [45]. FCN was adapted for univariate TSC [16]; like the original model, it contains three convolution blocks, each comprising a convolution layer followed by batch normalization and ReLU activation. The blocks use 128, 256, and 128 filters with filter lengths 8, 5, and 3, respectively. The output of the last convolution block is averaged by a GAP layer and passed to a final softmax classifier. The GAP layer reduces the spatial dimensions of the input while retaining channel-wise information, which allows it to be used in conjunction with a class activation map (CAM) [47] to highlight the regions of the input that are most important for the predicted class. This can provide useful insights into how the network makes its predictions and help identify potential areas for improvement. Similar to FCN, ResNet was also proposed in [16] for univariate TSC. ResNet is a deep architecture containing three residual blocks followed by a GAP layer and a softmax classifier. It uses residual connections between blocks to reduce the vanishing gradient effect that affects deep learning models. The structure of each residual block is similar to the FCN architecture, containing three convolution layers followed by batch normalization and ReLU activation. Each convolution layer uses 64 filters with filter lengths 8, 5, and 3, respectively. ResNet was found to be one of the most accurate deep learning TSC architectures on 85 univariate TSC datasets [15, 25]. Additionally, an integration of ResNet and FCN has been proposed to combine the strengths of both networks [48].
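Since the adapted FCN is fully specified above, it can be sketched directly; the snippet below follows the reported configuration (three blocks of 128/256/128 filters with lengths 8/5/3, batch normalization, ReLU, and GAP), while details such as padding are our own assumptions:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel):
    # padding of kernel // 2 keeps the series length (approximately) unchanged
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
    )

class FCN(nn.Module):
    """Sketch of the FCN adaptation of [16]: three conv blocks, GAP, linear classifier."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(in_channels, 128, 8),
            conv_block(128, 256, 5),
            conv_block(256, 128, 3),
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, x):                 # x: (batch, channels, length)
        z = self.blocks(x).mean(dim=2)    # Global Average Pooling over time
        return self.head(z)               # logits; softmax is applied in the loss
```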
In addition to adapting the network architecture, some research has focused on modifying the convolution kernel to better suit TSC tasks. Dilated convolutional neural networks (DCNNs) [49] are a type of CNN that uses dilated convolutions to increase the receptive field of the network without increasing the number of parameters. Dilated convolutions insert gaps between the elements of the kernel before performing convolution, thereby covering a larger area of the input. This allows the network to capture long-range dependencies in the data, making it well-suited to TSC tasks [50].
Recently, Disjoint-CNN [51] showed that factorizing 1D convolution kernels into disjoint temporal and spatial components yields accuracy improvements with almost no additional computational cost. Applying a disjoint temporal convolution followed by a spatial convolution behaves similarly to an Inverted Bottleneck [52]. As in the Inverted Bottleneck, the temporal convolutions expand the number of input channels, and the spatial convolutions later project the expanded hidden state back to the original size to capture the temporal and spatial interactions.
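A minimal sketch of this factorization (sizes are our own choices, not the published configuration) treats the multivariate series as a 2D input and applies a (1 × k) temporal kernel followed by an (n_channels × 1) spatial kernel:

```python
import torch
import torch.nn as nn

class DisjointConv(nn.Module):
    """Factorizes a k x n_channels kernel into a temporal then a spatial convolution,
    in the spirit of Disjoint-CNN [51] (sketch)."""
    def __init__(self, n_channels: int, n_filters: int, k: int = 8):
        super().__init__()
        self.temporal = nn.Conv2d(1, n_filters, kernel_size=(1, k), padding=(0, k // 2))
        self.spatial = nn.Conv2d(n_filters, n_filters, kernel_size=(n_channels, 1))

    def forward(self, x):                   # x: (batch, 1, n_channels, length)
        z = torch.relu(self.temporal(x))    # per-channel temporal patterns
        return torch.relu(self.spatial(z))  # cross-channel (spatial) interaction

x = torch.randn(32, 1, 6, 100)              # 6-channel series of length 100
out = DisjointConv(n_channels=6, n_filters=64)(x)   # -> (32, 64, 1, 101)
```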
3.2.2 Imaging time series. In TSC, a common approach is to convert the time series data into a fixed-length representation, such as a vector or matrix, which can then be input to a deep learning model. However, this can be challenging for time series data that vary in length or have complex temporal dependencies. One solution is to represent the time series in an image-like format, where each time step is treated as a separate channel in the image. This allows the model to learn from the spatial relationships within the data rather than just the temporal relationships. In this context, the term spatial refers to the relationships between different variables or features within a single time step of the time series.
As an alternative to using raw time series data as input, Wang and Oates encoded univariate time series into different types of images that were then processed by a regular CNN [53]. This image-based framework initiated a new branch of deep learning approaches for time series, which treat image transformation as a feature engineering technique. Wang and Oates presented two approaches for transforming a time series into an image. The first generates a Gramian Angular Field (GAF), while the second generates a Markov Transition Field (MTF). GAF represents the time series in polar coordinates and converts the resulting angles into a symmetric matrix, while MTF encodes the matrix entries using the transition probability of a data point from one time step to another [53]. In both cases, image generation increases the time series size, making the images potentially prohibitively large, so the authors propose strategies to reduce their size without losing too much information. The two types of images are then combined into a two-channel image, which produces better results than either image used separately. Finally, a Tiled CNN model is applied to classify the time series images. In other studies, a variety of transformation methods, including Recurrence Plots (RP) [54], Gramian Angular Difference Field (GADF) [55], bilinear interpolation [56], and Gramian Angular Summation Field (GASF) [57], have been proposed to transform time series into input images, in the expectation that the two-dimensional images reveal features and patterns not found in the one-dimensional sequence of the original time series.
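As an illustration of the angular encoding, a GASF image can be computed in a few lines of numpy; this is a bare sketch of the transform described in [53], without the size-reduction strategies:

```python
import numpy as np

def gasf(x: np.ndarray) -> np.ndarray:
    """Gramian Angular Summation Field: rescale to [-1, 1], map values to angles,
    then take pairwise cosines of angle sums (assumes a non-constant series)."""
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # rescale to [-1, 1]
    phi = np.arccos(x)                                # angular (polar) encoding
    return np.cos(phi[:, None] + phi[None, :])        # (T, T) image

image = gasf(np.sin(np.linspace(0, 6, 100)))          # a 100 x 100 "image"
```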
Hatami et al. [54] propose a representation method based on RP [58] to convert time series to 2D images for CNN-based TSC. In their study, time series are regarded as distinct recurrent behaviors, such as periodicities and irregular cyclicities, which are typical phenomena of dynamic systems. The main idea of the RP method is to reveal at which points some trajectories return to a previous state. Two-stage convolution and two fully connected layers are then applied to classify the images generated by RP. Subsequently, a pre-trained Inception v3 [59] was used to map GADF images into a 2048-dimensional vector space, with a final MLP of three hidden layers followed by a softmax activation function [55]. Following the same framework, the results of Chen and Shi [60] showed promising performance by converting univariate time series data to 2D images using relative positions between two time stamps. Following the same convention, three image encoding methods, GASF, GADF, and MTF, were used to encode MTS data into two-dimensional images [57]. They showed that the simple structure of ConvNet is sufficient for classification, as it performed equally well compared with the more complex structure of VGGNet.
Overall, representing time series data as 2D images can be difficult because preserving the temporal relationships and patterns in the data is challenging. The transformation can also result in a loss of information, making it difficult for the model to classify the data accurately. Chen and Shi [60] have also shown that specific transformation methods like GASF, GADF, and MTF do not significantly improve the prediction outcome.

3.2.3 Multi-Scale Operation. The papers discussed here apply a multi-scale convolutional kernel to the input series or apply regular convolutions on the input series at different scales. Multi-scale CNNs (MCNN) [61] and Time LeNet (t-LeNet) [62] were the first models to preprocess the input series so that convolutions are applied to multi-scale series rather than the raw series. The designs of both MCNN and t-LeNet were inspired by computer vision models, meaning they were adapted from models originally developed for image recognition. Such models may not be well-suited to TSC tasks and may not perform as well as models specifically designed for this purpose. One potential reason is their use of progressive pooling layers, commonly used in computer vision models to reduce the input data size and make it easier to process. These pooling layers may be less effective when applied to time series data and may limit the performance of the model.
MCNN has a simple architecture, comprising two convolutions and a pooling layer followed by a fully connected layer and a softmax layer. However, this approach involves heavy data preprocessing. Specifically, before any training, a sliding window is used to extract a time series subsequence, which then undergoes three transformations: (1) identity mapping, (2) down-sampling, and (3) smoothing, transforming a univariate input time series into a multivariate one. Finally, the transformed output is fed to the CNN model to train a classifier [61]. t-LeNet uses two data augmentation techniques, window slicing (WS) and window warping (WW), to prevent overfitting [62].
The WS method is identical to MCNN's data augmentation. The second technique, WW, employs a warping operation that squeezes or dilates the time series. WS is also adopted to ensure that subsequences of the same length are extracted for training the network, to deal with the resulting multi-length time series. Therefore, a given input time series of length $\ell$ is first dilated ($\times 2$) and then squeezed ($\times 1/2$) using WW, resulting in three time series of lengths $\ell$, $2\ell$, and $\ell/2$ that are fed to WS to extract equal-length subsequences for training. Finally, as both MCNN and t-LeNet predict a class for each extracted subsequence, majority voting is applied to obtain the class prediction for the full time series.
Inception was first proposed by Szegedy et al. [69] for end-to-end image classification. The network has since evolved into Inception-v4, in which Inception is coupled with residual connections to further improve performance [70].
Inspired by the Inception architecture, a multivariate convolutional neural network (MVCNN) was designed using multi-scale convolution kernels to find the optimal local construction [63]. MVCNN uses three scales of filters, 2 × 2, 3 × 3, and 5 × 5, to extract features of the interactions between sensors. A one-dimensional Inception model was used for supernova classification, using the light flux of a region in space as an input MTS for the network [64].
However, the authors limited their Inception architecture to the first version of this model [69]. The Inception-ResNet [71] architecture includes convolutional layers followed by Inception modules and residual blocks. The Inception modules are used to learn multiple scales and aspects of the data, allowing the network to capture more complex patterns. The residual blocks are then used to learn the residuals, or differences, between the input and output of the network, improving its performance.
InceptionTime [12] explores much larger filters than any previously proposed network for TSC in order to reach state-of-the-art performance on the UCR benchmark. InceptionTime is an ensemble of five randomly initialized inception network models, each consisting of two blocks of inception modules. Each inception module first reduces the dimensionality of a multivariate time series using a bottleneck layer with a length and stride of 1, maintaining the same series length. Then, 1D convolutions of different lengths are applied to the output of the bottleneck layer to extract patterns at different scales. In parallel, a max pooling layer followed by a bottleneck layer is applied to the original time series to increase the model's robustness to small perturbations. The outputs of the convolution and max pooling layers are stacked to form a new multivariate time series, which is passed to the next layer. Residual connections are used between each inception block to reduce the vanishing gradient effect. The output of the second inception block is passed to a GAP layer before being fed into a softmax classifier.
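The inception module just described can be sketched as follows; the kernel lengths and filter counts are illustrative stand-ins for the published configuration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """One InceptionTime-style module (sketch): a bottleneck, parallel convolutions
    of different lengths, and a max-pool branch, concatenated channel-wise [12]."""
    def __init__(self, in_ch: int, n_filters: int = 32, kernels=(39, 19, 9)):
        super().__init__()
        self.bottleneck = nn.Conv1d(in_ch, n_filters, 1)   # length and stride of 1
        self.convs = nn.ModuleList(
            nn.Conv1d(n_filters, n_filters, k, padding=k // 2) for k in kernels
        )
        self.pool_branch = nn.Sequential(       # robustness to small perturbations
            nn.MaxPool1d(3, stride=1, padding=1), nn.Conv1d(in_ch, n_filters, 1)
        )
        self.bn = nn.BatchNorm1d(n_filters * (len(kernels) + 1))

    def forward(self, x):                                  # x: (batch, in_ch, T)
        z = self.bottleneck(x)
        branches = [conv(z) for conv in self.convs] + [self.pool_branch(x)]
        return torch.relu(self.bn(torch.cat(branches, dim=1)))
```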
The strong performance of InceptionTime has inspired a number of extensions. Like InceptionTime, EEG-Inception [65] uses several inception layers and residual connections as its backbone; additionally, noise-addition-based data augmentation of EEG signals is proposed, which increases average accuracy. Inception-FCN [66] focuses on combining two well-known deep learning techniques, the Inception module and the Fully Convolutional Network [66]. In KDCTime [67], label smoothing (LSTime) and knowledge distillation (KDTime) were introduced for InceptionTime, with labels automatically generated while compressing the inference model. Additionally, knowledge distillation with calibration (KDC) in KDCTime offers two calibrating strategies: KDC by translating (KDCT) and KDC by reordering (KDCR). More recently, LITE [68] builds on InceptionTime using multiplexing, dilated, and custom filters.

Recurrent Neural Network
Recurrent Neural Networks (RNNs) are neural networks built with internal memory for working with time series and sequential data. Conceptually similar to feed-forward neural networks (FFNs), RNNs differ in their ability to handle variable-length inputs and produce variable-length outputs.
RNNs have been used to classify input series based on their dynamic behavior, using a sequence-to-sequence architecture in which each sub-series of the input is classified in a first step; the argmax function is then applied to the entire output, and the neuron with the highest rate determines the classification result. To improve model parallelization and capacity, [74] proposed a two-layer RNN: in the first layer, the input sequence is split across several independent RNNs to improve parallelization, and a second layer utilizes the first layer's outputs to capture long-term dependencies [74]. RNNs have also been used in hierarchical architectures [75, 76].
Hermans and Schrauwen showed that a deeper version of recurrent neural networks can perform hierarchical processing of complex temporal tasks and capture the structure of time series more naturally than a shallow version [76]. RNNs are usually trained iteratively using a procedure known as backpropagation through time (BPTT). When unfolded in time, RNNs look like very deep networks with shared parameters. With deeper neural layers in an RNN and weights shared across different RNN cells, the gradients are summed up at each time step to train the model. Thus, due to the chain rule, gradients undergo continuous matrix multiplication and either shrink exponentially to small values (vanishing gradients) or blow up to very large values (exploding gradients) [77]. These problems motivated the development of gated architectures such as long short-term memory (LSTM) [78] and the Gated Recurrent Unit (GRU) [79].

Long Short Term Memory (LSTM). LSTM addresses the common vanishing/exploding gradient issue in vanilla RNNs by integrating memory cells with gate control into their state dynamics [78]. Due to its design, LSTM is suited to problems involving sequence data, such as language translation [80], video representation learning [81], and image caption generation [82]. The TSC problem is no exception, and models mainly adopt an architecture similar to that used for language translation [80]. Sequence-to-Sequence with Attention (S2SwA) [83] incorporates two LSTMs, one encoder and one decoder, in a sequence-to-sequence fashion for TSC. In this model, the encoder LSTM accepts input time series of arbitrary lengths and extracts information from the raw data, based on which the decoder LSTM constructs fixed-length sequences that can be regarded as automatically extracted features for classification.
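The property S2SwA relies on, namely an encoder that maps series of arbitrary length to a fixed-length representation, can be illustrated with a plain LSTM (the hidden size is illustrative):

```python
import torch
import torch.nn as nn

# The final hidden state has the same shape regardless of the input length, so it
# can serve as an automatically extracted, fixed-length feature vector.
encoder = nn.LSTM(input_size=1, hidden_size=64, batch_first=True)

x_short = torch.randn(1, 50, 1)         # one univariate series of length 50
x_long = torch.randn(1, 300, 1)         # another of length 300

_, (h_short, _) = encoder(x_short)      # h: (num_layers, batch, hidden_size)
_, (h_long, _) = encoder(x_long)
assert h_short.shape == h_long.shape    # same fixed-length representation
```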

Gated Recurrent Unit (GRU). GRU, another widely-used variant of RNNs, shares similarities with LSTM in its ability to control information flow and memorize context across multiple time steps [79]. Similar to S2SwA [83], a sequence auto-encoder (SAE) based on GRU has been defined to deal with the TSC problem [84]. A fixed-size output is produced by processing the various input lengths using GRU as the encoder and decoder. The model's accuracy was also improved by pre-training the parameters on massive unlabeled data. Unlike CNNs, RNNs are designed to predict future values from the observations seen so far, which allows them to capture the dynamic nature of time series data.
Combining the strengths of CNNs and RNNs makes it possible to learn both spatial and temporal features from the time series data, improving the model's performance for TSC. Additionally, the two models can be trained together, allowing them to learn from each other and improve the overall performance of the model.
Various extensions, such as MLSTM-FCN [85], TapNet [86], and SMATE [87], were later proposed to deal with multivariate time series data. MLSTM-FCN extends the univariate LSTM-FCN model [88] to the multivariate case. Like LSTM-FCN, the multivariate version comprises LSTM blocks and fully convolutional blocks for extracting features from the input series. A squeeze-and-excite block is also added to the FCN block and can execute a form of self-attention on the output feature maps of previous layers [85]. Two further proposals for multivariate TSC are the Time series attentional prototype Network (TapNet) and Semi-supervised Spatio-temporal (SMATE) [86, 87]. These methods combine, and seek to leverage, the relative strengths of both traditional distance-based and deep learning approaches.
MLSTM-FCN, TapNet, and SMATE all use dual-network architectures: the input is fed separately into the CNN and RNN branches, and their outputs are concatenated before the fully connected layer for the final task. However, one branch cannot fully use the hidden states of the other during feature extraction, since the final classification results are generated only by concatenating the outputs of the two branches. This motivates architectures such as GCRNN [89] and CNN-LSTM [90], which aim to integrate CNNs and RNNs in a layer-wise fashion.
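A minimal dual-network sketch (illustrative sizes, not the exact MLSTM-FCN configuration) makes this late concatenation, and hence the limitation noted above, explicit:

```python
import torch
import torch.nn as nn

class DualBranch(nn.Module):
    """Dual-network sketch in the spirit of MLSTM-FCN [85]: an LSTM branch and a
    convolutional branch run in parallel; their outputs only meet at the head."""
    def __init__(self, n_channels: int, num_classes: int):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, 128, batch_first=True)
        self.fcn = nn.Sequential(
            nn.Conv1d(n_channels, 128, 8, padding=4), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, 3, padding=1), nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.head = nn.Linear(128 + 128, num_classes)

    def forward(self, x):                         # x: (batch, n_channels, T)
        _, (h, _) = self.lstm(x.transpose(1, 2))  # LSTM expects (batch, T, channels)
        conv = self.fcn(x).mean(dim=2)            # GAP over time
        return self.head(torch.cat([h[-1], conv], dim=1))
```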
While recurrent neural networks are commonly used for time series forecasting, only a few studies have applied them to TSC, mainly for four reasons: (1) RNNs typically struggle with vanishing and exploding gradients when trained on long time series [91]; (2) RNNs are considered difficult to train and parallelize, so researchers are less likely to use them as they are computationally expensive [77]; (3) recurrent architectures are designed mainly to learn from past data to make predictions about the future [28]; and (4) RNN models can fail to effectively capture and utilize long-range dependencies in long sequences [83].

Attention based model
Despite the excellent performance of CNN models at capturing local temporal and spatial correlations, these models cannot effectively capture and utilize long-range dependencies. Additionally, they only consider the local order of data points rather than the overall order of all data points. Therefore, many recent studies have embedded recurrent neural networks (RNNs) such as LSTMs alongside CNNs to capture this information [85, 86, 88]. The disadvantage of RNN-based models is that they are computationally expensive and their capability to capture long-range dependencies is limited [18, 92]. Attention models, on the other hand, can capture long-range dependencies, and their broader receptive fields provide more contextual information, which can improve the models' learning capacity. The attention mechanism aims to enhance a network's representation ability by focusing on essential features and suppressing unnecessary ones.
Not surprisingly, with the success of attention models in natural language processing [92, 93], many studies have attempted to bring the power of attention models to other domains, such as computer vision [94] and time series analysis [18, 19, 95-97]. Table 2 presents a list of the attention-based models reviewed in this paper.
3.4.1 Self-Attention. Self-attention has been demonstrated to be effective in various natural language processing tasks due to its ability to capture long-term dependencies in text [92]. Recently, it has also been shown to be effective for TSC tasks [18, 98-100]. As mentioned above, the self-attention module was originally embedded in encoder-decoder models to improve model performance; for TSC, however, only the encoder with its self-attention module is used.
Early TSC models follow the same backbone as natural language processing models and use recurrent models such as RNNs [101], GRUs [98], and LSTMs [102, 103] to encode the input series. For example, the Multi-View Attention Network (MuVAN) applies bidirectional GRUs independently to each input dimension as the encoder and then feeds all the representations into a self-attention block [98].
As a result of the excellent performance of CNN models, many studies have attempted to encode the time series using CNNs before applying attention [18, 99, 104, 105]. The Cross Attention Stabilized Fully Convolutional Neural Network (CA-SFCN) [18] and the Locality Aware eXplainable Convolutional ATtention network (LAXCAT) [99] applied the self-attention mechanism to leverage long-term dependencies for the MTSC task. CA-SFCN combines FCN with two types of self-attention, temporal attention (TA) and variable attention (VA), which interact to capture long-range dependencies and variable interactions. LAXCAT also used temporal and variable attention to identify informative variables and the time intervals in which they have informative patterns for classification. WaveletDTW Hybrid attEntion Networks (WHEN) [106] integrate two attention mechanisms, wavelet attention and DTW attention, into a BiLSTM to enhance model performance. Wavelet attention leverages wavelets to compute attention scores, specifically targeting the dynamic frequency components of nonstationary time series, while DTW attention employs the DTW distance to calculate attention scores, addressing the challenge of time distortion across multiple time series.
Several self-attention models have been developed to improve network performance [107, 108], including Squeeze-and-Excitation (SE) [109], which focuses on channel attention and is often used to classify time series data [85, 100, 110]. The SE block allows the whole network to use global information to selectively focus on informative feature maps and suppress less important ones [109]. More importantly, the SE block increases the quality of the shared lower-level representations in the early layers and becomes increasingly specialized when responding to different inputs in later layers. The weight of each feature map is automatically learned at each layer of the network, and the SE block can boost feature discrimination throughout the whole network. The Multi-scale Attention Convolutional Neural Network (MACNN) [100] applies convolutions of different kernel sizes to capture information at different scales along the time axis, generating feature maps at differing scales. An SE block is then used to enhance useful feature maps and suppress less useful ones by automatically learning each feature map's importance.
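Because the SE block is compact, it can be sketched directly for 1D feature maps (the reduction ratio is illustrative):

```python
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-Excitation [109] for 1D feature maps: global context is squeezed
    into per-channel statistics, and learned excitation weights rescale each map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, channels, T)
        w = self.fc(x.mean(dim=2))         # squeeze: average over time
        return x * w.unsqueeze(2)          # excite: reweight the feature maps
```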

3.4.2 Transformers. The impressive performance of multi-headed attention has led to numerous attempts to adapt it to the TSC domain. Transformers for classification usually employ a simple encoder structure consisting of attention and feed-forward layers. SAnD (Simply Attend and Diagnose) [111] was the first architecture to adopt a multi-head attention mechanism similar to the vanilla transformer [92] to classify clinical time series.
The model uses both positional encoding and a dense interpolation embedding technique to incorporate temporal order into representation learning. In another study, which classified vibration signals [112], time-frequency features such as frequency coefficients and Short Time Fourier Transform (STFT) spectrums were used as input embeddings to the transformers. A multi-head attention-based model was applied to raw optical satellite TSC using Gaussian Process Interpolation [113] embedding and outperformed convolutional and recurrent neural networks [114].
Gated Transformer Networks (GTN) [115] use two-tower multi-headed attention to capture discriminative information from the input series, merging the outputs of the two towers using a learnable matrix named gating.
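A minimal encoder-only transformer classifier in this style, with a linear input embedding and sinusoidal positional encoding, might look as follows; the layer sizes are our own assumptions, not those of SAnD or GTN:

```python
import math
import torch
import torch.nn as nn

class TransformerTSC(nn.Module):
    """Encoder-only transformer classifier (sketch): embedding + positional
    encoding + self-attention layers + average-pooled classification head."""
    def __init__(self, n_channels: int, num_classes: int, d_model: int = 64):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Linear(n_channels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def positional_encoding(self, length: int, device) -> torch.Tensor:
        pos = torch.arange(length, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, self.d_model, 2, device=device)
                        * (-math.log(10000.0) / self.d_model))
        pe = torch.zeros(length, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, x):                       # x: (batch, T, n_channels)
        z = self.embed(x) + self.positional_encoding(x.size(1), x.device)
        z = self.encoder(z).mean(dim=1)         # average over time steps
        return self.head(z)                     # class logits
```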

Graph Neural Networks
While both CNNs and RNNs perform well on Euclidean data, many time series problems involve data that are more naturally represented as graphs [118]. For example, in a network of sensors, the sensors may be irregularly spaced rather than forming a regular grid. A graph representation of the data collected by such a network can model this irregular layout more accurately than a Euclidean space. However, using standard deep learning algorithms to learn from graph structures is challenging [119]; for example, nodes may have varying numbers of neighbouring nodes, making it difficult to apply a convolution operation.
Graph Neural Networks (GNNs) [120] adapt deep learning techniques to the graph domain. Much of the early research using GNNs for time series analysis concentrated on forecasting tasks [118]; however, recent works consider GNNs for TSC [121, 122] and TSER [123] tasks. A list of the GNN models reviewed in this paper is provided in Table 3. Time2Graph+ [124] transforms each time series into a shapelet graph. Shapelets are extracted from the time series and form the graph nodes, while the graph edges are weighted based on the transition probabilities between shapelets. Once the input graphs have been constructed, a graph attention network is used to create a representation of the time series that is fed into a classifier.
SimTSC [136] constructs a pairwise similarity graph in which each time series forms a node and edge weights are computed from the DTW distance measure. Node attributes are generated using a feature vector encoder, and GNN operations then enhance the node features based on similarities between adjacent time series. These representations are used for the final classification step, which produces a classification for each node. LB-SimTSC [121] replaces the expensive DTW computation with the LB-Keogh lower-bounding method [140].
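For illustration, the pairwise distance underlying SimTSC's graph can be computed with the standard DTW dynamic program; the edge-weighting scheme applied on top of it is a design choice (one option is shown in the comment):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Plain dynamic-programming DTW between two univariate series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

# One possible edge weight between series x_i and x_j:
#   w_ij = 1 / (1 + dtw_distance(x_i, x_j))
```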
Spatiotemporal GCNs are often used to analyse sensor arrays, where the graph structure models the physical layout of the sensors. A common example is electroencephalogram (EEG) data, where the locations of EEG electrodes are represented as a graph that is used to analyse the EEG signal. Applications include epilepsy detection [130], seizure detection [125, 131], emotion recognition [126], and sleep classification [127]. Beyond EEG, GCNs have also been applied to engineering applications such as machine fault diagnosis [129], slope deformation prediction [128], and seismic activity prediction [123]. MTPool [135] uses a spatiotemporal GCN for multivariate time series classification: each channel of the time series is represented by a node in the graph, and the graph edges model the correlations between channels. The GCN is combined with temporal convolutions and a hierarchical graph pooling technique.
Spatiotemporal GNNs have also been used for object-based image analysis [133] and semantic segmentation [137] of image time series. However, these assume that the labels and spatial relationships are static over time, whereas in many cases both may change. Spatiotemporal graphs (STGs), which include temporal edges as well as spatial edges, can model these dynamic relationships [139]. In STGs, each node represents an object at one timestamp. Spatial edges connect the object to adjacent objects, and temporal edges connect two objects in consecutive images if they share common pixels.

SELF-SUPERVISED MODELS
Obtaining labeled data for large time series datasets poses significant costs and challenges. Machine learning models trained on large labeled time series datasets often exhibit superior performance compared to models trained on sparsely labeled or small datasets, or without supervision, which leads to suboptimal performance across various time series machine learning tasks [23, 142]. As a result, rather than depending on high-quality annotations for large datasets, researchers and practitioners are increasingly shifting their focus toward self-supervised representation learning for time series.
Self-supervised representation learning, a subfield of machine learning, focuses on learning representations from data without explicit supervision [24]. In contrast to supervised learning, which relies on labeled data, self-supervised learning methods utilize the inherent structure of the data to learn valuable representations in an unsupervised manner.
The learned representations can then be used for a variety of downstream tasks, including classification, anomaly detection, and forecasting. This survey specifically emphasizes classification as the downstream task. We categorize self-supervised learning approaches for TSC into three groups based on their pretext tasks. Table 4 lists the self-supervised models reviewed in this paper.

Contrastive Learning
Contrastive learning involves a model learning to differentiate between positive and negative time series examples. Time-Contrastive Learning (TCL) [143], Scalable Representation Learning (SRL, or T-Loss) [144], and Temporal Neighborhood Coding (TNC) [145] apply subsequence-based sampling and assume that distant segments are negative pairs while neighboring segments are positive pairs. TNC takes advantage of the local smoothness of a signal's generative process to define neighborhoods in time with stationary properties, further improving the sampling quality for the contrastive loss function. TS2Vec [23] uses contrastive learning to obtain robust contextual representations for each timestamp in a hierarchical manner. It randomly samples two overlapping subseries from the input and encourages consistency of the contextual representations on the common segment. The encoder is optimized using both a temporal contrastive loss and an instance-wise contrastive loss.
In addition to the subsequence-based methods, other models employ instance-based sampling [21, 142, 146-149], treating each sample individually to generate positive and negative examples for the contrastive loss. Time-series Temporal and Contextual Contrasting (TS-TCC) [21] uses weak and strong augmentations to transform the input series into two views and then uses a temporal contrasting module to learn robust temporal representations. A contextual contrasting module is then built upon the contexts from the temporal contrasting module and aims to maximize the similarity among contexts of the same sample while minimizing similarity among contexts of different samples. Similarly, TimeCLR [147] introduces DTW-based data augmentation to enhance robustness against phase shift and amplitude change.
Bilinear Temporal-Spectral Fusion (BTSF) [142] uses simple dropout as its augmentation method and aims to incorporate spectral information into the feature representation. Similarly, Time-Frequency Consistency (TF-C) [148] is a self-supervised learning method that leverages the frequency domain to achieve better representations. It proposes that the time-based and frequency-based representations learned from the same time series sample should be closer to each other in the time-frequency space than representations of different time series samples.
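Many of these instance-based methods build on a contrastive loss of the NT-Xent family; a generic sketch (not the exact loss of any single method above) looks as follows:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1[i] and z2[i] are embeddings of two views of series i (positives);
    all other series in the batch act as negatives."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2N, d), unit-norm rows
    sim = z @ z.t() / tau                         # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)          # positives act as the "correct class"
```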

Self-Prediction
The primary objective of self-prediction-based self-supervised models is to reconstruct the input, or a representation of the input. Following the success of models like BERT [93], studies have explored transformer-based self-supervised learning methods for TSC [19, 22, 97, 150-152]. BERT-inspired Neural Data Representations (BENDR) [97] uses the transformer structure to model EEG sequences and shows that it can effectively handle massive amounts of EEG data recorded with differing hardware. Another study, Voice-to-Series with Transformer-based Attention (V2Sa) [22], utilizes a large-scale pre-trained speech processing model for TSC.

Other Pretext tasks
While many pretext tasks in self-supervised learning are contrastive or self-predictive, some tasks are tailored specifically to time series data. In image-based self-supervised learning, synthetic transformations (augmentations) of an image are created, and the model learns to contrast the image and its transforms with other images in the training data, which works well for object interpretation. However, time series analysis fundamentally differs from vision or natural language processing with respect to the definition of meaningful self-supervised learning tasks.
Guided by this insight, Foumani et al. [24] introduce Series2Vec, a novel self-supervised representation learning approach. Unlike other contrastive self-supervised methods for time series, which carry the risk that positive sample variants are less similar to the anchor sample than series in the negative set, Series2Vec is trained to predict the similarity between two series in both the temporal and spectral domains through a self-supervised task. Series2Vec relies primarily on the consistency of the unsupervised similarity step, rather than the intrinsic quality of the similarity measure, and requires no hand-crafted data augmentation. Pre-trained H-InceptionTime (PHIT) [153] is pre-trained using a novel pretext task designed to identify the originating dataset of each time series sample. The objective is to produce flexible convolution filters that can be applied across diverse datasets. PHIT has also been shown to mitigate overfitting on small datasets.

DATA AUGMENTATION
In the field of deep learning, data augmentation has emerged as an important tool for enhancing performance, particularly in scenarios where the availability of training data is limited [154]. Originally proposed in computer vision, data augmentation involves a variety of transformations of images, such as cropping, rotating, flipping, and applying filters like blurring and sharpening. These transformations introduce a diverse range of scenarios into the training data, thereby aiding the development of more robust and generalizable models. However, the direct application of these image-based augmentation techniques to time series data often proves inadequate or inappropriate; operations like rotation may disrupt the intrinsic temporal structure of time series data.
The challenge of overfitting is particularly pronounced for deep learning TSC models. These models have a high number of trainable parameters, which can lead to a model that performs well on training data but fails to generalize to unseen data. In such cases, data augmentation can be a valuable strategy: it offers an alternative to the costly and sometimes impractical approach of collecting additional real-world data, since generating synthetic samples from existing datasets effectively increases the size and variety of the training data. The following paragraphs detail the methods that have been investigated for producing synthetic time series for data augmentation.
Random Transformations. Several augmentations have been developed for the magnitude domain. Jittering, as explored by Um et al. [155], involves adding random noise to the time series. Another method, flipping [156], reverses the time series values. Scaling multiplies the time series by a factor drawn from a Gaussian distribution. Magnitude warping, which shares similarities with scaling, distorts the series along a smoothly varying curve. For time-domain transformations, permutation algorithms play a significant role; for example, the slicing transformation removes a sub-sequence from the series. There are also various warping methods, such as Random Warping [157], Time Warping [155], Time Stretching [158], and Time Perturbation [159], each introducing a different form of distortion to the time series. Finally, transformations in the frequency domain often utilize the Fourier transform; for example, Gao et al. [160] introduce perturbations to both the magnitude and phase spectrum following a Fourier transform.
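A few of these magnitude-domain transformations are simple enough to sketch directly (parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x: np.ndarray, sigma: float = 0.03) -> np.ndarray:
    """Add Gaussian noise to the series [155]."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def scaling(x: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Multiply the whole series by a random factor drawn from a Gaussian."""
    return x * rng.normal(1.0, sigma)

def flipping(x: np.ndarray) -> np.ndarray:
    """Invert the series values (one common reading of flipping [156])."""
    return -x
```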
Window methods. A primary approach in window methods is to create new time series by combining segments from various series of the same class, enriching the data pool with a variety of samples. Window slicing, introduced by Cui et al. [161], divides a time series into smaller segments, with each segment retaining the class label of the original series. These segments are then used to train classifiers, offering a detailed view of the data. During classification, each segment is evaluated individually, and a collective decision on the final label is reached through a voting scheme among the slices. Another technique is window warping, based on the DTW algorithm. This method adjusts segments of a time series along the temporal axis, either stretching or compressing them, introducing variability in the time dimension of the data. Le Guennec et al. [162] provide examples of the application of both window slicing and window warping, showcasing their effectiveness in enhancing the diversity and representativeness of time series datasets.
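Both window techniques can be sketched in a few lines of numpy (the slice ratio and warp factor are illustrative):

```python
import numpy as np

def window_slice(x: np.ndarray, ratio: float = 0.9) -> np.ndarray:
    """Window slicing [161]: crop a random contiguous sub-sequence, which keeps
    the class label of the full series."""
    w = int(len(x) * ratio)
    start = np.random.randint(0, len(x) - w + 1)
    return x[start:start + w]

def window_warp(x: np.ndarray, factor: float = 2.0) -> np.ndarray:
    """Window warping [162] (sketch): stretch a random window along the time
    axis by `factor`, then resample the series back to its original length."""
    n = len(x)
    lo, hi = sorted(np.random.choice(np.arange(1, n - 1), size=2, replace=False))
    warped = np.concatenate([
        x[:lo],
        np.interp(np.linspace(lo, hi, int((hi - lo) * factor)), np.arange(n), x),
        x[hi:],
    ])
    return np.interp(np.linspace(0, len(warped) - 1, n),
                     np.arange(len(warped)), warped)
```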
Averaging methods. Averaging methods in time series data augmentation combine multiple series to form a new, unified series. This is more difficult than it might seem, as it requires careful consideration of factors like noise and distortions in both the time and magnitude dimensions of the data. In this context, weighted Dynamic Time Warping Barycenter Averaging (wDBA), introduced by Forestier et al. [163], provides an averaging method that aligns time series in a way that accounts for their temporal dynamics. A practical application of wDBA is illustrated in the study by Ismail Fawaz et al. [164], where it is employed in conjunction with a ResNet classifier, demonstrating its effectiveness.
Additionally, Terefe et al. [165] use an auto-encoder to average a set of time series. This represents a more advanced approach to time series data augmentation, exploiting the auto-encoder's capacity for learning and reconstructing data to generate averaged representations of time series.
Selection of data augmentation methods.The selection of the appropriate data augmentation technique is critical and must be adapted to the specific characteristics of the dataset and the architecture of the neural network being used.

Studies such as those conducted by Iwana et al. [166], Pialla et al. [167], and Gao et al. [168] highlight the complexity of this task. They demonstrate that the effectiveness of augmentation techniques can vary significantly across datasets and neural network architectures; consequently, a method that proves effective in one scenario may not yield similar results in another. Practitioners in the field of TSC must therefore engage in a careful and informed process of method selection and tuning. While the array of available data augmentation techniques offers a comprehensive toolkit for tackling the challenges of limited data and overfitting, their successful application depends heavily on a nuanced understanding of both the methods themselves and the specific demands of the task at hand.

TRANSFER LEARNING
Transfer learning, initially popularized in the field of computer vision, is increasingly relevant to TSC. In computer vision, this approach uses a network pre-trained on a large dataset, such as ImageNet [169], as a starting point rather than initializing with random network weights. This method is related to the concept of foundation or base models, which are large-scale machine learning models trained on extensive data, often using self-supervised or semi-supervised learning, and adaptable to a wide array of tasks. The principle of transfer learning is also closely associated with domain adaptation, which focuses on applying a model trained on a source data distribution to a different, but related, target data distribution.
This approach is crucial in leveraging pre-trained models for various applications, particularly in scenarios where data is scarce or specific to certain domains.
In the context of TSC, insights have been contributed by the work of Ismail Fawaz et al. [170], who conducted a study using the UCR archive. Their extensive experiments demonstrated that transfer learning can lead to positive or negative outcomes depending on the datasets chosen for transfer. This finding underscores the importance of the relationship between source and target datasets for transfer learning efficacy. Ismail Fawaz et al. [170] also introduced an approach to predict the success of transfer learning in TSC by using DTW to measure similarities between datasets.
This metric serves as a guide to select the most appropriate source dataset for a given target dataset, thereby enhancing accuracy in a majority of cases.
Other researchers have also explored transfer learning for TSC. Spiegel's [171] work on using dissimilarity spaces to enrich feature representations in TSC set a precedent for employing unconventional data sources. This approach of enhancing learning with diverse data types finds a parallel in the method of Li et al. [172], which leverages sensor modality labels from various fields to train a deep network, emphasizing the importance of versatile data in transfer learning.
Building on the concept of data diversity, Rotem et al. [173] pushed the boundaries further by generating a synthetic univariate time series dataset for transfer learning. This synthetic dataset, used for regression tasks, underscores the potential of artificial data in overcoming the limitations of real-world datasets. Furthermore, Senanayaka et al. [174] introduced the similarity-based multi-source transfer learning (SiMuS-TL) approach: by establishing a 'mixed domain' to model similarities among the various sources, they demonstrated the effectiveness of carefully selected and related data sources in transfer learning. Finally, Kashiparekh et al. [175], with their ConvTimeNet (CTN), focused on the adaptability of pre-trained networks across diverse time scales.
While these studies collectively advance our understanding of transfer learning in TSC, the field remains open for further investigation. A key challenge lies in determining the most suitable source models for transfer, a task complicated by the relative scarcity of large, curated, and annotated time series datasets compared to computer vision. This scarcity restricts the utility of transfer learning in TSC, as extensive and diverse datasets are crucial for developing robust and generalizable models. Furthermore, the question of developing filters that are generic enough to be effective across a wide range of applications remains unresolved; this is critical for the success of transfer learning, as the applicability of a pre-trained model to new tasks depends on the universality of its learned features. Additionally, the strategy of whether to freeze certain layers of the network during transfer or to fine-tune the entire network warrants deeper exploration.
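In practice, the freeze-versus-fine-tune choice reduces to deciding which parameters remain trainable. A sketch, reusing the FCN class outlined in the CNN section (the class name and sizes are our own assumptions), is:

```python
import torch.nn as nn

# Assumed: an FCN pre-trained on a source TSC dataset (weights already loaded).
model = FCN(in_channels=1, num_classes=10)       # source model
model.head = nn.Linear(128, 4)                   # new head for a 4-class target task

for p in model.blocks.parameters():              # option 1: freeze the backbone and
    p.requires_grad = False                      # train only the new head
# Option 2: leave all parameters trainable and fine-tune the entire network,
# typically with a smaller learning rate.
```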

APPLICATIONS -RECENT DEVELOPMENTS AND CHALLENGES
TSC and TSER techniques have been used to analyze and model time-dependent data in a wide range of applications.
Due to the extensive range of applications that use TSC and TSER, it is infeasible to cover them all in detail in a single review. In this survey, we therefore focus on just two applications: human activity recognition and satellite Earth observation. (References to recent reviews are provided for the other applications mentioned above.) These are two important but quite different domains, chosen to give the reader an idea of the diversity of time series uses in deep learning. The following sections provide an overview of the use of TSC and TSER, the latest developments, and the challenges in these two applications.

Human Activity Recognition
Human activity recognition (HAR) is the identification or monitoring of human activity through the analysis of data collected by sensors or other instruments [185]. The recent growth of wearable technologies and the Internet of Things has resulted not only in the collection of large volumes of activity data [186], but also in the easy deployment of applications utilising this data to improve the safety and quality of human life [5,185]. HAR is therefore an important field of research, with applications including healthcare, fitness monitoring, smart homes [187], and assisted living [188].
Devices used to collect HAR data can be categorised as visual or sensor-based [4,5]. Sensor-based devices can be further categorised as object sensors (for example, RFIDs embedded into objects), ambient sensors (motion sensors, WiFi, or Bluetooth devices in fixed locations), and wearable sensors [4], including smartphones [3]. The majority of HAR studies use data from wearable sensors or visual devices [185]. However, human activity recognition from visual device data requires computer vision techniques and is therefore out of scope for this review. Accordingly, this section reviews wearable sensor-based methods for HAR. For reviews of vision-based HAR, refer to Kong and Fu [189] or Zhang et al. [190].
The main sensors used in wearable devices are accelerometers, gyroscopes, and magnetic sensors [191], each of which collects three-dimensional spatial data over time. Inertial measurement units (IMUs) are wearable devices that combine all three sensors in one unit [192,193]. Wearable device studies typically collect data from multiple IMUs located on different parts of the body [194,195]. To create a dataset suitable for HAR modelling, the sensor data is split into (usually equally sized) time windows [196], as sketched below. The task is then to learn a function that maps the multivariate sensor data for each time window to a set of activities. Thus, the data forms multivariate time series suited to TSC.
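A minimal sketch of this windowing step, assuming integer-coded activity labels and the common majority-label convention:

```python
import numpy as np

def sliding_windows(data: np.ndarray, labels: np.ndarray,
                    width: int, stride: int):
    """Segment a (timesteps, channels) sensor stream into fixed-width windows.

    Each window is labelled with the majority activity label it covers,
    a common convention in the HAR literature. Labels must be integers.
    """
    X, y = [], []
    for start in range(0, len(data) - width + 1, stride):
        X.append(data[start:start + width])
        y.append(np.bincount(labels[start:start + width]).argmax())
    return np.stack(X), np.array(y)

# Example: a 50 Hz 6-channel accelerometer + gyroscope stream segmented into
# 2.56 s windows (128 samples) with 50% overlap:
# X, y = sliding_windows(stream, activity_ids, width=128, stride=64)
```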
Given the broad scope of our survey, this section necessarily provides only a brief overview of the studies using deep learning for HAR. However, several surveys provide a more in-depth review of machine learning and deep learning for HAR. Lara and Labrador [196] provide a comprehensive introduction to HAR, including the machine learning methods used and the principal issues and challenges. Both Nweke et al. [3] and Wang et al. [4] summarise deep learning methods, highlighting their advantages and limitations. Chen et al. [5] discuss challenges in HAR and the appropriate deep learning methods for addressing each challenge; they also provide a comprehensive list of publicly-available HAR datasets. Gu et al. [197] focus on deep learning methods, reviewing preprocessing and evaluation techniques as well as the deep learning models.
The deep learning methods used for HAR include CNNs and RNNs, as well as hybrid CNN-RNN models. While some of the models include an attention module, we did not find any studies proposing a pure attention or transformer model. A summary of the studies reviewed and the type of model built is provided in table 5. Hammerla et al. [198] compared several deep learning models for HAR, including three LSTM variants, a CNN model, and a DNN model. They found that a bi-directional LSTM performed best on naturalistic datasets where long-term effects are important. However, they found some applications need to focus on short-term movement patterns and suggested CNNs are more appropriate for these applications. Thus, research across all model types is beneficial for the ongoing development of models for HAR applications.
7.1.1 Convolutional neural networks. Ronao et al. [202] performed a comprehensive evaluation of CNN models for HAR, evaluating the effect of changing the number of layers, the number of filters, and the filter sizes. The input data was collected from smartphone accelerometer and gyroscope sensors. Ignatov [206] used a one-layer CNN and augmented the extracted features with statistical features before passing them to fully-connected layers. The architecture was effective with short time series (1 second), making it useful for real-time activity modelling.
One drawback of the above method is that it forces weight-sharing across all the input features. This may not be optimal, especially when using data collected from multiple devices. In this case, using a separate CNN for each device [207] allows independent weighting of the features. Similarly, as each sensor is typically tri-axial, a separate CNN can be used for each axis [199,212]. The features extracted by each CNN are then concatenated and processed either by fully-connected layers [199] or by an attention head [212].
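The sketch below illustrates this per-device (or per-axis) branching pattern; the layer sizes and depths are illustrative assumptions, not a reproduction of any cited model.

```python
import torch
import torch.nn as nn

class PerSensorCNN(nn.Module):
    """One small 1D-CNN per sensor/device; branch features are concatenated
    and passed to a fully-connected head, as described above."""
    def __init__(self, n_sensors: int, channels_per_sensor: int, n_classes: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels_per_sensor, 32, kernel_size=5, padding="same"),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
                nn.Flatten(),
            )
            for _ in range(n_sensors)
        ])
        self.head = nn.Linear(32 * n_sensors, n_classes)

    def forward(self, xs):
        # xs[i] has shape (batch, channels_per_sensor, timesteps) for sensor i,
        # so each branch learns weights independently of the other sensors.
        feats = [branch(x) for branch, x in zip(self.branches, xs)]
        return self.head(torch.cat(feats, dim=1))
```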
While these two branching methods are the most common, other studies have proposed alternative CNNs for HAR.
DCNN [200] pre-processes the sensor data using a Discrete Fourier Transform to convert IMU data to frequency signals, then uses two-dimensional convolutions to extract combined temporal and frequency features. Lee et al. [204] pre-processed the tri-axial accelerometer data into a magnitude vector, which was then processed in parallel by CNNs with varying kernel sizes, extracting features at different scales. Xu et al. [221] used deformable convolutions [222] in both a 2D-CNN and a ResNet model and found these models performed better than their non-deformable counterparts.
Yao et al. [208] proposed a fully convolutional model using two-dimensional temporal and feature convolutions. Their model has two advantages: (1) it handles arbitrary-length input sequences, and (2) it makes a prediction for each timestep, which avoids the need to pre-process the data into windows and can detect transitions between activities.
7.1.2 Recurrent neural networks. Several long short-term memory (LSTM) models have been proposed for HAR. Murad and Pyun [205] designed and compared three multi-layered LSTMs: a uni-directional LSTM, a bi-directional LSTM, and a "cascading" LSTM, which has a bi-directional first layer followed by uni-directional layers. In each case, the output from all time steps is used as input to the classification layer. Zeng et al. [209] added two attention layers to an LSTM: a sensor attention layer before the LSTM and a temporal attention layer after the LSTM. They also included a regularisation term they called "continuous attention" to smooth the transitions between attention weights. Guan and Plötz [203] created an ensemble of LSTM models by saving the models at every training epoch, then selecting the best "M" models based on validation set results, thus aiming to reduce model variance.

7.1.3 Hybrid models.
Many recent studies have focussed on hybrid models combining both CNNs and RNNs. DeepConvLSTM [191] comprises four temporal convolutional layers followed by two LSTM layers, which the authors found performed better than an equivalent CNN (replacing the LSTM layers with fully-connected layers). As the LSTM layers have fewer parameters than fully-connected layers, the DeepConvLSTM model was also much smaller. Singh et al. [219] used a CNN to encode the spatial data (i.e., the sensor readings at each timestamp), followed by a single LSTM layer to encode the temporal data, then a self-attention layer to weight the time steps. They found this model performed better than an equivalent model using temporal convolutions in the CNN layers. Challa et al. [213] proposed using three 1D-CNNs with different kernel sizes in parallel, followed by two bi-directional LSTM layers and a fully-connected layer.
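A simplified sketch of this convolution-then-recurrence pattern, in the spirit of DeepConvLSTM though with purely illustrative layer sizes:

```python
import torch
import torch.nn as nn

class ConvLSTMHAR(nn.Module):
    """Temporal convolutions extract local motion features; stacked LSTM
    layers then model the longer-range temporal structure."""
    def __init__(self, n_channels: int, n_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=5, padding="same"), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding="same"), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=128,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                 # x: (batch, channels, timesteps)
        z = self.conv(x).transpose(1, 2)  # -> (batch, timesteps, 64)
        out, _ = self.lstm(z)
        return self.head(out[:, -1, :])   # classify from the last timestep
```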
Nafea et al. [218] also used 1D-CNNs with different kernel sizes and bi-directional LSTMs. However, they used separate branches for the CNNs and LSTMs, merging the features extracted by each branch for the final fully-connected layer.
Mekruksavanich and Jitpattanakul [216] compared a 4-layer CNN-LSTM model with a smaller CNN-LSTM model and LSTM models, finding the extra convolutional layers improved performance over the smaller models. DEBONAIR [215] is another multi-layered model. It uses parallel 1D-CNNs, each having different kernel, filter, and pooling sizes, to extract the different types of features associated with different types of activity. These are followed by a combined 1D-CNN, then two LSTM layers. Mekruksavanich and Jitpattanakul [217] ensembled four different models: a CNN, an LSTM, a CNN-LSTM, and a ConvLSTM model. They aimed to produce a model for biometric user identification that could identify not only the activity being performed, but also the participant performing it.
A few hybrid models use GRUs instead of LSTMs. InnoHAR [211] is a modified DeepConvLSTM [191], replacing the four CNN layers with inception layers and the two LSTM layers with GRU layers. The authors found this inception model performed better than both the original DeepConvLSTM model and a straight CNN model [201]. AttnSense [210] uses a Fast Fourier Transform to generate frequency features, which are then convolved separately for each time step.
Attention layers are used to weight the extracted frequency features, which are then passed through a GRU with temporal attention to extract temporal features. CNN-BiGRU [214] uses a CNN layer to extract spatial features from the sensor data, after which one or more GRU layers extract temporal features. The final section of the model is a fully-connected module consisting of one or more hidden layers and a softmax output layer.

Satellite Earth Observation
Ever since NASA launched the first Landsat satellite in 1972 [223], Earth-observing satellites have been recording images of the Earth's surface, providing 50 years of continuous Earth observation (EO) data that can be used to estimate environmental variables informing us about the state of the Earth. Instruments on board the satellites record reflected or emitted electromagnetic radiation from the Earth's surface and vegetation [224]. The regular, repeated observations from these instruments form satellite image time series (SITS) that are useful for analysing the dynamic properties of some variables, such as plant phenology. The main modalities used for SITS analysis are multispectral spectrometers and spectroradiometers, which observe the visible and infrared (IR) frequencies, and Synthetic Aperture Radar (SAR) systems, which emit a microwave signal and measure the backscatter. A list of the main satellites and instruments used in the studies reviewed is provided in section C.2 of the Appendix.
Raw data collected by satellite instruments needs to be pre-processed before being used in machine learning. This is frequently done by the data providers to produce analysis-ready datasets (ARD). With the increasing availability of compatible ARD datasets from sources such as Google Earth Engine [225] and various data cubes [226,227], models combining data from multiple data sources (multi-modal models) are becoming more common. These data sources make it straightforward to obtain data that are co-registered (spatially aligned, with the same resolution and projection), thus avoiding the need for complex pre-processing.
Satellite image time series can be processed either (1) as two-dimensional temporal and spectral data, processing each pixel independently and ignoring the spatial dimensions, or (2) as four-dimensional data including the two spatial dimensions, from which models extract spatio-temporal features. The latter method allows estimates to be made at pixel, patch, or object level; however, it requires either more complex models or spatial features to be extracted in a pre-processing step. Feature extraction can be as simple as extracting the mean value for each band, but both clustering (TASSEL [228]) and neural-network-based methods, such as the Pixel-Set Encoder [229], have been used for more complex feature extraction.
The most common use of SITS deep learning is the classification of the Earth's surface by land cover, and of agricultural land by crop type. The classes used range from very broad land cover categories (such as forest, grassland, and agriculture) through to specific crop types. Other classification tasks include identifying specific features, such as sink-holes [230], burnt areas [231], flooded areas [232], roads [233], deforestation [234], vegetation quality [235], and forest understory and litter types [236].
Extrinsic regression tasks are less common than classification tasks, but several recent studies have investigated methods of estimating the water content of vegetation, as measured by the variable Live Fuel Moisture Content (LFMC) [237][238][239][240]. Other regression tasks include estimating the wood volume of forests [241], using a hybrid CNN-MLP model that combines a time series of Sentinel-2 images with a single LiDAR image, and predicting crop yield [242] using a hybrid of CNN and LSTM.
Many different approaches to learning from SITS data have been studied, with studies using all the main deep learning architectures, adapting them for multi-modal learning, and combining architectures in hybrid and ensemble models. The rest of this section reviews the architectures that have been used to model SITS data. A summary of these papers and architectures is provided in table 6.
Recurrent Neural Networks (RNNs)
One of the first papers to use RNNs for land cover classification was Ienco et al. [259], who showed an LSTM model out-performed non-deep-learning methods such as Random Forest (RF). Rao et al. [237] used an extrinsic regression LSTM model to estimate LFMC in the western United States.
More commonly, however, RNNs are combined with an attention layer to allow the model to focus on the most important time steps. The OD2RNN model [261] used separate GRU layers followed by attention layers to process Sentinel-1 and Sentinel-2 data, combining the features extracted from each source for the final fully-connected layers.
HOb2sRNN [260] refined OD2RNN by using a hierarchy of land cover classifications; the model was pre-trained using broad land cover classes, then further trained using the finer-grained classes. DCM [246] and HierbiLSTM [247] both use a bi-directional LSTM, processing the time series in both directions, followed by a self-attention transformer, for pixel-level crop mapping. All of these studies found that adding the attention layers improved model performance over a straight GRU or LSTM model.

Convolutional Neural Networks (CNNs)
While many authors have claimed that RNNs out-perform CNNs for land cover and crop type classification, most of these comparisons are against 2-dimensional CNNs (2D-CNNs), which ignore the temporal ordering of SITS data [253]. However, other studies show that using 1-dimensional CNNs (1D-CNNs) to extract temporal information, or 3-dimensional CNNs (3D-CNNs) to extract spatio-temporal information, are both effective methods of learning from SITS data. TempCNN [253] consists of three 1D convolutional layers; the output from the final convolutional layer is passed through a fully-connected layer, then the final softmax classification layer. TASSEL [228] is an adaptation of TempCNN for OBIA classification, using TempCNN models to process features extracted from the objects, followed by an attention layer to weight the convolved features. TempCNN has also been adapted for extrinsic regression [238] and used for LFMC estimation [238][239][240].
2D-CNNs are mainly used to extract spatial or spatio-temporal features for both pixel and object classification. The model input is usually 4-dimensional and the data is convolved spatially, with two main methods used to handle the temporal dimension. In the first method, each time step is convolved separately and the extracted features are merged in later stages of the model [244]. In the second method, the time steps and channels are flattened to form one large multivariate image [241,252], as sketched below. FG-UNet [258] is a fully-convolutional model that combines both of the above methods, first grouping time steps in threes to produce images with 30 channels (10 spectral × 3 temporal), which are passed through both U-Net and 2D-CNN layers.
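A minimal sketch of the second method, taking the 10-band, 3-date configuration mentioned above as the assumed input shape:

```python
import torch

# SITS batch: (batch, timesteps, channels, height, width),
# e.g. 10-band imagery over 3 dates on 64x64 patches.
x = torch.randn(8, 3, 10, 64, 64)

# Flatten timesteps and channels into one large multivariate image so a
# plain 2D-CNN can convolve it spatially.
b, t, c, h, w = x.shape
x_flat = x.reshape(b, t * c, h, w)   # -> (8, 30, 64, 64)

# A standard 2D convolution can now consume the 30-channel image:
conv = torch.nn.Conv2d(in_channels=t * c, out_channels=64,
                       kernel_size=3, padding=1)
features = conv(x_flat)              # -> (8, 64, 64, 64)
```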
Ji et al. [245] used a three-dimensional CNN (3D-CNN) to convolve the spatial and temporal dimensions together, combining the strengths of 1D-CNNs and 2D-CNNs. The study found a 3D-CNN crop classification model performed significantly better than the 2D-CNN, again showing the importance of the temporal features. Another study, SSTNN [263], obtained good results for crop yield prediction by using a 3D-CNN to convolve the spatial and spectral dimensions, extracting spatio-spectral features for each time step. These features were then processed by LSTM layers to perform the temporal modelling.

Transformer and Attention Models.
As an alternative to including attention layers in a CNN or RNN, several studies have designed models that process temporal information using only attention layers. PSE-TAE [229] used a modified transformer called a temporal attention encoder (TAE) for crop mapping and found the TAE performed better than either a CNN or an RNN. L-TAE [248] replaced the TAE with a light-weight transformer which is both more computationally efficient and more accurate than the full TAE. Ofori-Ampofo et al. [249] adapted the TAE model for multi-modal inputs, using Sentinel-1 and Sentinel-2 data for crop type mapping. Rußwurm and Körner [264] compared a self-attention model with RNN and CNN architectures. They found that the self-attention model was more robust to noise than either the RNN or the CNN and suggested it is suitable for processing raw, cloud-affected satellite data.
Building on the success of pre-trained transformers for natural language processing (NLP), such as BERT [93], pre-trained transformers have been proposed for EO tasks [250]. Earth observation tasks are particularly suited to pre-trained models, as large quantities of EO data are readily available while labelled data can be difficult to obtain [265], especially for remote locations. SITS-BERT [250] is an adaptation of BERT [93] for pixel-based SITS classification. For the pretext task, random noise is added to the pixels and the model is trained to identify and remove this noise. The pre-trained model is then further trained for the required tasks, such as crop type or land cover mapping. SITS-Former [262] modifies SITS-BERT for patch classification by using 3D convolutional layers to encode the spatial-spectral information, which is then passed through the temporal attention layers. The pretext task used for SITS-Former is to predict randomly masked pixels.
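The sketch below illustrates the general noise-corruption pretext idea; the corruption rate, noise scale, and loss pairing are assumptions for illustration, not the published SITS-BERT recipe.

```python
import torch

def noisy_pretext_batch(series: torch.Tensor, rate: float = 0.15,
                        scale: float = 0.5):
    """Corrupt a random fraction of timesteps with additive noise and return
    the binary corruption mask. `series` has shape (batch, timesteps, bands).

    An encoder can then be trained to flag the corrupted timesteps and/or
    reconstruct the clean values, yielding a self-supervised signal."""
    mask = torch.rand(series.shape[:2]) < rate       # timesteps to corrupt
    noise = torch.randn_like(series) * scale
    corrupted = series + noise * mask.unsqueeze(-1)
    return corrupted, mask

# Training-loop idea: feed `corrupted` to the encoder and minimise, e.g.,
# a binary cross-entropy on predicting `mask` plus an MSE on reconstructing
# the original series at the corrupted positions.
```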

Hybrid Models.
A common use of hybrid models is to use a CNN to extract spatial features and an RNN to extract temporal features. Garnot et al. [266] compared a straight 2D-CNN model (ignoring the temporal dimension), a straight GRU model (ignoring the spatial dimension), and a combined 2D-CNN and GRU model (using both). The combined model gave the best results, demonstrating that both the spatial and temporal dimensions provide useful information for land cover mapping and crop classification. DuPLO [256] was one of the first models to exploit this method, running a CNN and a ConvGRU model in parallel, then fusing the outputs using a fully-connected network for the final classifier. During training, an auxiliary classifier for each component was used to enhance the discriminative power. TWINNS [255] extended DuPLO to a multi-modal model, using time series of both Sentinel-1 (SAR) and Sentinel-2 (optical) images. Each modality was processed by separate CNN and ConvGRU models, then the output features from all four models were fused for classification.
Other hybrid models include that of Li et al. [243], who used a CNN for the spatial and spectral unification of Landsat-8 and Sentinel-2 images, which were then processed by a GRU. MLDL-Net [242] is a 2D-CNN extrinsic regression model, using CNNs to extract time step features, which are then passed through an LSTM model to extract temporal features.
Fully-connected layers combine the feature sets to predict crop yield. Rußwurm and Körner [257] extracted temporal features first, using a bi-directional LSTM, then used a fully-convolutional 2D-CNN to incorporate spatial information and classify each pixel in the input patch.
7.2.5 Ensemble Models. One of the easiest ways to ensemble deep learning models is to train multiple homogeneous models that vary only in their random weight initialisation [267]. Di Mauro et al. [251] ensembled 100 LULC models with different weight initialisations by averaging their softmax predictions. They found this produced a more stable and stronger classifier that outperformed the individual models. Multi-tempCNN [239], a model for LFMC estimation, is an ensemble of homogeneous models for extrinsic regression. The authors suggested that, as an additional benefit, the variance of the individual model predictions can be used as a measure of the uncertainty of the estimates. TSI [254] also ensembles a set of homogeneous models, but instead of relying on random weight initialisation to introduce model diversity, the time series are segmented and a model is trained on each segment.
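A minimal sketch of this softmax-averaging ensemble, keeping the per-member spread as the rough uncertainty signal suggested above:

```python
import torch

@torch.no_grad()
def ensemble_predict(models, x):
    """Average the softmax outputs of homogeneous models that differ only in
    their random weight initialisation. The disagreement between members
    (standard deviation of their probabilities) can serve as a crude
    uncertainty estimate."""
    probs = torch.stack([m(x).softmax(dim=-1) for m in models])  # (M, batch, classes)
    mean = probs.mean(dim=0)
    spread = probs.std(dim=0)
    return mean.argmax(dim=-1), spread
```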
Other methods create ensembles of heterogeneous models. Kussul et al. [252] compared ensembles of 1D-CNN and 2D-CNN models for land cover classification; each model in the ensemble used a different number of filters, thus finding different feature sets useful for classification. Xie et al. [240] ensembled three heterogeneous models - a causal temporal convolutional neural network (TCN), an LSTM, and a hybrid TCN-LSTM model - for an extrinsic regression model to estimate LFMC. The ensembles were created using stacking [268]. The authors compared this method to boosting their TCN-LSTM model, using Adaboost [269] to create a three-member ensemble, and found that stacking a diverse set of models out-performed boosting.

EO Surveys and Reviews
This survey is one of very few that include a section focusing specifically on deep learning for TSC and TSER tasks using SITS data. However, other reviews provide further information about related topics. Gomez et al. [270] is an older review highlighting the important role of SITS data in land cover classification. Zhu et al. [271] reviewed the advances and challenges of DL for remote sensing, and the resources available that could potentially help DL address some of the major challenges facing humanity. Ma et al. [272] studied the role of deep learning in Earth observation using remotely sensed data, covering a broad range of tasks including image fusion, image segmentation, and object-based analysis, as well as classification. Yuan et al. [273] reviewed DL applications for remote sensing, comparing the role of DL versus physical modelling of environmental variables and highlighting challenges in DL for remote sensing that need to be addressed. Chaves et al. [274] reviewed recent research using Landsat 8 and/or Sentinel-2 data for land cover mapping; while not focused on SITS DL methods, the review notes their growing importance. Moskolai et al. [275] reviewed forecasting applications using DL with SITS data, analysing the main DL architectures that are relevant for classification as well as forecasting.

CONCLUSION
In conclusion, this survey has discussed a variety of deep network architectures for time series classification and extrinsic regression tasks, including multilayer perceptrons, convolutional neural networks, recurrent neural networks, and attention-based models. We have also highlighted refinements that have been made to improve the performance of these models on time series tasks. Additionally, we have discussed two critical applications of time series classification and regression: human activity recognition and satellite Earth observation. Overall, deep network architectures and their refinements have enabled significant progress in the field of time series classification and will continue to be essential for addressing a wide range of real-world problems. We hope this survey will stimulate further research using deep learning techniques for time series classification and extrinsic regression. To further support the research community, we provide a carefully curated collection of sources at https://github.com/Navidfoumani/TSC_Survey.

APPENDIX A NON-DEEP LEARNING TIME SERIES CLASSIFICATION
In this section, we give a brief introduction to the field of TSC and discuss its current status. We refer interested readers to the 'bake-off' papers [11,25,26], which describe TSC methods in much more detail and benchmark them.
Research in TSC started with distance-based approaches that find discriminating patterns in the shape of the time series. Distance-based approaches usually couple a 1-nearest-neighbour (1NN) classifier with a time series distance measure [276,277]. Small distortions in the time series can lead to false matches when measuring the distance between time series using standard measures such as the Euclidean distance [276]. A time series distance measure aims to compensate for these distortions by aligning two time series such that the alignment cost between the two is minimised. Many time series distances have been proposed in the literature; among these, the Dynamic Time Warping (DTW) distance is one of the most popular choices for many time series tasks, due to its intuitiveness and effectiveness in aligning two time series. The 1NN-DTW classifier was the go-to method for TSC for decades. However, by comparing several time series distance measures, the work in [276] showed that, as of 2015, no single distance significantly outperformed DTW when used with a 1NN classifier. The recent Amerced DTW [278] is the first distance that is significantly more accurate than DTW. These individual 1NN classifiers with different distances can be combined to create an ensemble, such as the Ensemble of Elastic distances (EE), that significantly outperforms each of them individually [276,277]. However, since most distances have a complexity of O(L²), where L is the length of the series, performing a nearest-neighbour search becomes very costly. Hence, distance-based approaches are considered to be among the slowest methods for TSC [279,280].
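For reference, the standard DTW formulation underlying 1NN-DTW can be written as the following cost recurrence (using the squared difference as the local cost, one common choice):

```latex
% D(i,j) = cost of the best alignment of x_{1..i} with y_{1..j},
% for series x and y of length L.
\begin{aligned}
D(0,0) &= 0, \qquad D(i,0) = D(0,j) = \infty \quad (i,j \ge 1),\\
D(i,j) &= (x_i - y_j)^2 + \min\bigl\{\, D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \,\bigr\},\\
\mathrm{DTW}(x,y) &= \sqrt{D(L,L)}.
\end{aligned}
```

Filling the L × L table gives the O(L²) cost noted above; constraining the alignment to a warping window of width w (as in the Sakoe-Chiba band) reduces this to O(Lw).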
Taking advantage of this notion, that ensembling diverse classifiers improves accuracy, led to the development of the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) [280,283]. HIVE-COTE is a meta-ensemble for TSC, forming its ensemble from ensemble classifiers built in multiple domains. Since its introduction in 2016 [283], HIVE-COTE has gone through several iterations.
Recently, the latest version, HIVE-COTE v2.0 (HC2), was proposed [280]. It comprises four ensemble members, each of them the then state of the art in its respective domain. It is currently one of the most accurate classifiers for both univariate and multivariate TSC tasks [280]. Despite being accurate on 26 multivariate and 142 univariate TSC benchmark datasets, which are relatively small, HC2 scales poorly to large datasets with long time series, as well as to datasets with large numbers of channels.
Various work has been done on speeding up TSC methods without sacrificing accuracy [14,277,[290][291][292][293][294]. A recent breakthrough was the development of Rocket [14], which was able to process 109 univariate time series datasets in under 4 hours, while the previous fastest method took days. Rocket leverages a large number of random convolutional filters to extract features from each series that might be relevant to classifying it. These features are then passed to a linear model for classification. Rocket has since been improved to be faster (Minirocket [290]) and more accurate (Multirocket [291] and Hydra [292]). Hydra combined with Multirocket is now one of the fastest and most accurate methods for TSC.

B DNN ARCHITECTURES FOR TIME SERIES
In this section, we provide a descriptive overview of deep learning-based models for TSC. The focus is on clarifying their architectures and outlining their adaptations to the specific characteristics of time series data.

B.1 Multi-Layer Perceptron (MLP)
The simplest neural network architecture is the fully connected network (FC), also known as the multilayer perceptron (MLP). As shown in Fig. 2, all neurons of one layer $\ell-1$ are connected to all neurons of the following layer $\ell$, with $\ell \in [1, L]$. These connections are modelled by the weights of the neural network. A general equation for applying a non-linearity to an input $A^{\ell-1}$ is:

$$A^{\ell} = f\left(W^{\ell} A^{\ell-1} + b^{\ell}\right)$$

where $A^{\ell}$ is the activation of the neurons in layer $\ell$, with $A^{1}$ equal to the input series $X$. Also, $W^{\ell}$ and $b^{\ell}$ are the neuron weights and biases, and $f$ is the nonlinear activation function.
One of the main limitations of using multilayer perceptrons (MLPs) for time series data is that they are not well-suited to capturing the temporal dependencies in this type of data.MLPs are feedforward networks that process input data in a fixed and predetermined order without considering the temporal relationships between the input values.As shown in Fig. 2, each time step is weighted individually, and time series elements are treated independently from each other.

B.2 Convolutional Neural Networks (CNNs)
The convolutional neural network (CNN) was first proposed by Kunihiko Fukushima in 1982 [295]. It was inspired by the structure and function of the visual cortex in animals, specifically the cat's cortex, as described by David Hubel and Torsten Wiesel in their influential work from 1962 [296]. Convolutional neural networks have been widely used for visual pattern recognition, but their ability to process large images was constrained by computational limitations until the emergence of GPU technology. Following the development of Graphics Processing Unit (GPU) technology, Krizhevsky et al. [39] implemented an efficient GPU-based program and won the ImageNet competition in 2012, bringing the convolutional neural network back into the spotlight. Many variants of CNN architectures have been proposed in the literature, but their primary components are very similar. Using LeNet-5 [297] as an example, a CNN consists of three types of layers: convolutional, pooling, and fully-connected. The purpose of the convolutional layer is to learn feature representations of the inputs. Each convolution layer is composed of several convolution kernels (or filters) used to compute different feature maps, and each neuron of a feature map is connected to a region of neighbouring neurons in the previous layer called the receptive field. Feature maps are created by first convolving the input with a learned kernel and then applying an element-wise nonlinear activation function to the convolved results. Note that the kernel of each feature map is shared by all spatial locations of the input, and several kernels are used to obtain the complete set of feature maps. Fig. 3 shows the architecture of the t-LeNet network, a time series-specific version of LeNet.
The feature value at location $(i, j)$ in the $k$-th feature map of the $\ell$-th layer is obtained by:

$$z^{\ell}_{i,j,k} = \left(W^{\ell}_{k}\right)^{\top} A^{\ell-1}_{i,j} + b^{\ell}_{k}$$

where $W^{\ell}_{k}$ and $b^{\ell}_{k}$ are the weight vector and bias term of the $k$-th filter of the $\ell$-th layer, respectively, and $A^{\ell-1}_{i,j}$ is the input patch centered at location $(i, j)$ of the $(\ell-1)$-th layer. Note that the kernel $W^{\ell}_{k}$ generates the whole feature map $Z^{\ell}_{:,:,k}$. This weight-sharing mechanism has several advantages, such as reducing model complexity and making the network easier to train. Let $f(\cdot)$ denote the nonlinear activation function. The activation value of the convolutional feature $z^{\ell}_{i,j,k}$ is computed as:

$$a^{\ell}_{i,j,k} = f\left(z^{\ell}_{i,j,k}\right)$$

The most common activation functions are sigmoid, tanh, and ReLU [298]. As shown in Fig. 3, a pooling layer is often placed between two convolution layers to reduce the resolution of the feature maps and to achieve shift-invariance.
Following several convolution stages (the block comprising convolution, activation, and pooling is called a convolution stage), there may be one or more fully-connected layers that aim to perform high-level reasoning. As discussed in section 3.1, each neuron in the previous layer is connected to every neuron in the current layer to generate global semantic information. The final layer of a CNN is the output layer, in which the softmax operator is commonly used for classification tasks [40].
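As a minimal illustration, a single such convolution stage for a 3-channel time series might look as follows (all sizes are arbitrary):

```python
import torch
import torch.nn as nn

# One "convolution stage" (convolution -> activation -> pooling), adapted to
# time series with 1D convolutions over the temporal axis.
stage = nn.Sequential(
    nn.Conv1d(in_channels=3, out_channels=16, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),   # halves the temporal resolution
)

x = torch.randn(32, 3, 128)        # (batch, channels, timesteps)
print(stage(x).shape)              # -> torch.Size([32, 16, 64])
```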

B.3 Recurrent Neural Networks (RNN)
RNNs are a type of neural network specifically designed to process time series and other sequential data.
RNNs are conceptually similar to feed-forward neural networks (FFNs). While FFNs map fixed-size inputs to fixed-size outputs, RNNs can process variable-length inputs and produce variable-length outputs. This capability is enabled by sharing parameters over time through directed connections between individual layers. RNN models for TSC can be classified as sequence-to-sequence or sequence-to-one based on their outputs. Fig. 4 shows a sequence-to-sequence architecture, with an output for each input sub-series. In a sequence-to-one architecture, on the other hand, decisions are made using only the final output $y_T$, ignoring the other outputs.
At each time step $t$, an RNN maintains a hidden vector $h_t$, which is updated as follows [299,300]:

$$h_t = \tanh\left(W h_{t-1} + I x_t\right)$$

where $X = \{x_1, \ldots, x_T\}$ contains all of the observations, $\tanh$ denotes the hyperbolic tangent function, and the recurrent weight matrix and the projection matrix are denoted by $W$ and $I$, respectively. The hidden-to-hidden connections model the short-term time dependency. The hidden state $h_t$ is used to make a prediction as:

$$y_t = \varsigma\left(W' h_t\right)$$

where $\varsigma$ is a softmax function providing a normalized probability distribution over the possible classes, and $W'$ is the output weight matrix. As depicted in Fig. 4, the hidden states can be used to stack RNNs in order to build deeper networks:

$$h^{\ell}_t = \sigma\left(W h^{\ell}_{t-1} + I h^{\ell-1}_t\right)$$

where $\sigma$ is the logistic sigmoid function. As an alternative to feeding each individual time step to the RNN, the data can be divided into time windows of $w$ observations, with the option of variable overlap. Each time window is then labeled with the majority label within the window.

B.3.1 Long Short-Term Memory (LSTM)
LSTM deals with the vanishing/exploding gradient problem commonly found in standard recurrent neural networks through the incorporation of gate-controlled memory cells into its state dynamics [78]. As shown in Fig. 5 (a), the LSTM uses a hidden vector $h_t$ and a memory vector $c_t$ to control state updates and outputs at each time step. Specifically, the computation at time step $t$ is formulated as follows [301]:

$$\Gamma_i = \sigma\left(W_i h_{t-1} + I_i x_t\right)$$
$$\Gamma_f = \sigma\left(W_f h_{t-1} + I_f x_t\right)$$
$$\Gamma_o = \sigma\left(W_o h_{t-1} + I_o x_t\right)$$
$$\Gamma_c = \tanh\left(W_c h_{t-1} + I_c x_t\right)$$
$$c_t = \Gamma_f \otimes c_{t-1} + \Gamma_i \otimes \Gamma_c$$
$$h_t = \Gamma_o \otimes \tanh(c_t)$$

where $\Gamma_c$ is the cell state gate and $\Gamma_i$, $\Gamma_f$ and $\Gamma_o$ are the activation vectors of the input, forget and output gates, respectively. $\sigma$ is the logistic sigmoid function and $\otimes$ denotes the element-wise product. $W_i$, $W_f$, $W_o$, $W_c$ represent the recurrent weight matrices, and $I_i$, $I_f$, $I_o$, $I_c$ represent the projection matrices.
B.3.2 Gated Recurrent Unit (GRU)
The GRU simplifies the LSTM by combining its gating into an update gate and a reset gate:

$$\Gamma_u = \sigma\left(W_u h_{t-1} + I_u x_t\right)$$
$$\Gamma_r = \sigma\left(W_r h_{t-1} + I_r x_t\right)$$
$$\tilde{h}_t = \tanh\left(W_h (\Gamma_r \otimes h_{t-1}) + I_h x_t\right)$$
$$h_t = \Gamma_u \otimes h_{t-1} + (1 - \Gamma_u) \otimes \tilde{h}_t$$

where the $W$ and $I$ matrices are the weight matrices associated with the gates, and $\Gamma_u$ and $\Gamma_r$ represent the update and reset gates, respectively. The function $\sigma$ denotes the logistic sigmoid, and $\otimes$ denotes the element-wise product.

B.4 Attention-Based Models
B.4.1 Self-Attention. The attention mechanism was introduced by [302] to improve the performance of encoder-decoder models [303] in neural machine translation. The encoder-decoder in neural machine translation encodes a source sentence into a vector in a latent space and decodes the latent vector into a target-language sentence. As shown in Fig. 6, the attention mechanism allows the decoder to attend to the relevant segments of the source for each target position through a context vector $c_t$. For this model, a variable-length attention vector $\alpha_t$, whose size equals the number of source time steps, is derived by comparing the current target hidden state $h_t$ with each source hidden state $\bar{h}_s$ as follows [304]:

$$\alpha_t(s) = \frac{\exp\left(\mathrm{score}(h_t, \bar{h}_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, \bar{h}_{s'})\right)}$$

The term $\mathrm{score}$ is referred to as an alignment model and is used to compare the target hidden state $h_t$ with each of the source hidden states $\bar{h}_s$; the result is normalized to produce attention weights (a distribution over source positions).
There are various choices for the scoring function, such as:

$$\mathrm{score}(h_t, \bar{h}_s) = \begin{cases} h_t^{\top} \bar{h}_s & \text{dot} \\ h_t^{\top} W_a \bar{h}_s & \text{general} \\ v_a^{\top} \tanh\left(W_a [h_t; \bar{h}_s]\right) & \text{concat} \end{cases}$$

These scores determine the attention distribution and thus how the model attends to different parts of the input sequence when making predictions. In the concat form, the score function is parameterized as a feedforward neural network that is jointly trained with all the other components of the model. The model computes soft attention directly, allowing the gradient of the cost function to be backpropagated [302].
Given the alignment vector as weights, the context vector $c_t$ is computed as the weighted average over all the source hidden states:

$$c_t = \sum_{s} \alpha_t(s) \, \bar{h}_s$$

Accordingly, the computation path goes from $h_t \rightarrow \alpha_t \rightarrow c_t \rightarrow \tilde{h}_t$, and a prediction is then made using a softmax function [304]. Note that $\tilde{h}_t$ is a refined hidden state that incorporates both the original hidden state $h_t$ and the context information $c_t$ obtained through the attention mechanism.
B.4.2 Transformers. Similar to self-attention and other competitive neural sequence models, the original transformer developed for NLP (hereinafter the vanilla transformer) has an encoder-decoder structure that takes as input a sequence of words in the source language and generates its translation in the target language [92]. Both the encoder and decoder are composed of multiple identical blocks. Each encoder block consists of a multi-head self-attention module and a position-wise feed-forward network (FFN), while each decoder block inserts a cross-attention module between the multi-head self-attention module and the position-wise FFN. Unlike RNNs, transformers do not use recurrence; instead, they model sequence information using positional encodings added to the input embeddings.
The transformer architecture is based on finding associations or correlations between various input segments using the dot product. As shown in Fig. 7, the attention operation in transformers starts by building three different linearly-weighted vectors from the input $x_i$, referred to as the query ($q_i$), key ($k_i$), and value ($v_i$):

$$q_i = W^{Q} x_i, \qquad k_i = W^{K} x_i, \qquad v_i = W^{V} x_i$$

where $W^{Q}$, $W^{K}$ and $W^{V}$ are learnable weight matrices. The output vectors $z_i$ are given by:

$$z_i = \sum_{j} \mathrm{softmax}_j\!\left(\frac{q_i^{\top} k_j}{\sqrt{d_k}}\right) v_j$$

Note that the weighting of the value vector $v_j$ depends on the mapped correlation between the query vector $q_i$ at position $i$ and the key vector $k_j$ at position $j$. The value of the dot product tends to grow with the size of the query and key vectors, and since the softmax function is sensitive to large values, the attention weights are scaled by the square root of the query/key dimension $d_k$. The input data may contain several levels of correlation information, and the learning process may benefit from processing the input in multiple different ways. Multiple attention heads are therefore introduced, which operate on the same input in parallel and use different weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$ to extract various levels of correlation between the input data.
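The equations above translate directly into a few lines of code; the following single-head sketch uses random weights purely for illustration:

```python
import torch

def scaled_dot_product_attention(x, Wq, Wk, Wv):
    """Single-head self-attention as in the equations above:
    z_i = sum_j softmax_j(q_i . k_j / sqrt(d_k)) v_j, in matrix form."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # (timesteps, d_k or d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (timesteps, timesteps)
    return torch.softmax(scores, dim=-1) @ V

# Toy multivariate series: 50 timesteps embedded in 32 dimensions.
x = torch.randn(50, 32)
Wq, Wk, Wv = (torch.randn(32, 16) for _ in range(3))
z = scaled_dot_product_attention(x, Wq, Wk, Wv)      # -> (50, 16)
```

A multi-head block simply runs several such operations in parallel with different weight matrices and concatenates the resulting $z_i$ vectors.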

B.5 Graph Neural Networks
A graph consists of a set of nodes and a set of edges, each of which connects two nodes. Both nodes and edges may have attributes associated with them. The edges may be directed or undirected, and may be weighted. Graphs are useful for representing data that cannot be represented in Euclidean space, such as molecular structures, social networks, and spatio-temporal data (for example, electroencephalogram (EEG) or traffic monitoring networks).
Graph neural networks (GNNs) were first proposed by Scarcelli et al. [120] to learn directly from graph representations of data. Prior to GNNs, techniques such as recursive neural networks and Markov chains (random walk models) were used to incorporate graph structures; however, these methods required a pre-processing step rather than learning directly from the graph structure. Scarcelli et al.'s proposed GNN combines recursive neural networks and Markov chains to deal directly with the graph structure without any pre-processing requirement. While the network structure is predefined, the edge weights are parameters learned during training, during which units exchange information and update their states until reaching an equilibrium. GNNs take as input the graph structure and any associated node and edge attributes. Depending on the required task, the GNN output can be per node, per edge, or a single output for the whole graph [119]. Graph convolutional networks (GCNs) were proposed by Bruna et al. [305] and extend CNNs to graph structures.
Bruna et al. proposed two methods of constructing the graph convolution: spatial and spectral. The spatial technique simply applies the convolution operator to the local neighbourhood of each node, followed by a pooling operator. Although this reduces the spatial resolution, successive layers compensate by increasing the number of filters. Spectral construction first transforms the graph into a matrix V consisting of the eigenvectors of the graph Laplacian, ordered by eigenvalue.
The eigenvectors represent frequency components of the original graph: lower-order eigenvectors modulate slowly, so neighbouring nodes have similar values, whereas higher-order eigenvectors modulate more rapidly, and connected nodes are likely to have dissimilar values [306].
Many real-world graph datasets evolve over time: edges and nodes may come into existence or disappear, and attributes may change value. Dynamic or temporal graphs build this information into the graph structure, for instance through temporal nodes and edges that include initial and final timestamps [307]. Alternatively, spatio-temporal GNNs model the spatial and temporal aspects in separate layers, using GCN layers to learn spatial representations and RNN or 1D-CNN layers for the temporal representations [119].

C DATASETS
C.1 HAR Datasets
Many of the studies reviewed in the human activity recognition section use publicly available datasets. Some of the most commonly used datasets are listed in table 7, together with the number of participants, the sensors used to collect the data, a description of the activities recorded, and references to the studies using each dataset. Larger lists of datasets are provided in [5,186]. Common activity sets include activities of daily living (ADL) or basic activities (e.g. walking, running, sitting, standing, and ascending/descending stairs). However, data for more specialised events, such as gait freezing in Parkinson's Disease patients [308], falls [309], and manufacturing activities [310], are also collected.

C.2 Earth observation satellites and instruments
Table 8 lists the main satellites and instruments used in the studies reviewed for this survey.The table lists references for each source, which provide more details about the data collected, plus a list of the studies using each source.


Fig. 3. The architecture of the t-LeNet network (a time series-specific version of LeNet).

Fig. 4. The architecture of a two-layer recurrent neural network.

Fig. 7. Multi-head attention block: the example consists of eight heads, and the input sequence comprises two time steps.

Table 1. Summary of CNN models for time series classification and extrinsic regression.

Table 2. Summary of attention-based models for time series classification and extrinsic regression.

Table 3. Summary of graph neural network models for time series classification and extrinsic regression.

Table 4. Summary of self-supervised models for time series classification and extrinsic regression.

Table 5. Summary of HAR deep learning models.

Table 6. Summary of SITS deep learning models.