RITA: Group Attention is All You Need for Timeseries Analytics

Timeseries analytics is of great importance in many real-world applications. Recently, the Transformer model, popular in natural language processing, has been leveraged to learn high quality feature embeddings from timeseries, which lie at the core of various timeseries analytics tasks. However, quadratic time and space complexity limits Transformers' scalability, especially for long timeseries. To address this issue, we develop RITA, a timeseries analytics tool that uses a novel attention mechanism, named group attention, to scale to long timeseries. Group attention dynamically clusters objects by similarity into a small number of groups and approximately computes attention at the coarse group granularity. It thus significantly reduces time and space complexity, yet provides a theoretical guarantee on the quality of the computed attention. The dynamic scheduler of RITA continuously adapts the number of groups and the batch size during training, ensuring group attention always uses the fewest groups needed to meet the approximation quality requirement. Extensive experiments on various timeseries datasets and analytics tasks demonstrate that RITA outperforms the state-of-the-art in accuracy and is significantly faster, with speedups of up to 63X.

Effective feature extraction [40] lies at the core of almost all these timeseries analytics tasks. Recently, researchers [61] have started leveraging the self-supervised pre-training methodology of Transformers [4, 16, 52], which has proven remarkably successful in natural language processing (NLP), to automatically learn high quality feature embeddings from timeseries. In NLP, self-supervised pre-training exploits the sequential patterns (correlations) among the words in sentences to produce contextualized feature embeddings. Timeseries bear similarity to natural language, because in timeseries data the sequential order among the values (stock price, volume, etc.) over time matters. That is, each value is highly correlated with other values observed before or after it. Therefore, pre-training a Transformer model which takes the correlations among different observations into account is a natural way to learn feature embeddings from timeseries. Indeed, the experiments in [61] confirm that Transformer-based methods outperform traditional timeseries analytics techniques.
However, existing work [61] that directly applies Transformers to learn features from timeseries data has been shown not to scale to long timeseries [30]. The idea of self-attention [52] is central to pre-training methods in NLP: it computes pairwise correlations among different semantic units in a sequence (in NLP, a sentence); as such, it has quadratic time and space complexity in the length of the input sequence. This limits the model's scalability, especially when handling long sequences, which are common in real-world timeseries applications such as IoT, medical AI, and finance [6, 34, 62]. Predictions about timeseries may need to look at months or years of historical data, spanning hundreds of thousands of samples. As an example, in collaboration with a research hospital we have been developing a seizure classifier that automatically detects seizures based on EEG signals (timeseries) collected during the clinical observation of patients. As seizures last only a few seconds, we chunk long EEG data into many 2-second segments and detect seizures at the segment level. However, the classification of a particular segment depends on up to 12 hours of prior signal, because seizure diagnosis needs to consider long-term trends in the EEG data [6]. The number of segments in 12 hours is more than 21k. This is far larger than the number of semantic units typical NLP tasks expect: BERT [16] limits the number of units to 512, and even massive models like GPT-3 [4] limit it to 2048.
Although in NLP some lower-complexity methods have been proposed to approximately compute self-attention [10, 26, 54], their performance degrades dramatically when applied to timeseries, due to the gap between natural language and timeseries, as we show in our experiments.
Proposed Approach. To tackle the aforementioned problem, we develop RITA, a Transformer-based timeseries analytics tool, which uses a novel attention mechanism, called group attention, to scale to long timeseries.
Leveraging the periodicity of timeseries, RITA chunks the input timeseries into segments and dynamically clusters the segments into a small number (denoted as N) of groups. Segments in the same group possess similar feature embeddings during the current training iteration, enabling them to approximately share the computation of attention. As the timeseries increases in length, more sharing opportunities become available. RITA then computes the self-attention at a group level and produces a compressed group attention matrix. In this way, group attention eliminates both the computation and the memory bottleneck of Transformer-style models, and is thus more scalable to long timeseries.
However, making this idea effective and efficient in Transformer architectures is challenging for several reasons:
• Efficiently Producing High Quality Feature Embeddings. Although RITA computes the attention matrix at a group level, to preserve the quality of the feature embeddings, it still has to produce different embeddings for different segments. This is because even if some segments share the attention score temporarily, it does not mean they should have the same feature embedding. However, using the group attention matrix, the existing self-attention mechanism will only produce a single feature vector for each group. A naive solution would be to restore the original attention matrix from the group attention matrix. However, in this case we again get an attention matrix with quadratic space complexity; because GPUs have limited memory, GPU memory would remain a bottleneck in group attention.
• The Number of Groups N. In RITA, the number of groups N is a crucial factor that balances the speedup and the quality of the attention approximation. A small N leads to a large speedup, but the approximation errors can also be significant. On the other hand, although a large N tends to produce high-quality approximations, it inevitably slows down the training process. Therefore, an appropriate N is essential to the performance of group attention. However, N depends on the distributional properties of the dataset. Furthermore, like classical Transformer models, RITA stacks multiple attention layers to produce better embeddings; ideally, different layers should use different values of N. In addition, during the model training phase, group attention should use different values of N at different iterations to adapt to the evolving feature embeddings. This makes manually setting an appropriate N almost impossible.
• Batch Size. Moreover, as we want to dynamically adjust N during training, a fixed batch size is sub-optimal: as N decreases, the memory usage of a single sample decreases. This allows a larger batch size, which is beneficial because (1) it makes full use of GPU memory, and (2) the high parallelism across the samples in a big batch brings better performance. Our experimental study shows that doubling the batch size reduces the training time by 30%, while still preserving the quality of the model. Thus, RITA should dynamically adjust the batch size B as N changes.
To address the above problems, we first propose an embedding aggregation strategy and a customized group softmax function to replace the classical softmax function [52]. Together they ensure RITA is able to directly use the compressed attention matrix to produce different feature embeddings for different segments. We theoretically show that the embeddings RITA produces in this way are identical to those produced by first restoring the original large attention matrix. Thus RITA produces high quality embeddings without introducing extra overhead. Further, we design a GPU-friendly algorithm that groups the segments in parallel, effectively minimizing the grouping cost. Second, we design an adaptive scheduler which dynamically decides an appropriate N for each group attention layer during the training process. It starts with a large N and iteratively merges groups that are similar to each other. Guided by a user-tolerated error bound on the approximated self-attention, it automatically determines whether two groups are mergeable, performing the merging efficiently in a GPU-friendly way.

Moreover, we propose a learning-based method to model the correlation between the number of groups N and the batch size B. This model is used to predict a proper B for a given N when training RITA. Specifically, we first sample some N values in a reasonable range. For each sampled N, we find a batch size that consumes up to a certain percentage of GPU memory in a cost-efficient way. Using a small set of mathematical functions as a prior, RITA learns a model with only a few ⟨N, B⟩ pairs as ground truth labels.
Our experiments on public timeseries benchmarks and the MGH EEG data [6] confirm that RITA outperforms state-of-the-art methods in accuracy on various timeseries analytics tasks, while our group attention mechanism achieves up to a 63X speedup with much less memory required, compared to existing self-attention mechanisms [10, 52, 54].
Contributions. The key contributions of this work include:
• Our group attention mechanism leverages the periodicity of timeseries, reducing the time and space complexity of the self-attention mechanism with accuracy guarantees, allowing RITA to scale to long timeseries data.
• Guided by an approximation error bound, our adaptive scheduler dynamically adapts the number of groups and the batch size to the distribution properties of the evolving feature embeddings, making group attention efficient and easily tunable.
• We conduct experiments on various datasets and different analytics tasks, demonstrating that RITA is 4 to 63 times faster than the state-of-the-art while achieving better accuracy when handling long timeseries (length ≥ 2000).

BACKGROUND
We provide some background on the canonical self-attention module in the Transformer [52]. A self-attention module takes n hidden embedding vectors X ∈ R^{n×d_h} as input, projects them to queries (Q), keys (K), and values (V), and performs scaled-dot product attention:

A = softmax(QK^T / √d),  O = AV    (1)

Given a matrix M ∈ R^{n×n}, the softmax function normalizes M to ensure that each row sums to 1:

softmax(M)_{i,j} = exp(M_{i,j}) / Σ_{k=1}^{n} exp(M_{i,k})    (2)

Note the attention matrix A is an n × n matrix, where n represents the number of elements in the input sequence (e.g., words in NLP).
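To make the quadratic cost concrete, the computation above can be sketched in a few lines (a minimal NumPy illustration, not RITA code; variable names are ours). The n × n matrix A is exactly what group attention later compresses:

```python
import numpy as np

def softmax(M):
    # Row-wise softmax (Eq. 2): each row of the output sums to 1.
    E = np.exp(M - M.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project hidden states to queries, keys, and values, then apply
    # scaled-dot product attention (Eq. 1): A = softmax(QK^T / sqrt(d)).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))   # n x n attention matrix
    return A @ V, A                     # O = AV, plus A for inspection

rng = np.random.default_rng(0)
n, d_h, d = 6, 8, 4
X = rng.standard_normal((n, d_h))
Wq, Wk, Wv = (rng.standard_normal((d_h, d)) for _ in range(3))
O, A = self_attention(X, Wq, Wk, Wv)    # O: (6, 4), A: (6, 6)
```

Both the time to compute A and the memory to store it grow as n², which is the bottleneck RITA targets.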

RITA OVERVIEW
Given a collection of unlabeled timeseries, RITA first pre-trains a Transformer-style model to produce high quality feature embeddings for timeseries data. This pre-trained model is then used to support various downstream tasks, similar to BERT [16]. Next, we overview the model architecture of RITA. We show how RITA supports various downstream tasks in Appendix A.7.
As shown in Fig. 1, RITA consists of two components: (1) a Time-aware Convolution Layer and (2) the RITA Encoder. The Time-aware Convolution Layer fills the gap between timeseries and natural language. Despite their high-level similarity, the two differ in important ways. First, in natural language each word, as a discrete semantic unit, has an independent meaning, while each element in a timeseries is a continuous, numerical value and does not necessarily constitute an independent event. Furthermore, input sequences are single-channeled in NLP, but often multi-channeled in timeseries (i.e., sensor data often consists of several related channels).
RITA leverages the classical convolution [28] strategy to solve this problem. Convolution is widely used to capture the local structures of an image. We use convolution to chunk one input timeseries into a sequence of windows and learn the local structure of each window, similar to the discrete semantic units in natural language. Convolution also discovers the correlations across different channels, thus naturally solving the multi-channel problem.
More specifically, treating a multi-variate timeseries of length n with m variables as an n × m matrix T, RITA uses d convolution kernels to chunk T into a sequence of windows and produce one d-dimensional embedding per window using the convolution operation [28]. Each convolution kernel corresponds to a w × m matrix, where w defines the number of timestamps that each convolution kernel covers, identical to the window size in a sliding window.
RITA Encoder functions as the Transformer Encoder described in the original Transformer work [52]. It takes the embeddings of n semantic units x_1, x_2, ..., x_n (x_i ∈ R^d) as input (e.g., the embeddings of n windows of a timeseries), then models the correlations between the semantic units and outputs z_1, ..., z_n (z_i ∈ R^d) as the context-aware embedding of each unit.
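The chunking step described above can be sketched as follows (a simplified NumPy version with non-overlapping windows; in practice the kernel weights are learned, and RITA's actual layer may use different strides and padding):

```python
import numpy as np

def time_aware_conv(T, kernels, w):
    # T: (length, m) multi-variate timeseries; kernels: (d, w, m).
    # Chunk T into non-overlapping windows of w timestamps and emit
    # one d-dimensional embedding per window via convolution.
    length, m = T.shape
    d = kernels.shape[0]
    n = length // w                     # number of windows
    out = np.empty((n, d))
    for i in range(n):
        window = T[i * w:(i + 1) * w]   # (w, m) local structure
        # Each kernel is a w x m matrix dotted with the whole window,
        # mixing all m channels into one scalar per kernel.
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(0)
T = rng.standard_normal((20, 3))          # length 20, m = 3 channels
kernels = rng.standard_normal((4, 5, 3))  # d = 4 kernels, w = 5
E = time_aware_conv(T, kernels, w=5)      # 4 windows, 4 dims each
```

Because each kernel spans all m channels, the cross-channel correlations are mixed into every window embedding, which is how the multi-channel gap is bridged.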
What makes the RITA Encoder different from the Transformer Encoder is that at the core of the Transformer Encoder lies the self-attention mechanism, which incurs O(n²) time complexity and memory usage. This quadratic cost becomes prohibitive for long timeseries and limits the scalability of Transformer-based models. To make the attention computation efficient yet high-quality, we replace the canonical self-attention with our proposed group attention.
Self-supervised Pretraining. Inspired by the "cloze" pretraining task in NLP, we design a mask-and-predict task as the pretraining task for our model. The timeseries is randomly masked, and the model should recover the masked values based on the corresponding contextual information.
To be specific, we generate masks on timestamps with a mask rate θ. The timeseries is scaled to be non-negative, and the values across all channels on the masked timestamps are set to −1, an impossible value on normal timestamps. Then the masked timeseries is fed into RITA, and the output representation is translated back to the recovered timeseries by a transpose convolution layer.
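The masking step can be sketched as below (our own minimal version; the per-channel shifting used for scaling is an assumption of this sketch):

```python
import numpy as np

def mask_timeseries(ts, theta=0.2, rng=None):
    # Shift each channel to be non-negative, then set every channel
    # on a random theta-fraction of timestamps to -1, a value that
    # cannot occur on normal (non-negative) timestamps.
    rng = rng if rng is not None else np.random.default_rng(0)
    ts = ts - ts.min(axis=0, keepdims=True)   # non-negative values
    mask = rng.random(ts.shape[0]) < theta    # per-timestamp mask
    masked = ts.copy()
    masked[mask] = -1.0                       # masks all channels
    return masked, mask

rng = np.random.default_rng(1)
x = rng.standard_normal((100, 3))
masked, mask = mask_timeseries(x, theta=0.2, rng=rng)
```

The model is then trained to reconstruct the original values at the masked timestamps from the surrounding context.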

GROUP ATTENTION MECHANISM
Group attention, a novel and efficient approximate attention mechanism, addresses the performance bottleneck of self-attention in the vanilla Transformer.In this section, we first introduce the framework of group attention and then theoretically establish the bound of its approximation error.

The Idea of Group Attention
As periodicity is a natural property of timeseries [56], similar windows frequently occur. Similar windows result in similar queries/keys in the attention computation, bringing opportunities to save computation.
As discussed in Sec. 2, A_{i,j}, the attention score of window j onto window i, is determined by the inner product between the query vector of window i and the key vector of window j, that is, q_i • k_j. Given another window t, if window t has a key vector similar to that of window j, i.e., k_t ≈ k_j, then A_{i,t} ≈ A_{i,j}. This observation inspires our group attention mechanism: we group the windows by their similarity in keys. Assuming all windows in the same group have the same attention score onto another window i, we then compute the attention only once per group, using one single key to represent the group, for example the centroid of the group's keys. This saves significant computation cost.
Better yet, after grouping n windows into N groups, group attention compresses the attention matrix from an n × n matrix to an n × N matrix. Because N (the number of groups) tends to be much smaller than n (the number of windows) due to the periodicity of timeseries, group attention consumes much less memory than the original self-attention mechanism, successfully eliminating the memory bottleneck. Note that it also does not hurt quality much, as confirmed in our experiments (Sec. 6.2). We now discuss how to efficiently compute the output feature embeddings using the small compressed group attention matrix.

Problem: Producing Embeddings w/ Group Attention Matrix
As described in the Background, once we have acquired the attention matrix A, canonical self-attention computes the output embedding O as O = AV. Because A is an n × n matrix and V is an n × d_v matrix, the matrix product produces an n × d_v matrix O; that is, it produces a d_v-dimensional feature vector for each window. However, our group attention produces an n × N attention matrix Ā, where N corresponds to the number of groups. Directly applying the existing mechanism on Ā treats each group as a single unit and produces one feature vector per group. However, our goal is to produce different embeddings for different windows, because even if some windows share the attention score temporarily, it does not mean they should have the same feature embedding.
A Naive Solution. A naive solution would be to restore the full attention matrix A from the group attention matrix Ā. For example, given one group composed of w_i and w_j, we map its group attention vector in Ā to the entries that correspond to w_i and w_j in A. However, in this case we again get an n × n attention matrix, and GPU memory remains a bottleneck in group attention.

Solution: Embedding Aggregation and Group SoftMax
Using an embedding aggregation operation and a group softmax function, RITA produces n embeddings without restoring the full attention matrix. Fig. 2 shows the workflow of group attention.
Embedding Aggregation. The idea is inspired by an observation on the matrix product operation O = AV conducted on the fully restored attention matrix A. Given an element o_{i,j} of O corresponding to the j-th dimension of w_i's feature vector, o_{i,j} = a_i • v_j, where vector a_i ∈ R^n denotes the i-th row of the attention matrix A and vector v_j ∈ R^n denotes the j-th dimension of all n feature vectors. As an example, assume w_1 and w_2 belong to the same group; then the first two entries of a_i are identical, and a_{i,1}v_{j,1} + a_{i,2}v_{j,2} = a_{i,1}(v_{j,1} + v_{j,2}), i.e., the shared attention score needs to be multiplied only once with the sum of the two windows' values. As an immediate generalization of this analysis, if we aggregate the windows that belong to the same group, converting the n-dimensional vector v_j into an N-dimensional group feature vector ṽ_j beforehand, we can directly use the group attention vector ā_i and the group feature vector ṽ_j to compute o_{i,j} = ā_i • ṽ_j.
Using embedding aggregation, RITA produces a feature embedding O that is identical to the embedding produced by using the full attention matrix A and the embedding matrix V.
Group Softmax Function. In canonical self-attention, the attention matrix is computed as A = softmax(QK^T/√d). To compute A, we first compute QK^T (denoted as P), which is an n × n matrix; normalizing P with softmax then produces the attention matrix A.
Group attention follows the same procedure. But after grouping keys into K̃, QK̃^T produces an n × N matrix P̄. Due to the nonlinearity of the softmax function, applying softmax directly on P̄ results in a group attention matrix Ā from which we are not able to recover a full attention matrix identical to the one obtained by first restoring P̄ to P and then applying softmax on P. The matrix produced by the latter is desirable, as we want to approximate the original attention matrix as accurately as possible. However, restoring the small n × N matrix P̄ is not memory efficient, as it ends up with a full n × n matrix P.
To solve the above problems, we introduce a new group softmax function to replace the original softmax function (Eq. 2):

GroupSoftMax(P̄)_{i,j} = exp(P̄_{i,j}) / Σ_{k=1}^{N} c_k · exp(P̄_{i,k})    (3)

In Eq. 3, c_k represents the number of windows that group G_k contains. Compared to the original softmax, our group softmax considers each group G_k as c_k elements and counts it c_k times when summing up the exponentials of each group's P̄_{i,k}. In this way, the group softmax function operating on the small P̄ matrix produces exactly the same result as the softmax function operating on the full P matrix.
Theoretical Guarantee. In Appendix A.4, we prove that the group softmax function and the embedding aggregation operation produce the same output feature embeddings as the naive method that first restores the big full attention matrix.
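Putting the two pieces together, group attention can be sketched as below (a minimal NumPy illustration of the group softmax plus embedding aggregation, not the optimized algorithm in the appendix; group assignments and representatives are given as inputs here). Its correctness check is that it exactly matches full attention computed with each key replaced by its group representative:

```python
import numpy as np

def group_attention(Q, V, reps, groups):
    # Q: (n, d) queries; V: (n, d_v) values; reps: (N, d) group key
    # representatives; groups: (n,) group id of each window.
    n, d = Q.shape
    N = reps.shape[0]
    counts = np.bincount(groups, minlength=N)       # c_k per group
    P = Q @ reps.T / np.sqrt(d)                     # n x N scores
    E = np.exp(P - P.max(axis=1, keepdims=True))
    # Group softmax: weight each group's exponential by its size c_k.
    A_bar = E / (E * counts).sum(axis=1, keepdims=True)
    # Embedding aggregation: sum the value vectors inside each group.
    V_bar = np.zeros((N, V.shape[1]))
    np.add.at(V_bar, groups, V)
    return A_bar @ V_bar                            # n x d_v output

rng = np.random.default_rng(2)
n, d, N = 8, 4, 3
Q = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
groups = rng.integers(0, N, size=n)
reps = rng.standard_normal((N, d))
O = group_attention(Q, V, reps, groups)

# Reference: full n x n attention with keys replaced by representatives.
S = Q @ reps[groups].T / np.sqrt(d)
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
O_full = A @ V
```

Note that only the n × N matrices and the N × d_v aggregated values are ever materialized; the n × n matrix in the reference computation exists only to verify the equivalence.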
We show an efficient implementation of the embedding aggregation operation and the group softmax function in Appendix A.2, Alg. 1.
Time Complexity. The time complexity of Alg. 1 is O(nN) and the space complexity is O(nN), while the time and space complexities of the original self-attention mechanism are both O(n²).

Error Bound
Group attention produces a group attention matrix Ā which approximates the attention matrix A produced by classical self-attention with a bounded error, as shown in Lemma 1.
Lemma 1. Let r be the radius of the ball in which all key vectors live, and let k̃_j be the representative of the group that contains key k_j. Let Ã denote the full attention matrix restored from Ā. Suppose the distance between k_j and k̃_j is at most d for every j; then the entries of Ã deviate from those of A by at most a bound ε determined by d and r.
Lemma 1 shows that the error bound ε of group attention is determined by the distance d. As discussed in Sec. 5.1, this inspires us to design a strategy to dynamically determine the number of groups N, the most critical parameter of group attention. Please refer to Appendix A.5 for the proof.

GPU Friendly Grouping Method
In this section, we discuss the implementation of a grouping method. To make group attention efficient and effective, the grouping method has to satisfy the following requirements:
(1) Tight distance bound: to ensure the approximation quality, the distance between each key and its group representative should be minimized, per Lemma 1.
(2) Lightweight: to ensure the performance gain, the grouping method must be lightweight, at worst not exceeding the complexity of group attention itself (O(nN)).
(3) GPU friendly: to take advantage of GPUs, we prefer a grouping method that mainly consists of matrix operations, which can be efficiently executed on a GPU.
To satisfy the above requirements, after a thorough investigation of various clustering algorithms, we design a GPU-friendly K-means [35] as the grouping method.
First, K-means minimizes the overall distance between any object and its cluster center, hence naturally satisfying Requirement 1.
Second, given N centers, in each iteration the time and space complexity of K-means is O(nN). Usually, the iterations continue until convergence. However, we observe that rather than seeking a perfect K-means clustering, running a few iterations is sufficient to get a good grouping for group attention, because the later iterations typically only slightly update the clustering, and group attention is robust to such imperfection.
Third, we design a GPU-friendly implementation of K-means. The performance bottleneck of K-means is the distance computation between each vector and each center, dist(x_i, c_j) = ‖x_i − c_j‖². We instead use the equivalent formulation dist(x_i, c_j) = ‖x_i‖² − 2 x_i • c_j + ‖c_j‖². In this formulation, the bottleneck is the term x_i • c_j, which can be implemented as a matrix product operation. Although the complexity of the two formulations is the same, on GPUs a matrix product is much more efficient than pairwise differences.
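The reformulated distance reduces to one matrix product plus broadcasting, as the following minimal sketch (our own version) shows:

```python
import numpy as np

def pairwise_sq_dist(X, C):
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2. The x.c term for all
    # (vector, center) pairs is a single matrix product X @ C.T,
    # which maps far better onto GPU hardware than n*N pairwise
    # differences, even though the arithmetic complexity is the same.
    x2 = (X ** 2).sum(axis=1, keepdims=True)   # (n, 1)
    c2 = (C ** 2).sum(axis=1)                  # (N,)
    return x2 - 2.0 * (X @ C.T) + c2           # (n, N)

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 4))   # 10 key vectors
C = rng.standard_normal((3, 4))    # 3 cluster centers
D = pairwise_sq_dist(X, C)
```

Assigning each key to its nearest center is then a row-wise argmin over D, which is likewise a single vectorized operation.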

ADAPTIVE SCHEDULER
Next, we present the adaptive scheduler of RITA, which addresses the challenges of determining an appropriate number of groups N and, accordingly, the batch size B, as described in the Introduction.
Using a dynamic scheduling method, the scheduler automatically determines and adjusts N and B based on the distributional properties of the feature embeddings produced over the iterative training process, while guaranteeing a high quality attention approximation that meets the users' requirement. In Sec. 5.1 we show how RITA automatically determines N. Then, in Sec. 5.2, we introduce the learning-based method which, given an N, immediately predicts a good batch size.

Dynamically Determining the Number of Groups N
Without loss of generality, we use one group attention module as an example to show how RITA automatically determines an appropriate N. The adaptive scheduler of RITA starts with a large N and decreases it dynamically. This is because over the training process of RITA, the feature embeddings produced epoch by epoch tend to become more and more stable and gradually converge, so there is no need to increase N. RITA reduces the number of groups by merging similar groups. Intuitively, given two groups, we could measure their similarity by the distance between their centers: if this distance is smaller than a distance threshold, the two groups could be merged. However, setting an appropriate distance threshold is hard, as difficult as setting an appropriate N.
To solve this problem, RITA leverages the error bound of group attention introduced in Sec. 4: two groups are considered mergeable only if, after merging them into one cluster, the attention approximation still meets the error bound ε.
Please refer to Appendix A.6 for the proof.
Finding the Mergeable Clusters. We formulate the problem of finding mergeable clusters using graph theory: (1) each cluster is a node in the graph; (2) if C_i and C_j satisfy the mergeable condition (5), there is an undirected edge between C_i and C_j. In this formulation, finding the maximum number of mergeable clusters is equivalent to finding the minimal clique cover of the corresponding graph, which is an NP-hard problem [24]. Such heavy computation overhead is not acceptable for RITA. We thus offer a simplified solution: (1) halve the clusters into two sets S_1, S_2; (2) if C_i ∈ S_1 and C_j ∈ S_2 satisfy the mergeable condition (5), C_j is marked; (3) decrease the number of clusters by the count of marked clusters in S_2. In this solution, clusters in S_1 can be regarded as transfer nodes: if (5) holds for (C_i, C_j) and (C_i, C_k) respectively, then (4) holds when merging several clusters in S_2 with one cluster in S_1. As a result, we can greedily merge the marked clusters in S_2, as described in step (3).
Assume the number of clusters decreases by D after merging. We then apply a momentum update [42] to the number of clusters N, as is commonly used in machine learning, to smooth the change of N and avoid sample selection bias. To be specific: N ← α(N − D) + (1 − α)N, where α is a hyper-parameter for momentum.
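The momentum-smoothed update of N amounts to one line (a sketch with hypothetical values; which term carries the momentum weight α follows our reading of the update rule):

```python
def momentum_update(N, D, alpha):
    # Move only part of the way toward the merged count N - D,
    # smoothing the change in the number of groups over iterations.
    return alpha * (N - D) + (1 - alpha) * N

# e.g. 100 groups, 20 merged away, alpha = 0.25: keep 95 groups
N_new = momentum_update(100, 20, 0.25)
```

With a small α, N shrinks gradually even when many clusters are marked mergeable in a single iteration.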

Dynamically Determining the Batch Size
Because of the dynamic grouping operation, the computational graph in deep learning training [1] varies from sample to sample. As a result, it is impossible to precisely compute a batch's GPU memory usage without actually feeding it into the model. To overcome this problem, RITA learns a batch size prediction function offline; at RITA training time, given a number of groups N, RITA uses this function to predict a proper batch size.
When the model architecture and hardware are fixed, the batch size depends on the length of the timeseries L and the average number of groups across all attention modules N. So RITA samples several (L_i, N_i) pairs and estimates a proper batch size for each pair.
More specifically, given a user-defined maximal timeseries length L_max, we randomly sample integral points (L_i, N_i) from the plane {1 ≤ L ≤ L_max, 1 ≤ N ≤ L}. Then we use a binary search based algorithm to find the maximal batch size B_i that consumes less than 90% of the available GPU memory, aiming to avoid both wasting GPU memory and the risk of out-of-memory (OOM) errors.
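The binary search step can be sketched as a generic maximal-feasible-point search (our own minimal version; `fits` stands in for the actual trial run that checks whether a batch of size b stays under 90% of GPU memory, and is assumed monotone):

```python
def max_batch_size(fits, lo=1, hi=4096):
    # Largest b in [lo, hi] with fits(b) True, assuming monotonicity:
    # if a batch of size b fits in memory, any smaller batch fits too.
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits(mid):
            best, lo = mid, mid + 1   # mid fits: try something larger
        else:
            hi = mid - 1              # mid does not fit: go smaller
    return best

# Toy memory model: each sample costs 3 units, 100 units available.
b = max_batch_size(lambda s: s * 3 <= 100)   # -> 33
```

Each probe costs one trial forward/backward pass, so the maximal batch size is found in O(log hi) trials instead of a linear scan.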
Treating these pairs as ground truth labels, we use function fitting [18] to learn the batch size prediction function B = f(L, N), where B is a function of the two variables L and N.
Learning the Prediction Function. We apply curve_fit from SciPy [53] as the function fitting tool to fit the two-variable function f. We observe that applying one function to the whole plane incurs a huge estimation error. So we develop a dynamic programming (DP) method that divides the plane into several sub-planes and applies a distinct function to each sub-plane; it is optimal in minimizing the total estimation error over all sub-planes. With the learned prediction function f, we can estimate a proper batch size for any (L, N) during training, even if it is not among the sampled (L_i, N_i) pairs.
The Algorithms and Optimality Proof. Please refer to Appendix A.3 for the pseudo code of the binary search based algorithm, the description of the DP method for plane division, and the proof of its optimality.
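As a minimal stand-in for the SciPy curve_fit step, the sketch below fits one candidate functional form with ordinary least squares (the form B ≈ a/(L·N) + b is our illustrative assumption, not the paper's actual function family, and the data points are synthetic):

```python
import numpy as np

def fit_batch_predictor(Ls, Ns, Bs):
    # Least-squares fit of the hypothetical form B ~ a/(L*N) + b,
    # standing in for curve_fit over RITA's candidate functions.
    Ls, Ns = np.asarray(Ls, float), np.asarray(Ns, float)
    X = np.column_stack([1.0 / (Ls * Ns), np.ones(len(Ls))])
    (a, b), *_ = np.linalg.lstsq(X, np.asarray(Bs, float), rcond=None)
    return lambda L, N: a / (L * N) + b

# Sampled (L_i, N_i, B_i) ground-truth points (synthetic here).
Ls, Ns = [1000, 2000, 4000], [10, 20, 40]
Bs = [1e6 / (l * n) + 4 for l, n in zip(Ls, Ns)]
f = fit_batch_predictor(Ls, Ns, Bs)
```

Once fitted, f(L, N) is a closed-form expression, so predicting a batch size during training costs essentially nothing.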

EVALUATION
Our experimental study focuses on the following questions: 1. Effectiveness and efficiency of RITA: How does RITA compare with other Transformer-based methods and traditional timeseries representation learning methods in accuracy and efficiency?
2. Ablation Study: How do the key techniques of RITA work?

Experimental Setup
Datasets. We evaluate RITA on classification and imputation tasks using 5 multi-variate and 3 uni-variate timeseries datasets.
• WISDM [55] is a popular multivariate timeseries dataset generated from smartphone accelerometers. The subjects performed 18 daily activities (e.g., walking, jogging). The dataset was collected from 51 subjects, and the sampling rate is 20 Hz.
• HHAR [46] contains accelerometer data collected from 9 users performing 5 activities with 12 different smartphones (varying in sampling rate). This increases the complexity of the task and thus tests the model's robustness.
• RWHAR (RealWorld HAR) [48] covers 15 subjects performing 8 locomotion-style activities. Each subject wore the sensors for approximately ten minutes. The sampling rate is 50 Hz.
• ECG [34] consists of 10,000 ECG recordings for arrhythmia classification. Each recording has a variable length, ranging from 6 to 60 seconds sampled at 500 Hz. The recordings correspond to 9 types of heart problems, such as atrial fibrillation (AF) and premature atrial contraction (PAC).
• MGH [6] is an EEG dataset collected by Massachusetts General Hospital. Each timeseries corresponds to the EEG data observed from one patient during their stay in the ICU over a couple of days. The EEG monitoring produced data with 20 channels at a sampling rate of 200 Hz, so it yields very long timeseries.
• WISDM*/HHAR*/RWHAR* are three uni-variate datasets derived by picking one channel from WISDM/HHAR/RWHAR.
Training/Validation Data Generation. We apply a sliding window on the raw timeseries to get training/validation samples. The size of the sliding window is set to 200 on the small datasets (WISDM, HHAR, RWHAR), 2,000 on the medium-size dataset (ECG), and 10,000 on the large dataset (MGH). Table 1 shows the statistics of the generated datasets. They are randomly split into training/validation sets in a proportion of 0.9/0.1. In the "pretraining + few-label finetuning" scenario, we use 100 labeled samples per class for finetuning. We guarantee that the training set does not overlap with the validation set.
To evaluate our group attention (referred to as Group Attn.), we develop three baselines by replacing the group attention component in RITA with the classic vanilla self-attention [52] (referred to as Vanilla) and two state-of-the-art methods that reduce the complexity of self-attention by approximation in NLP, namely Performer [10] (referred to as Performer) and Linformer [54] (referred to as Linformer). Like our proposed Group Attn., Vanilla, Performer, and Linformer all use RITA's time-aware convolution operation (Sec. 3) to turn timeseries segments into input feature vectors. We also compare Group Attn. against GRAIL [40], the state-of-the-art non-deep-learning method for timeseries representation learning. GRAIL supports classification tasks by feeding the learned representations into a Support Vector Machine [12] or K-Nearest Neighbor [17] classifier. Note that GRAIL only targets uni-variate timeseries and cannot support imputation tasks.
Methodology. We mainly focus on two downstream tasks:
(1) Classification. First, we train Group Attn. and the baselines with full labels from scratch to test the effectiveness of the RITA framework and the approximation quality of our group attention.
Second, to measure the effectiveness of self-supervised pretraining, we evaluate the accuracy of training on few labeled timeseries with/without pretraining on large amounts of unlabeled timeseries. To be specific, we split the training set into a pretraining set and a finetuning set, with very little data in the latter (100 labeled samples per class in our experiments). We train the model on the cloze pretraining task with a mask rate θ = 0.2. Then we train two classification models on the finetuning set, one starting from the pretrained model and one from scratch. We repeat the experiment 5 times with random data splits and report the median accuracy.
(2) Imputation. We run the imputation task on the datasets used in classification as well as on the large unlabeled MGH dataset, and measure the mean squared error and the absolute imputation error. To obtain timeseries with missing values, we randomly mask values with an expected mask rate of 0.2. The masked values are replaced with a special value.
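The masking step above can be sketched as follows. This is a minimal illustration rather than RITA's actual implementation: it assumes each value is masked independently (so the 0.2 mask rate holds in expectation) and uses -1 as the special value, as in Appendix A.7.2; the function name is ours.

```python
import numpy as np

def mask_for_imputation(x, mask_rate=0.2, mask_value=-1.0, seed=0):
    """Randomly mask values of a timeseries (length x channels);
    masked positions are replaced by a special value."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < mask_rate   # True = value is masked
    x_masked = x.copy()
    x_masked[mask] = mask_value
    return x_masked, mask

# Values are scaled to be non-negative first, so -1 is unambiguous.
x = np.abs(np.random.default_rng(1).normal(size=(2000, 4)))
x_masked, mask = mask_for_imputation(x)
```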
Finally, to evaluate Group Attn.'s benefit in efficiency, the total time of forward computation, backward propagation, and grouping is measured for all methods in all experiments.
To save space, we only report the average training time per epoch here and refer readers to Appendix A.8 for the inference time.
We first compare against the Transformer-based methods on multi-variate datasets (Sec. 6.2, 6.3), then compare against the non-deep learning method GRAIL on uni-variate datasets (Sec. 6.4). Configuration. Please refer to Appendix A.1 for the experiment configuration and hyper-parameter settings.
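As an aside, the sliding-window sample generation used in the setup above can be sketched as follows. The stride is our assumption (the paper specifies only the window sizes), and the function name is illustrative.

```python
import numpy as np

def sliding_windows(ts, window, stride):
    """Cut a raw timeseries of shape (length, channels) into fixed-size
    training/validation samples with a sliding window."""
    starts = range(0, ts.shape[0] - window + 1, stride)
    return np.stack([ts[s:s + window] for s in starts])

# Window size 200 as used for the small datasets (WISDM, HHAR, RWHAR).
raw = np.random.default_rng(0).normal(size=(1000, 3))
samples = sliding_windows(raw, window=200, stride=200)
```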

Effectiveness: Transformer-Based Methods
We first evaluate the quality of the models trained with full labels from scratch. We then show how RITA's pretraining increases the accuracy of the downstream tasks.

full-label training (Multi-variate classification)
The results shown in Figure 3(a) yield the following observations: (1) RITA's advantage over TST. On all four datasets for the classification tasks, Group Attn. and the other three baselines that use the RITA architecture (Vanilla, Performer, and Linformer) outperform TST. In particular, Group Attn. outperforms TST by 49 percentage points on the ECG dataset (88.48% vs 39.93%), which has long timeseries. Two deficiencies in TST may cause its poor performance on long timeseries. First, TST concatenates the output embedding vector of each time stamp, then uses a linear classifier on the concatenated vector. When the timeseries is long, the linear classifier has so many parameters that it tends to overfit easily. Second, TST replaces the Layer Normalization of the vanilla Transformer with Batch Normalization. When the timeseries is long, each batch can only accommodate a small number of timeseries, leading to bias in Batch Normalization.
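The first deficiency can be made concrete with a quick parameter count, comparing a linear classifier over concatenated per-timestamp embeddings with one over a single pooled embedding. The numbers below are illustrative, not TST's actual configuration.

```python
def linear_classifier_params(in_dim, n_classes):
    """Number of parameters (weights + bias) of one linear layer."""
    return in_dim * n_classes + n_classes

length, dim, n_classes = 2000, 64, 4      # an ECG-scale series, illustrative
concat_params = linear_classifier_params(length * dim, n_classes)  # TST-style
pooled_params = linear_classifier_params(dim, n_classes)           # single embedding
print(concat_params, pooled_params)   # -> 512004 260
```

The concatenated head carries three orders of magnitude more parameters, which helps explain the overfitting on long timeseries.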
(2) Group attention's advantage over other attention mechanisms. Group Attn. is better than Performer and Linformer on 3 out of 4 datasets for classification. Although Linformer works slightly better than Group Attn. on the ECG dataset (90.37% vs 88.84%), its performance is the worst among the RITA-based methods in all other cases. Vanilla computes the attention scores precisely, so it is expected to work well. However, Group Attn. outperforms Vanilla on WISDM (87.50% vs 86.95%) and is very close to it on the other 3 datasets. This suggests that group attention's approximation quality is good.

pretraining + few label finetune (Multi-variate classification)
The results shown in Table 3 yield the following observations: (1) Pretraining is effective. Pretraining always leads to better accuracy than training with a few labels from scratch. In particular, on the WISDM data all methods using the RITA architecture increase accuracy by at least 10%. This is impressive considering that we do not have a very large unlabeled pretraining set to use.
(2) RITA's advantage over TST. Our Group Attn. and the other three baselines using the RITA architecture (Vanilla, Performer, and Linformer) significantly outperform TST on all four classification datasets, by 25 percentage points.
(3) Group attention's advantage over other attention mechanisms. Group Attn. is better than Performer and Linformer on 3 out of 4 datasets. Compared to Vanilla, Group Attn. is better on HHAR and ECG and comparable on the other two, further confirming its high approximation quality. Further, we notice that Linformer struggles in this setting: on average its accuracy is worse than Vanilla's by 8.22% and Group Attn.'s by 8.01%. This is because the low-rank projection operation introduces extra model parameters, making Linformer overfit more easily, and overfitting is especially harmful when there are only a few labeled training samples.

full-dataset training (Multi-variate imputation)
Similar to the classification tasks, the results of the imputation tasks (Table 2) show that Group Attn. consistently outperforms the baselines in training time while achieving comparable or better MSE. Again, on the large dataset MGH (length = 10,000), TST and Vanilla fail due to out-of-memory (OOM) errors. The methods using the RITA framework (Group Attn., Performer, Linformer) all achieve very low MSE, i.e., they are highly accurate. Among them, Linformer is the worst.

Efficiency: Transformer-based Methods
We measure efficiency by the average training time per epoch, including the cost of forward computation, backward propagation, and the grouping overhead. We first show the results on all 5 datasets in Sec. 6.3.1. We then vary the length of the timeseries on the MGH dataset to show group attention's scalability to long timeseries in Sec. 6.3.2.

Training Time: All Multi-variate Datasets
The results in Fig. 3(b) and Table 2 lead to the following observations:
(1) Vanilla self-attention is not scalable. On average, it takes 2-3 minutes to train one epoch when the length of the timeseries is only 200 (WISDM, HHAR, RWHAR), over 15 minutes when the length increases to 2,000 (ECG), and it fails on the long MGH data (length 10,000) by running out of GPU memory.
(2) Group Attn.'s advantage over all other attention mechanisms. As shown in Sec. 6.2, Group Attn. is more accurate than Performer and Linformer in classification and imputation tasks, while Group Attn. is always faster than Performer, Linformer, and all other baselines on all 5 multi-variate datasets: a win-win.
(3) The longer the timeseries, the larger the speedup. On the medium-sized ECG dataset with a length of 2,000, Group Attn. achieves a speedup of 3.86X/1.36X/2.27X compared to Vanilla/Performer/Linformer. When the length increases to 10,000 on the MGH dataset, the speedup on the imputation task increases to 6.59X/7.48X compared to Performer/Linformer (Vanilla and TST fail in this case), as shown in Table 2. This is because when the timeseries gets longer, Group Attn. gets more opportunities to find windows with similar properties. However, even on the short WISDM, HHAR, and RWHAR datasets, Group Attn. still consistently outperforms the other methods, confirming that it does not introduce much overhead.

Training time: Varying the Length
In this experiment, we truncate the original MGH timeseries into sequences of lengths 2000/4000/6000/8000/10000, and compare Group Attn. against Vanilla and the other attention mechanisms. Vanilla cannot handle sequences longer than 8000.
The results in Fig. 4 again show that the longer the timeseries, the larger the speedup. With comparable MSE, Group Attn. outperforms Vanilla by 63X. Moreover, as the length increases from 2000 to 10000, the training time of Group Attn. only increases from 31.2 seconds to 54.4 seconds per epoch. The reason is that as the timeseries becomes longer, there are more grouping opportunities due to the similarity of the timeseries segments.

Comparison to Non-deep Learning Methods
We compare against GRAIL, the SOTA of non-deep learning timeseries representation learning. We use the three uni-variate datasets, because GRAIL only targets uni-variate timeseries. The results in Fig. 5 show that on all 3 datasets RITA significantly outperforms GRAIL in accuracy, by 45, 16, and 21 percentage points, thanks to the expressive power of the Transformer. Moreover, thanks to RITA's GPU-friendly design, it is at least 2X faster than GRAIL in training time.

Adaptive Scheduler
To evaluate the effectiveness of RITA's adaptive scheduler (Sec. 5), we compare it against a baseline using a fixed group number N. We vary N as well as the error bound threshold used by RITA.
From the results in Table 4 we make the following observations: (1) Adaptive Scheduler is better than fixed N. Training with Adaptive Scheduler achieves better or comparable performance compared to the best-performing fixed N. More specifically, on the MGH dataset the adaptive scheduler always achieves better accuracy and is much faster than any fixed N. On the ECG dataset, although fixed N is slightly better than the adaptive scheduler in accuracy when N is set to 512, it runs much slower. Moreover, finding the best N that balances accuracy and running time requires careful tuning.
(2) Adaptive Scheduler is tuning-free. It is robust in both accuracy and running time when the error bound threshold varies, while the results of fixed N vary significantly with the value of N. Therefore, Adaptive Scheduler frees the users from tuning, while it is hard to find an appropriate N for a given dataset.

Table 5: RITA pretraining with increasing sizes of the pretraining set.

The Sizes of the Pretraining Data
Next, we evaluate how the amount of unlabeled data influences the effectiveness of pretraining. To obtain empirical results, we pretrain RITA on the WISDM dataset with 20%/40%/60%/80% of the pretraining data and finetune each pretrained model with 100 labels per class. The results in Table 5 show that: (1) The more pretraining data, the larger the improvement: accuracy increases with the size of the pretraining data. (2) Diminishing marginal utility: the accuracy gain from additional pretraining data gradually shrinks.

7 RELATED WORK

7.1 Timeseries Analytics
There is a great deal of prior work on timeseries analytics. This work can be divided into three categories: (1) non-deep learning methods; (2) CNN/RNN-based deep learning methods; and (3) Transformer-based deep learning methods.

Traditional Methods. These methods, such as TS-CHIEF [45], HIVE-COTE [33], and ROCKET [15], have achieved notable performance on public datasets. Despite that, traditional methods suffer from one or more of the following issues: they (1) rely on expert knowledge for feature extraction; (2) incur heavy computation cost and are inappropriate for GPU devices; (3) support only uni-variate timeseries; or (4) perform classification only. Prior work [61] shows that Transformer-based methods outperform these traditional methods, especially on multi-variate timeseries.
In particular, as the SOTA of timeseries representation learning, GRAIL [40] extracts landmarks from the data and computes representations as combinations of the landmarks. However, GRAIL only supports uni-variate timeseries. Our experiments (Sec. 6.4) show that RITA significantly outperforms GRAIL in both effectiveness and efficiency on uni-variate timeseries.
CNN/RNN-based Deep Learning Methods. CNN-based methods, such as InceptionTime [21] and ResNet [19], are good at classification tasks but cannot handle generative tasks such as forecasting, because of the inductive bias of convolutional networks. RNN-based methods, such as Brit [7] and DeepAR [44], are capable of classification, regression, and generation. However, the recurrent structure brings problems: (1) it limits the model's ability to capture long-range correlations; and (2) it is notoriously difficult to train [41] because of the gradient vanishing and exploding problem. As a result, such methods can hardly scale to very long timeseries.

Transformer-based Deep Learning Methods. Given that the Transformer is the backbone of choice in almost all sequence modeling tasks, some effort has been made to apply Transformers to timeseries analytics. Targeting forecasting of uni-variate timeseries, LogTrans [30] introduced a log-sparsity assumption into the attention computation. Informer [62] pushes LogTrans a step further and scales forecasting to multi-variate timeseries. Autoformer [57] performs forecasting by decomposing timeseries into two parts, i.e., the trend part and the seasonal part.
For imputation tasks, CDSA [37] outperforms statistical methods and the SOTA RNN-based method Brit [7] on 3 public and 2 competition datasets. For timeseries classification, AutoTransformer [43] performs architecture search to adapt to tasks in different domains. For timeseries anomaly detection, Anomaly Transformer [58] outperforms many widely-used methods such as OmniAnomaly [47], assuming that the attention score maps follow a Gaussian distribution.
All of these works are designed for specific tasks, rather than functioning as a representation learning framework serving different downstream tasks. To fill this gap, researchers proposed a Transformer-based architecture called TST [61]. Like RITA, TST supports regression, classification, and unsupervised learning through a "cloze test" pretraining task on timeseries. However, TST directly uses the classic vanilla self-attention and is thus not scalable to long timeseries, as shown in our experiments (Sec. 6.3.2).

Efficient Transformers
The need to improve the scalability of Transformers has led to more efficient Transformer variants, especially for accommodating long text data in NLP [49].
Introducing fixed/random patterns to the self-attention mechanism is an intuitive idea. Sparse Transformer [9] and Longformer [3] only compute attention at fixed intervals. ETC [2] and BigBird [60] use global-local attention: the attention computation is limited to within a fixed radius, while some auxiliary tokens are added to attend/be attended to globally. The deficiency of fixed attention patterns is obvious: they rely heavily on users to provide an optimal setting.
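The global-local idea can be illustrated with a small boolean attention mask. This is our own simplified sketch (one global token, a fixed local radius), not the exact ETC/BigBird construction.

```python
import numpy as np

def global_local_mask(n, radius, n_global=1):
    """Boolean attention mask (True = allowed): each token attends within
    a fixed radius, and the first n_global tokens attend / are attended
    globally, in the spirit of ETC / BigBird."""
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= radius   # local band
    mask[:n_global, :] = True                              # global rows
    mask[:, :n_global] = True                              # global columns
    return mask

mask = global_local_mask(n=8, radius=1)
```

The number of allowed entries grows linearly in the sequence length, but the radius and the choice of global tokens must be set by the user.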
To decrease the reliance on human labor, some works introduce learnable/adaptive attention patterns instead of fixed ones. Reformer [26] proposed computing only the dominant attention terms, based on the observation that attention matrices from language/image data are sparse. Such sparsity is intuitive in language data, where a word's attention mainly focuses on nearby sentences. However, attention in timeseries data shows strong seasonal patterns rather than sparse patterns, mainly as a result of the periodicity of timeseries. Therefore, such works do not work well for timeseries.
Apart from introducing attention patterns, some works tackle the problem with applied mathematics techniques. Linformer [54] performs a projection to decrease the sizes of the query, key, and value matrices before the attention computation, because the attention matrix tends to be low-rank. Performer [10] uses linear functions to approximate the softmax kernel, making the attention computation associative. When the sequence length is far greater than the dimension of the embedding vectors, Performer benefits from changing the order of the matrix multiplication. Linformer and Performer do not depend on the unique properties of language data and thus potentially fit timeseries better than the other techniques, which is why we compare against them in our experiments. However, as shown in Sec. 6, our group attention significantly outperforms them in both accuracy and efficiency (training time), because group attention fully leverages the periodicity of timeseries.
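The benefit of reordering the matrix product can be isolated in a few lines. The sketch below uses plain unnormalized linear attention and skips Performer's random-feature approximation of softmax; by associativity, Q(K^T V) never materializes the n x n matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 16                      # sequence length >> embedding dimension
Q, K, V = rng.normal(size=(3, n, d))

# Quadratic order: (Q K^T) V materializes an n x n matrix -> O(n^2 d).
out_quadratic = (Q @ K.T) @ V
# Linear order: Q (K^T V) only ever builds a d x d matrix -> O(n d^2).
out_linear = Q @ (K.T @ V)

assert np.allclose(out_quadratic, out_linear)   # identical results
```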

A APPENDIX: SUPPLEMENTARY MATERIAL

A.1 Experiment Configuration and Hyper-parameter Settings

Configuration. All models were trained on an NVIDIA Tesla V100 16GB GPU. All methods are optimized with AdamW [36], with both the starting learning rate and the weight decay parameter set to 1e-4. In the full-label training scenario, we train the models for 100 epochs. In the "pretraining + few-label finetuning" scenario, since pretrained models require fewer epochs to converge [61], we train the models for 50 epochs. For a fair comparison, the baselines use the maximal batch size within the GPU's capacity during training.
As for the model hyper-parameter settings, RITA and the baselines use a Transformer structure that balances Vanilla's accuracy and efficiency: an 8-layer stack of 2-head attention with hidden vectors of dimension 64. The convolution kernel size is set to 5 by default. We set the error bound threshold of group attention (Sec. 5.1) to 2, as it balances accuracy and efficiency in general across all datasets. Because Linformer requires users to set the size of the projection matrix, in each setting we choose an accuracy-efficiency balancing size among {64, 128, 256, 512}.

A.2 Efficient Computation of Group Attention

Algorithm 1: Efficient Computation of Group Attention
In Alg. 1, we denote N_k as the size of the k-th group, G as the number of groups, r_k as the representative key of the k-th group, R as the matrix consisting of all r_k, and g_j as the group that key k_j belongs to. Q and V are the packing matrices of the query and value vectors as described in Sec. 2. Alg. 1 outputs the packing matrix O of the new feature embeddings {o_1, ..., o_n}, where o_i corresponds to the feature embedding of window w_i. Lines 2-3 implement the embedding aggregation operation.

A.3 The Optimality of Algorithm 3

We describe Alg. 3 and intuitively show its optimality. We assume that Scipy [53] learns an optimal function in Line 4, so that the function COST gives the optimal estimation error when fitting the points in a set. When fitting very few points, we assign an infinite cost to prevent a biased fitting function (Line 2). The algorithm proceeds by dynamic programming over sub-planes. In Lines 11-13, we enumerate all possible ways of cutting a sub-plane horizontally into two sub-planes; choosing the cut that minimizes the estimation error yields the minimal estimation error for the enclosing sub-plane, which is recorded in Line 14. In Lines 17-19, we analogously enumerate all possible ways of cutting a sub-plane vertically into two sub-planes. Finally, we obtain the minimal estimation error for the whole plane. Since the algorithm never misses a better solution, it is optimal.
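The core computation of group attention (Alg. 1) can be sketched as follows. This is an illustrative re-implementation consistent with Lemma 3 rather than the exact pseudocode: we assume the clustering step has already assigned each key to a group, take the representative key as the group mean, pre-aggregate the values per group, and weight the softmax denominator by the group sizes.

```python
import numpy as np

def group_attention(Q, K, V, labels, n_groups):
    """Approximate self-attention at group granularity.
    Q, K, V: (n, d) packing matrices; labels[j] = group of key j."""
    n, d = Q.shape
    counts = np.bincount(labels, minlength=n_groups)   # group sizes
    R = np.zeros((n_groups, d))                        # representative keys
    np.add.at(R, labels, K)
    R /= counts[:, None]                               # mean key per group
    V_sum = np.zeros((n_groups, d))                    # aggregated values
    np.add.at(V_sum, labels, V)
    scores = np.exp(Q @ R.T / np.sqrt(d))              # n x n_groups, not n x n
    Z = scores @ counts                                # weighted softmax denominator
    return (scores / Z[:, None]) @ V_sum
```

When the keys within each group are identical, the output coincides with vanilla softmax attention, matching Lemma 3.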
A.4 The Correctness of Group Attention

Lemma 3. Assume that the windows belonging to the same group G_k share the same key vector, i.e., k_j = r_k for each w_j ∈ G_k. Then the feature embedding o_i produced by the original self-attention mechanism is identical to the output of our group attention mechanism implemented in Algorithm 1.
Proof. Denote r_k as the representative vector of G_k, i.e., r_k = k_j for each w_j ∈ G_k. Algorithm 1 gives the group attention output; by the canonical self-attention mechanism introduced in Sec. 2, and combining the resulting equations, we have o'_i = Σ_{j=0}^{N−1} P_{i,j} s_i v_j = o_i. This concludes that the output of our group attention is identical to vanilla self-attention's. □

A.5 The Proof of the Error Bound (Lemma 1)

Proof. We have A_{i,j} ≤ ε. This proves Lemma 1.
A.6 The Proof of the Merge Operation (Lemma 2)

Proof. Denote the cluster size of cluster_k as n_k. After merging, the new center will be:
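Reading the merge operation as replacing several clusters by one whose center is the size-weighted mean of their centers, a sketch looks as follows (the function name and this reading are our assumptions):

```python
import numpy as np

def merged_center(centers, sizes):
    """Center of the merged cluster: the size-weighted mean of the
    individual cluster centers (assuming each center is the mean of
    its members)."""
    sizes = np.asarray(sizes, dtype=float)
    return (sizes[:, None] * np.asarray(centers)).sum(axis=0) / sizes.sum()
```

This equals the plain mean of all member points pooled together.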

A.7 Downstream Tasks
RITA supports a variety of downstream tasks.In this section, we show that with minimal modification RITA can effectively support classification, imputation and forecasting tasks.Other unsupervised tasks such as similarity search or clustering are naturally supported by extracting feature embeddings from RITA.

A.7.1 Classification
To classify timeseries, we input the timeseries to the model as described in Sec. 3.

A.7.3 Forecasting
Forecasting can be regarded as a special case of imputation, in which all missing values are at the end of the timeseries.
So, as in the imputation task, we scale the timeseries to be non-negative and use a special value (−1) to indicate the values to be predicted:

X_ob(i, j) = −1 if i > t_ob, and X_rl(i, j) otherwise,

where t_ob is the last observed timestamp. The output representations are then fed into a transpose convolution layer using mean squared error as the loss function, as described above.
A.7.4 Other Unsupervised Tasks

RITA naturally supports other unsupervised tasks, such as similarity search and clustering [25,31,32], by producing an embedding of the whole timeseries (the output representation of the special token [CLS]).
Clustering can be performed on the embeddings with a flexible choice of distance metrics. Similarly, a high-dimensional similarity search system [22,23,38] can be built on the embeddings.
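For instance, once per-timeseries embeddings are extracted, a similarity search needs nothing RITA-specific. Below is a sketch with cosine similarity; the function name and the random stand-in embeddings are illustrative.

```python
import numpy as np

def cosine_top_k(query, embeddings, k=3):
    """Indices of the k embeddings most similar to `query` (cosine)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(-(e @ q))[:k]

emb = np.random.default_rng(0).normal(size=(100, 64))   # per-series [CLS] embeddings
nearest = cosine_top_k(emb[0], emb, k=5)                # emb[0] is its own best match
```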

A.8 Inference Time
In this section, we present the average inference time on the validation sets. The results in Tables 6 and 7 correspond to the average inference time on the validation sets of the classification and imputation tasks, respectively. Consistent with the results in Section 6.3, our method Group Attn. outperforms the baselines on both classification and imputation tasks, particularly on the datasets comprising long timeseries (ECG and MGH).

Figure 1: RITA Architecture

Figure 2: Group Attention

We attach a special token [CLS] as the first input embedding. [CLS]'s embedding acts as the embedding for the entire timeseries, and the output representation of [CLS] is fed into a classifier: y = Softmax(W_cls Z_[CLS] + B_cls), where Z_[CLS] ∈ R^d is the output representation of [CLS], C is the number of classes, and W_cls ∈ R^{C×d} and B_cls ∈ R^C are learnable parameters for the classification task. The resulting vector y ∈ R^C represents the probability that the input timeseries belongs to each class. We apply the cross-entropy loss as the loss function of the classification task [13]: L = (1/C) Σ_{i=1}^{C} −ŷ(i) log(y(i)), where ŷ is a binary indicator for the ground truth label: ŷ(i) = 1 if i is the ground truth label, and 0 otherwise. (16)

A.7.2 Imputation

Timeseries are mainly generated by sensors, a common problem of which is missing values. This becomes a challenge when downstream analytics require the missing values to be recovered; this recovering task is imputation. Denote the real timeseries as X_rl ∈ R^{t×m}, the observed timeseries with missing values as X_ob ∈ R^{t×m}, and the set of the missing values' positions as M. We scale the values of all timeseries to be non-negative and use a special value (−1) to indicate missing values:

X_ob(i, j) = −1 if (i, j) ∈ M, and X_rl(i, j) if (i, j) ∉ M. (17)

X_ob is fed into RITA as input, and the output representations are concatenated and fed into a transpose convolution layer that decodes the output embedding vectors from the hidden space back to timeseries values, corresponding to the convolution operation in the input stage, i.e., Y = TransposeCNN(Z_1 ⊕ Z_2 ⊕ ... ⊕ Z_n), where Y ∈ R^{t×m} is the recovered timeseries and Z_i ∈ R^d is the output at each position. Mean squared error is chosen as the loss function [51]: L = (1/|M|) Σ_{(i,j)∈M} (Y(i, j) − X_rl(i, j))².
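The classification head of A.7.1 amounts to one linear layer plus a softmax on the [CLS] representation. Below is a minimal numpy sketch; note it uses the standard cross-entropy, whereas the paper's loss additionally averages over the C classes, and all names are illustrative.

```python
import numpy as np

def classify(z_cls, W, b):
    """y = Softmax(W z_cls + b): class probabilities from the [CLS]
    output representation. z_cls: (d,), W: (C, d), b: (C,)."""
    logits = W @ z_cls + b
    e = np.exp(logits - logits.max())      # numerically stabilized softmax
    return e / e.sum()

def cross_entropy(y, label):
    """Cross-entropy against a one-hot ground truth label."""
    return -np.log(y[label])

rng = np.random.default_rng(0)
d, C = 64, 4
y = classify(rng.normal(size=d), rng.normal(size=(C, d)), np.zeros(C))
```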
It only requires users to set an error bound ε, and then uses Lemma 1 to translate ε into a distance threshold d. RITA then uses Lemma 2 to determine whether merging some given clusters still meets the error bound threshold ε.

Lemma 2. Denote c_k as the cluster center of cluster_k. Assume the existing grouping satisfies ∀k, max_{x∈cluster_k} |c_k − x| ≤ d, thus satisfying an error bound ε by Lemma 1. If there exist m clusters, namely cluster_1, cluster_2, ..., cluster_m, satisfying that:

Table 1: Statistics of the datasets.

Alternative Methods. We compare RITA against the SOTA Transformer-based timeseries representation learning method TST [61].

Table 2: Imputation results (multi-variate data). The best results are marked in bold.

Table 3: Pretraining + few-label finetuning results. The best results are marked in bold.

Table 4: Adaptive Scheduler vs. fixed N.