Sparse Binary Transformers for Multivariate Time Series Modeling

Compressed Neural Networks have the potential to enable deep learning across new applications and smaller computational environments. However, understanding the range of learning tasks in which such models can succeed is not well studied. In this work, we apply sparse and binary-weighted Transformers to multivariate time series problems, showing that the lightweight models achieve accuracy comparable to that of dense floating-point Transformers of the same structure. Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting. Additionally, to reduce the computational complexity of the attention mechanism, we apply two modifications, which show little to no decline in model performance: 1) in the classification task, we apply a fixed mask to the query, key, and value activations, and 2) for forecasting and anomaly detection, which rely on predicting outputs at a single point in time, we propose an attention mask to allow computation only at the current time step. Together, each compression technique and attention modification substantially reduces the number of non-zero operations necessary in the Transformer. We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating point operation (FLOPs) count, showing up to a 53x reduction in storage size and up to 10.5x reduction in FLOPs.


INTRODUCTION
The success of deep learning can largely be attributed to the availability of massive computational resources [27,32,53]. Models such as the Transformer [59] have changed machine learning in fundamental ways, producing state-of-the-art results across fields such as natural language processing (NLP), computer vision [8,57], and time series learning [67]. Much effort has been aimed at scaling these models towards NLP efforts on large datasets [7,16]; however, such models cannot practically be deployed on resource-constrained machines due to their high memory requirements and power consumption.
Parallel to the developments of the Transformer, the Lottery Ticket Hypothesis [20] demonstrated that neural networks contain sparse subnetworks that achieve comparable accuracy to that of dense models. Pruned deep learning models can substantially decrease computational cost, enabling a lower carbon footprint and the democratization of AI. Subsequent work showed that we can find highly accurate subnetworks within randomly-initialized models without training them [47], including binary-weighted neural networks [17]. Such "lottery-ticket" style algorithms have mostly been tested on image classification with convolutional architectures; however, some work has shown success in pruning NLP Transformer models such as BERT [9,21,30].
In this work, we extend the Lottery Ticket Hypothesis to time series Transformers, showing that we can prune and binarize the weights of the model and still maintain an accuracy similar to that of a Dense Transformer of the same structure. To achieve this, we employ the Biprop algorithm [17], a state-of-the-art technique with proven success on complex datasets such as ImageNet [15]. The combination of weight binarization and pruning is distinct from previous efforts in Transformer compression. Moreover, each compression technique offers separate computational advantages: neural network pruning decreases the number of non-zero floating point operations (FLOPs), while binarization reduces the storage size of the model. The Biprop algorithm's two compression methods rely on each other during the training process to identify a high-performing subnetwork within a randomly weighted neural network. The combination of pruning and weight binarization is depicted in Figure 1a.
We apply our approach to multivariate time series modeling. Research has shown that Transformers achieve strong results on time series tasks such as classification [67], anomaly detection [58,64], and forecasting [39,69]. Time series data arises in systems such as IoT devices [13], engines [41], and spacecraft [2,54], where new insights can be gleaned from the large amounts of unmonitored information. Moreover, such systems often suffer from resource constraints, making regular deep learning models unrealistic, for instance in the Mars rover missions, where battery-powered devices are searching for life [5]. Other systems such as satellites contain thousands of telemetry channels that require granular monitoring. Deploying large deep learning models in each channel can be extremely inefficient. As a result, lightweight Transformer models have the potential to enhance a wide variety of applications.
In addition to pruning and binarizing the Transformer architecture, we simplify the complexity of the attention mechanism by applying two modifications. For anomaly detection and forecasting, which we model using overlapping sliding window inputs, we apply an attention mask to only consider attention at the current time step instead of considering attention for multiple previous time steps. For classification tasks, we apply a static mask to the query, key, and value projections, showing that only a subset of activations is needed in the attention module to achieve the same accuracy as that obtained using all the activations.
Finally, we estimate the computational savings of the model in terms of parameters, storage cost, and non-zero FLOPs, showing that pruned and binarized models achieve comparable accuracy to dense models with substantially lower computational costs. Our contributions are as follows:
• We show that sparse and binary-weighted Transformers achieve comparable accuracy to Dense Transformers on three time series learning tasks (classification, anomaly detection, forecasting). To the best of our knowledge, this is the first research examining the efficacy of compressed neural networks on time series related learning.
• We examine pruning and binarization jointly in Transformer-based models, showing the benefits of each approach across multiple computational metrics. Weight binarization of Transformer-based architectures has not been studied previously.
These findings provide new potential applications for the Transformer architecture, such as in resource-constrained environments that can benefit from time series related intelligence.

RELATED WORK
In this section, we describe existing research related to Transformers in time series modeling, neural network pruning and compression, and finally efficient Transformer techniques.

Transformers in Time Series
Various works have applied Transformers to time series learning tasks [61]. The main advantage of the Transformer architecture is the attention mechanism, which learns the pairwise similarity of input patterns. Moreover, it can efficiently model long-range dependencies compared to other deep learning frameworks such as LSTMs [39]. Zerveas et al. [67] showed that we can use unsupervised pretrained Transformers for downstream time series learning tasks such as regression and classification. Additional work in time series classification has proposed using a "two tower" attention approach with channel-wise and time-step-wise attention [38], while other work has highlighted the benefits of Transformers for satellite time series classification compared to both recurrent and convolutional neural networks [49].
For anomaly detection tasks, Transformers have shown favorable results compared to traditional ML and deep learning techniques. Notably, Meng et al. [42] applied the model to NASA telemetry datasets and achieved strong accuracy (0.78 F1) in detecting anomalies. TranAD [58] proposed an adversarial training procedure to exaggerate reconstruction errors in anomalies. Xu et al. [64] achieve state-of-the-art results in detecting anomalies in multivariate time series via association discrepancy. Their key finding is that anomalies have high association with adjacent time points and low association with the whole series, which accentuates anomalies.
Finally, Transformer variations have been proposed for time series forecasting to lower the attention complexity of long sequence time series [37,39,69,70], add stochasticity [63], and incorporate traditional time series learning methods [62,70]. Li et al. [37] introduce LogSparse attention, which allows each cell to attend only to itself and its previous cells with an exponential step size. The Informer method [69] selects dominant queries to use in the attention module based on a sparsity measurement. Pyraformer [39] introduces a pyramidal attention mechanism for long-range time series, allowing for linear time and memory complexity. Wu et al. [63] use a Sparse Transformer as a generator in an encoder-decoder architecture for time series forecasting, using a discriminator to improve the prediction.

Compressed Neural Networks
Pruning unimportant weights from neural networks was first shown to be effective by LeCun et al. [35]. In recent years, deep learning has scaled the size and computational cost of neural networks. Naturally, research has been directed at decreasing the size [25] and energy consumption [65] of deep learning models.
The Lottery Ticket Hypothesis [20] showed that randomly initialized neural networks contain sparse subnetworks that, when trained in isolation, achieve comparable accuracy to a trained dense network of the same structure. The implications of this finding are that over-parameterized neural networks are no longer necessary, and we can prune large models and still maintain the original accuracy.
Subsequent work found that we do not need to train neural networks at all to find accurate sparse subnetworks; instead, we can find a high performance subnetwork using the randomly initialized weights [10,24,40,47]. Edge-Popup [47] applied a scoring parameter to learn the importance of each weight, using the straight-through estimator [4] to find a high accuracy mask over randomly initialized models. Diffenderfer and Kailkhura [17] introduced the Multi-Prize Lottery Ticket Hypothesis, showing that 1) multiple accurate subnetworks exist within randomly initialized neural networks, and 2) these subnetworks are robust to quantization, such as binarization of weights. In this work, we use the Biprop algorithm proposed in [17] to binarize the weights of Transformer models.

Compressed and Efficient Transformers
Large-scale Transformers such as BERT (110 million parameters) are a natural candidate for pruning and model compression [21,56]. Chen et al. [8] first showed that the Lottery Ticket Hypothesis holds for BERT networks, finding accurate subnetworks between 40% and 90% sparsity. Jaszczur et al. [29] proposed scaling Transformers by using sparse variants for all layers in the Transformer. Other works have reported similar findings [18,36], showing that sparsity can help scale Transformer models to even larger levels.
Other works have proposed modifications for more efficient Transformers aside from pruning [56]. Most research has focused on improving the $O(n^2)$ complexity of attention, via methods such as fixed patterns [46], learnable patterns [31], low rank/kernel methods [12,60], and downsampling [3,68]. Various other methods have been proposed for compressing BERT networks such as pruning via post-training mask searches [33], block pruning [34], and 8-bit quantization [66]. We refer readers to Tay et al. [56] for details.
Despite the various works compressing Transformers, we were not able to find any research using both pruning and binarization. Utilizing both methods allows for more efficient computation (measured using FLOPs) as well as a significant decrease in storage (due to binary weights). Additionally, we find that our proposed model is still a fraction of the size of compressed NLP Transformer models when trained on time series tasks. For instance, TinyBERT [30] contains 14.5 million parameters and 1.2 billion FLOPs, compared to our models, which contain fewer than 1.5 million binary parameters and 38 million FLOPs.

METHOD
Our model consists of a Transformer encoder [59] with several modifications. We base our model on Zerveas et al. [67], who propose using a common Transformer framework for several time series modeling tasks. To begin, we describe the base architecture of the Transformer as applied to multivariate time series. Subsequently, we describe the techniques used for pruning and binarization. Finally, we describe the two changes applied to the attention mechanism.

Dense Transformer
We denote fully trained Transformers with no pruning and 32-bit floating point (FP32) weights as Dense Transformers. Let $X_t \in \mathbb{R}^{w \times m}$ be a model input for time $t$ with window size $w$ and $m$ features. Each input contains $w$ feature vectors $x \in \mathbb{R}^m$, ordered in a time sequence of size $w$: $X_t = [x_{t-w}, x_{t-w+1}, ..., x_t]$. In classification datasets, $w$ is predefined at the sample or dataset level. For anomaly detection and forecasting tasks, we fix $w$ to 50 or 200 and use overlapping sliding windows as inputs.
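As an illustration of the overlapping sliding-window inputs, the following minimal Python sketch cuts a multivariate series into windows with stride 1. The function name `sliding_windows` and the toy series are assumptions made for this example, and each window here holds exactly `w` rows ending at its current time step.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[Sequence[float]], w: int) -> List[List[List[float]]]:
    """Split a (T x m) series into overlapping windows of w consecutive time steps (stride 1)."""
    if not 1 <= w <= len(series):
        raise ValueError("w must be between 1 and the series length")
    return [[list(row) for row in series[start:start + w]]
            for start in range(len(series) - w + 1)]


# Toy series with T = 6 time steps and m = 2 features (hypothetical data).
series = [[float(t), float(10 * t)] for t in range(6)]
windows = sliding_windows(series, w=3)
# Each window ends at its "current" time step; consecutive windows overlap by w - 1 rows.
```

With stride 1, a series of length T yields T - w + 1 windows, so nearly every time step serves once as the current step of some window.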
The standard architecture (pre-binarization) projects $m$ features onto a $d$-dimensional vector space using a linear module with learnable weights $W_p \in \mathbb{R}^{d \times m}$ and bias $b_p \in \mathbb{R}^d$. We use the standard positional encoder proposed by Vaswani et al. [59], and we refer readers to the original work for details. For the Dense Transformer classification models, we use a learnable positional encoder [67]. Zerveas et al. [67] propose using batch normalization instead of the layer normalization used in traditional Transformer NLP models. They argue that batch normalization mitigates the effects of outliers in time series data. We found that for classification tasks, batch normalization performed the best, while in forecasting tasks layer normalization worked better. For anomaly detection tasks we found that neither normalization technique was needed.
Each Transformer encoder layer consists of a multi-head attention module followed by ReLU layers. The self-attention module takes input $Z_t \in \mathbb{R}^{w \times d}$ and projects it onto a Query (Q), Key (K), and Value (V), each with learnable weights $W \in \mathbb{R}^{d \times d}$ and bias $b \in \mathbb{R}^d$.
Attention is defined as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^\top / \sqrt{d}\right) V$. Queries, keys, and values are projected by the number of heads ($h$) to create multi-head attention. The resultant output $Z'_t$ undergoes a nonlinearity before being passed to the next encoder layer. The Transformer consists of $N$ encoder layers followed by a final decoder layer. For classification tasks, the decoder outputs $c$ classification labels: $X'_t \in \mathbb{R}^{w \times c}$, which are averaged over $w$. For anomaly detection and forecasting, the decoder reconstructs the full input: $X'_t \in \mathbb{R}^{w \times m}$.
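For concreteness, scaled dot-product attention for a single head can be sketched in plain Python as below. This is a minimal illustration of the standard formulation of Vaswani et al. [59], assuming a single head and no bias terms; the function names are chosen for this example and are not taken from the authors' code.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d)) V for a single head."""
    d = len(Q[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d) for k_row in K]
              for q_row in Q]
    A = [softmax(row) for row in scores]
    n_cols = len(V[0])
    return [[sum(a * v_row[c] for a, v_row in zip(a_row, V)) for c in range(n_cols)]
            for a_row in A]


# With all-zero queries every score is equal, so each output row is the mean of V's rows.
out = scaled_dot_product_attention([[0.0, 0.0], [0.0, 0.0]],
                                   [[1.0, 2.0], [3.0, 4.0]],
                                   [[1.0, 2.0], [3.0, 4.0]])
```

The score matrix has one entry per pair of time steps, which is the source of the quadratic cost in the window size that the later attention modifications target.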

Sparse Binary Transformer
Central to our binarization architecture is the Biprop algorithm [17], which uses randomly initialized floating point weights to find a binary mask over each layer. Given a neural network with weight matrix $W \in \mathbb{R}^{i \times j}$ initialized with a standard method such as Kaiming Normal [26], we can express a subnetwork over neural network $f(x; W)$ as $f(x; W \odot M)$, where $M \in \{0, 1\}^{i \times j}$ is a binary mask and $\odot$ is an elementwise multiplication.
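To find $M$, a parameter $S \in \mathbb{R}^{i \times j}$ is initialized for each corresponding $W \in \mathbb{R}^{i \times j}$. $S$ acts as a score assigned to each weight, dictating the importance of the weight's contribution to a successful subnetwork. Using backpropagation as well as the straight-through estimator [4], the algorithm takes a pruning rate hyperparameter $p \in [0, 1]$, and on the forward pass computes $M_\ell$ at layer $\ell$ elementwise as
$$M_\ell = \begin{cases} 1 & \text{if } |S_\ell| \geq \tau_\ell(p) \\ 0 & \text{otherwise,} \end{cases}$$
where $\tau_\ell(p)$ denotes the $p$-th percentile of the absolute scores in layer $\ell$. That is, masks are computed by taking the absolute value of the scores for each layer and setting the mask to 1 where the value falls above the $p$-th percentile.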
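The forward-pass mask computation can be sketched as follows. This is a simplified illustration that assumes the prune rate is the fraction of weights removed in a layer and that ties in score magnitude are kept; the name `score_mask` is hypothetical, and the training of the scores with the straight-through estimator is not shown.

```python
from typing import List


def score_mask(scores: List[List[float]], p: float) -> List[List[int]]:
    """Binary mask that zeroes the fraction p of entries with the smallest |score|."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    magnitudes = sorted(abs(s) for row in scores for s in row)
    n_prune = int(p * len(magnitudes))
    if n_prune >= len(magnitudes):
        return [[0 for _ in row] for row in scores]
    threshold = magnitudes[n_prune]  # smallest magnitude that survives pruning
    return [[1 if abs(s) >= threshold else 0 for s in row] for row in scores]


# Toy score matrix for one layer (hypothetical values).
scores = [[0.1, -0.9], [0.5, -0.3]]
mask = score_mask(scores, p=0.5)
```

Because the mask depends only on the ranking of score magnitudes within a layer, the underlying random weights never need to be updated for the mask to change during training.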
To convert each layer to binary weights, Biprop introduces a gain term $\alpha \in \mathbb{R}$, which is common to Binary Neural Networks (BNNs) [45]. The gain term utilizes the floating-point weights prior to binarization during training. At test time, $\alpha$ scales the binarized weight vector. The parameter rescales binary weights $B \in \{-1, 1\}$ to $\{-\alpha, \alpha\}$, and the network function becomes $f(x; \alpha(B \odot M))$. $\alpha$ is calculated as
$$\alpha = \frac{\lVert M \odot W \rVert_1}{\lVert M \rVert_1},$$
with $M$ being multiplied by $\alpha$ for gradient descent (the straight-through estimator is still used for backpropagation). This calculation was originally derived by Rastegari et al. [48].
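A small sketch of the binarization step for one layer is given below. It assumes the gain term is the mean absolute value of the unpruned floating-point weights, which is our reading of Biprop [17] and of Rastegari et al. [48]; the function name and the convention of mapping a zero weight to +1 are choices made for this example.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def binarize_with_gain(W: Matrix, M: List[List[int]]) -> Tuple[float, Matrix]:
    """Return (alpha, alpha * (sign(W) * M)) for one layer."""
    kept = sum(sum(row) for row in M)
    if kept == 0:
        raise ValueError("mask prunes every weight")
    # Gain: L1 norm of the masked weights divided by the number of kept weights.
    alpha = sum(abs(w) * m for w_row, m_row in zip(W, M)
                for w, m in zip(w_row, m_row)) / kept
    effective = [[alpha * (1.0 if w >= 0 else -1.0) * m for w, m in zip(w_row, m_row)]
                 for w_row, m_row in zip(W, M)]
    return alpha, effective


# Toy weights and mask (hypothetical values).
W = [[0.2, -0.4], [0.6, -0.8]]
M = [[1, 0], [1, 1]]
alpha, W_eff = binarize_with_gain(W, M)
```

Every surviving entry of `W_eff` is either `alpha` or `-alpha`, so a layer can in principle be stored as one bit per weight plus a single floating-point gain.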
In our approach we create sparse and binary modules for each linear and layer normalization layer. Our model consists of two linear layers at the topmost level: one for projecting the initial input (embedding in NLP models) and one used for the decoder output. Additionally, each encoder layer consists of six linear layers: Q, K, and V projections, the multi-head attention output projection, and two additional layers to complement multi-head attention.

Attention Modifications
In this section we describe two modifications made to the attention module to reduce its quadratic complexity. Several previous works have proposed changes to attention in order to lessen this bottleneck, such as Sparse Transformers [11], ProbSparse Attention [69], and Pyramidal Attention [39]. While each of these works presents quality enhancements to the memory bottleneck of attention, we instead seek to evaluate whether simple sparsification approaches can retain the accuracy of the model compared to canonical attention. Our primary motivation for the following attention modifications is to test whether a compressed Transformer can retain the same accuracy as a Dense Transformer.

3.3.1 Fixed Q, K, and V Projection Mask. To reduce the computational complexity of the matrix multiplications within the attention module, we apply random fixed masks to the Q, K, and V projections. We hypothesize that we can retain the accuracy of full attention by using this "naive" activation pruning approach, which requires no domain knowledge. We argue that the success of this approach provides insight into the necessity of full attention computations. In other words, Transformers are expressive and powerful enough for certain tasks that we can prune the models in an unsophisticated way and maintain accuracy. Moreover, many time series datasets and datasets generated at the edge are often simple enough that we can apply this unsophisticated pruning [22,23].
To apply this pruning, on model initialization we create random masks with prune rate $p_a \in [0, 1]$ for each attention module and each projection Q, K, and V. Attention heads within the same module inherit identical Q, K, or V masks. The mask is applied to each projection during training and testing. In each of our models we set the prune rate $p_a$ of the attention module equal to the prune rate of the linear modules ($p_a = p$).
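A minimal sketch of the fixed projection mask follows. It assumes an elementwise random mask over a (window size x model dimension) activation matrix, drawn once from a seeded generator; the exact mask shape and sampling scheme used by the authors are not specified in the text, so `make_fixed_mask` and `apply_mask` are hypothetical helpers.

```python
import random
from typing import List


def make_fixed_mask(w: int, d: int, prune_rate: float, seed: int) -> List[List[int]]:
    """Draw a random 0/1 mask once; each entry is zeroed with probability prune_rate."""
    rng = random.Random(seed)
    return [[0 if rng.random() < prune_rate else 1 for _ in range(d)] for _ in range(w)]


def apply_mask(X: List[List[float]], mask: List[List[int]]) -> List[List[float]]:
    """Elementwise product of an activation matrix with a fixed mask."""
    return [[x * m for x, m in zip(x_row, m_row)] for x_row, m_row in zip(X, mask)]


# One mask per projection, created at initialization and reused for every batch.
masks = {name: make_fixed_mask(w=4, d=6, prune_rate=0.5, seed=i)
         for i, name in enumerate(["Q", "K", "V"])}
Q = [[1.0] * 6 for _ in range(4)]  # stand-in for a projected query matrix
Q_masked = apply_mask(Q, masks["Q"])
```

Since the masks never change after initialization, the positions of the zeroed activations are known ahead of time, which is what allows the corresponding multiply-adds to be skipped.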

3.3.2 Step-t Attention Mask. For anomaly detection and single-step forecasting tasks, the Sparse Binary Transformer (SBT) algorithm relies on reconstructing or predicting outputs at the current time step $t$ for each feature $m$, despite $w$ time steps of data being provided to the model. Specifically, the SBT model is only interested in input vector $x_t \in \mathbb{R}^m$. For anomaly detection, the model reconstructs $x_t$ from the input, while in forecasting tasks the model masks $x_t = 0$ prior to model input and reconstructs the actual values during training and inference.
In both tasks, vector $x_t$ contains the only values necessary for the model to learn, and our loss function reflects this by only computing error for these values. As a result, computing attention for each other time step adds unnecessary computation. As depicted in Figure 2, we pass a static mask to the attention module to compute attention only at step $t$. We additionally exclude attention computation at step $t$ with itself, forcing the variable to attend to historical time points for prediction. Finally, we add diagonal ones to the attention mask at all past time points to add stability to training. This masking method allows us to propagate the full input sample to multiple attention layers, helping us retain relevant historical information for downstream layers, which would not be possible if we changed the sizes of Q, K, and V to model only time step $t$.
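The static mask described above can be built as in the sketch below, which uses the convention that `True` marks a query/key pair that is allowed to attend. This follows the textual description only (Figure 2 is not reproduced here), and the function name is an assumption; boolean mask conventions differ between libraries, so the mask may need to be inverted before being passed to a particular attention implementation.

```python
from typing import List


def step_t_attention_mask(w: int) -> List[List[bool]]:
    """allowed[i][j] is True where query step i may attend to key step j."""
    if w < 2:
        raise ValueError("need at least one past time step")
    allowed = [[False] * w for _ in range(w)]
    for i in range(w - 1):
        allowed[i][i] = True        # past steps: identity entries only
        allowed[w - 1][i] = True    # current step: attends to every past step
    # allowed[w - 1][w - 1] stays False: step t does not attend to itself
    return allowed


mask = step_t_attention_mask(4)
active = sum(sum(row) for row in mask)  # 2 * (w - 1) entries instead of w * w
```

Only the last row of the score matrix carries real computation, so the number of active entries grows linearly rather than quadratically with the window size.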

EXPERIMENTS
In this section we detail our experiments for time series classification, anomaly detection, and forecasting. Common to each learning task, we normalize each dataset prior to training such that each feature dimension $m$ has zero mean and unit variance. We use the Transformer encoder as described in Section 3, training each learning task and dataset using the Dense Transformer and the SBT to compare accuracy. Finally, we run each experiment three times with a different weight seed, and present the average result. For the SBT model, varying the weight seed provides evidence of its robustness to hyperparameters. Specific modifications to the model are made for each learning task, which we describe in the following sections. Additional training and architecture details can be found in the Appendix.

Classification
For our first time series learning task we select several datasets from the UCR Time Series Classification Repository [1]. The datasets vary in sample count (204-30,000), number of features $m$, and window size $w$. We choose the three datasets with the largest test set size (Insect Wingbeats, Spoken Arabic Digits, and Face Detection) as well as two smaller datasets (Japanese Vowels, Heartbeat). Each dataset has a fixed window size except for Insect Wingbeats and Japanese Vowels, which contain window sizes of up to 30 and 29, respectively. In these datasets, we pad samples with smaller windows to give them consistent window sizes. The decoder in our classification architecture is a classification head, rather than a full reconstruction of the input as is used in anomaly detection and forecasting tasks.
The SBT classification model is trained and tested using the fixed Q,K,V projection mask as described in Section 3.3.

Results
In Table 1, we show that SBTs perform as well as, or similar to, the Dense Transformer for each dataset at $p = 0.5$ and $p = 0.75$. Our models are averaged over three runs with different weight seeds. When comparing our model to state-of-the-art approaches, we find that the SBT achieves strong results across each dataset, with the highest reported performance on three out of the five datasets. Further, the SBT models perform consistently across datasets while models such as Rocket [14] and Fran et al. [19] have lower performance on one or more datasets. Surprisingly, the SBT model achieves stronger average accuracy than the Dense Transformer (80.2% versus 78.8%), indicating that the pruned and binarized Transformer achieves a robust performance across datasets. Despite this, the Insect Wingbeats and Japanese Vowels datasets achieved a slightly lower performance at $p = 0.5$ with a more substantial dropoff at $p = 0.75$, indicating the model may lose some of its power on certain tasks.

Anomaly Detection
For the anomaly detection task we test the SBT algorithm on established multivariate time series anomaly detection datasets used in previous literature: Soil Moisture Active Passive Satellite (SMAP) [28], Mars Science Laboratory rover (MSL) [28], and the Server Machine Dataset (SMD) [54]. SMAP and MSL contain telemetry data such as radiation and temperature, while SMD logs computer server data such as CPU load and memory usage. The datasets contain benign samples in the training set, while the test set contains labeled anomalies (either sequences of anomalies or single point anomalies).
Table 2 (caption fragment): [...] indicate that when given time to stabilize after an anomalous event, our SBT framework can detect new anomalies with high accuracy. We evaluate our results using a manual threshold ($r$ = 0.5% for SMD, 1% for others) and the POT automatic threshold selector.
Our model takes sliding window data as input and reconstructs data at $x_t$ given previous time points. We use MSE to reconstruct each feature in $x_t$. We use the step-t attention mask as described in Section 3.3.2. To evaluate our results, we adopt an adjustment strategy similar to previous works [51,54,58,64]: if any anomaly is detected within a successive abnormal segment of time, we consider all anomalies in this segment to have been detected. The justification is that detecting any anomaly in a time segment will cause an alert in real-world applications.
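The adjustment strategy can be sketched as follows. This is a generic illustration of segment-wise point adjustment as commonly used in the cited works, not the authors' evaluation code; `point_adjust` and the toy labels are assumptions made for the example.

```python
from typing import List, Sequence


def point_adjust(pred: Sequence[int], label: Sequence[int]) -> List[int]:
    """Mark a whole ground-truth anomaly segment as detected if any point in it was flagged."""
    adjusted = list(pred)
    i, n = 0, len(label)
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:   # find the end of this anomalous segment
                j += 1
            if any(pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted


# Two labeled segments (indices 1-3 and 5-6) and one false positive at index 4.
label = [0, 1, 1, 1, 0, 1, 1, 0]
pred = [0, 0, 1, 0, 1, 0, 0, 0]
adjusted = point_adjust(pred, label)
```

Predictions outside labeled segments are left untouched, so false positives still count against precision after the adjustment.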
To flag anomalies, we retrieve reconstruction loss $x'_t$ and threshold $\delta$, and consider anomalies where $x'_t > \delta$. Since our model is trained with benign samples, anomalous samples in the test set should yield a higher $x'_t$. We compute $\delta$ using two methods from previous works: a manual threshold [64] and the Peak Over Threshold (POT) method [52]. For the manual threshold, we consider proportion $r$ of the validation set as anomalous. For SMD $r = 0.5\%$, and for MSL and SMAP $r = 1\%$. For the POT method, similar to OmniAnomaly [54] and TranAD [58], we use the automatic threshold selector to find $\delta$. Specifically, given our training and validation set reconstruction losses, we use POT to fit the tail portion of a probability distribution using the generalized Pareto distribution. POT is advantageous when little information is known about a scenario, such as in datasets with an unknown number of anomalies.
Table 3 (caption): We compare our SBT framework with several state-of-the-art algorithms on the anomaly detection task. The table is ordered by average F1 accuracy across each dataset. We evaluate our algorithm using the traditional method (different from Table 2), where each sample can contain anomalous events in its input window. We use a manual threshold to report results for the SBT model.
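The manual thresholding step can be illustrated with the sketch below, which picks the threshold so that roughly a given proportion of validation losses lies above it. The helper names and the rounding rule are assumptions made for this example; the POT selector is not shown.

```python
from typing import List, Sequence


def manual_threshold(val_losses: Sequence[float], r: float) -> float:
    """Pick the threshold so that roughly a proportion r of validation losses exceeds it."""
    ordered = sorted(val_losses)
    n_flag = round(len(ordered) * r)
    index = max(len(ordered) - n_flag - 1, 0)
    return ordered[index]


def flag_anomalies(losses: Sequence[float], threshold: float) -> List[int]:
    """1 where the reconstruction loss exceeds the threshold, else 0."""
    return [1 if loss > threshold else 0 for loss in losses]


val_losses = [float(i) for i in range(200)]   # toy validation losses
delta = manual_threshold(val_losses, r=0.01)  # r = 1%, as used for MSL and SMAP
flags = flag_anomalies(val_losses, delta)
```

The same threshold is then applied to test-set reconstruction losses, where anomalous samples are expected to produce larger values than the benign data the model was trained on.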

Results
In Table 2 we report the unique findings of our single-step anomaly detection method using Precision, Recall, and F1-scores. Specifically, we find that when only considering inputs with fully benign examples in window $w$, both the SBT and the Dense Transformer achieve high accuracy on all three datasets (F1 between 90.6 and 100). In other words, we find that our model performance is best when we filter out examples that have an anomalous sequence or data point in $[x_{t-w}, x_{t-w+1}, ..., x_{t-1}]$. For SMD, $w = 200$, and for SMAP and MSL, $w = 50$. This observation implies that the model needs time to stabilize after an anomalous period. Intuitively, if an anomaly occurred recently, new benign observations will have a higher reconstruction loss as a result of their difference with the anomalous examples in their input window. We argue that this validation metric is logical in real-world scenarios, where monitoring of a system after an anomalous period of time is necessary. We additionally report F1-scores compared to state-of-the-art time series anomaly detection models in Table 3. To accurately compare our model against existing methods, we use the full test set without filtering out benign inputs with anomalies in the near past. SBT results are much more modest, with F1-scores between 70 and 88. Despite this, our method still performs better than non-temporal algorithms such as the Isolation Forest, as well as other deep-learning based approaches such as Deep-SVDD and BeatGAN.

Forecasting
We test our method on single-step forecasting using the step-t attention mask. Specifically, using the framework outlined by Zerveas et al. [67], we train our model by masking the input at the forecasting time step $t$. For example, input $X_t$ containing $m$ features and $w$ time steps $[x_{t-w}, x_{t-w+1}, ..., x_t]$ is passed through the network with $x_t = 0$. We then reconstruct this masked input with the Transformer model, using mean squared error between the masked input's reconstruction and the actual value. The masking method simulates unseen future data points during training, making it compatible with the forecasting task during deployment.
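The masking scheme and the loss restricted to the current time step can be sketched as below. The "model" in this example is a stand-in that echoes its input, used only so the snippet runs end to end; the helper names are hypothetical.

```python
from typing import List

Window = List[List[float]]


def mask_current_step(window: Window) -> Window:
    """Copy the window and zero out its last row (the step to be forecast)."""
    masked = [list(row) for row in window]
    masked[-1] = [0.0] * len(masked[-1])
    return masked


def current_step_mse(reconstruction: Window, target: Window) -> float:
    """Mean squared error restricted to the last time step."""
    pred, true = reconstruction[-1], target[-1]
    return sum((p - y) ** 2 for p, y in zip(pred, true)) / len(true)


window = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy window with w = 3, m = 2
model_input = mask_current_step(window)
# A stand-in "model" that simply echoes its input; a trained network would go here.
loss = current_step_mse(model_input, window)
```

Because the true values at the last step are hidden from the model, the training objective matches deployment, where the value at the current step is not yet observed.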
We test our model on three datasets used in previous works:
• ECL contains the electricity consumption of 321 clients in kWh. The dataset is converted to hourly consumption values due to missing data.
• Weather contains data for twelve hourly climate features for 1,600 locations in the U.S.
• ETTm1 (Electricity Transformer Temperature) contains 15-minute interval data including oil temperature and six additional power load features.
Additional training details are available in the Appendix.
We compare our method against the Informer [69] and the Pyraformer [39] trained with single-step forecasting. Both are current state-of-the-art models that have shown robust results compared against a variety of forecasting techniques. Importantly, both methods support multivariate time series forecasting, which some other approaches do not. We note that these models are built primarily for long-term time series forecasting (LSTF), which we do not cover in this work.

Results
We evaluate results in Table 4 using MSE and MAE on the test set of each dataset. Results indicate that the SBT model achieves accuracy comparable to the Dense architecture in each dataset at $p = 0.5$. Interestingly, the Weather and ETTm1 SBT models achieved better accuracy than the dense model at $p = 0.5$. Both models additionally showed robustness to higher prune rates, with accuracy dropping off slowly. ECL, on the other hand, showed some sensitivity to prune rate, with a slight drop-off when increasing the prune rate. We find that datasets with a higher dimensionality performed the worst: ECL contains 321 features, while Insect Wingbeats contains 200. Increasing the dimensionality of the model ($d$) mitigated some of these effects; however, it was at the cost of model size and complexity. Despite this, we find that the SBT model is able to predict the general trend of complex patterns in data, as depicted in Figure 3.
Compared to state-of-the-art approaches such as the Pyraformer and Informer architectures, our general purpose forecasting approach performs comparably, or slightly worse, on the single-step forecasting task. Metrics were not substantially different for any of the models except on the ECL dataset, where Pyraformer was easily the best model. Comparing the architectures, we find that the SBT model achieves substantially lower computational cost than both the Informer and Pyraformer models. For example, on the ECL dataset, Pyraformer contains 4.7 million parameters and the Informer 12.7 million parameters (both FP32), while the SBT model contains 1.5 million binary parameters.

Architecture
Each model in our framework consists of 2 encoder layers, each with a multi-head attention module containing two heads. The feedforward dimensionality for each model is 256, with ReLU used for nonlinearity. Classification models had the best results using Batch Normalization layers, similar to [67], while forecasting models used Layer Normalization typical of other Transformer models. For anomaly detection we did not use Batch or Layer Normalization. For the output of our models, anomaly detection and forecasting rely on a single decoder linear layer which reconstructs the output to size ($w$, $m$), while classification outputs size ($w$, number of classes) and takes the mean over $w$ to formulate a final classification prediction. Further details are included in the Appendix and the code repository.

COMPUTATIONAL SAVINGS
In this section we estimate the computational savings achieved by using the SBT model.We will begin by introducing the metrics used to estimate computational savings, and will then summarize the results of these metrics for each model and task.
We note that several works (highlighted in Section 2) have proposed modifications to the Transformer in order to make attention more efficient.In this section, we concentrate on the enhancements achieved by 1) creating a sparsely connected Transformer with binary weights, and 2) simplifying the attention module for time series specific tasks such as single-step prediction and classification.We argue that these enhancements are independent of the achievements made by previous works.

Metrics
FLOPs (Non-zero).In the field of network pruning, FLOPs, or the number of multiply-adds, is a commonly used metric to quantify the efficiency of a neural network [6].The metric computes the number of floating point operations required for an input to pass through a neural network.We use the ShrinkBench tool to calculate FLOPs, a framework proposed by Blalock et al. [6] to perform standardized evaluation on pruned neural networks.
Our Transformer architecture contains FP32 activations at each layer along with binary weights scaled to {−,  }. As a result, no binary operations are performed, and our total FLOPs count is a function of prune rate . For example, a linear module with a standard FLOPs count of  × has a new FLOPs count of  × × , where  ∈ [0, 1]. Linear layers outside of attention do not need window size added to the matrix multiply because the inputs are permuted such that batch size is the second dimension of the layer input. Each equation counts the number of nonzero multiply-adds necessary for the neural network. Furthermore, we modify the FLOPs for the attention module to account for the step-t attention mask and the fixed Q, K, V mask, as summarized in Table 5. In the standard attention module, where Q, K and V are equal sized projections, the matrix multiply operations (QK⊺, AV) for each head equate to  ′  2 , where  ′ = /ℎ. For step-t attention, we only require computation at the current time step (the last row in Figure 2), while each of the identities for past time steps equates to one. AV requires double the computations because V contains FP32 activations multiplied by the diagonal in A. For the fixed mask, since Q and K are sparse projections, we only require (  ) 2 nonzero computations in the matrix multiply. Since A is a dense matrix, we require  2 FLOPs to multiply sparse matrix V.
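The counting rules above can be sketched as follows. This is a simplified illustration under stated assumptions, not the expressions of Table 5: the function and parameter names are hypothetical, `keep_fraction` is assumed to be the fraction of weights that remain nonzero, and the step-t count is derived only from the description above (one dense row for the current time step plus one diagonal entry for each past step).

```python
def linear_flops(in_features: int, out_features: int, keep_fraction: float = 1.0) -> float:
    """Nonzero multiply-adds for one input vector passing through a linear layer.

    keep_fraction is an assumed name for the share of weights left nonzero.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    return in_features * out_features * keep_fraction


def dense_attention_flops(window: int, head_dim: int, num_heads: int) -> int:
    """Multiply-adds for the two attention matrix products with a dense mask."""
    scores = head_dim * window * window    # Q times K transposed
    weighted = head_dim * window * window  # A times V
    return num_heads * (scores + weighted)


def step_t_attention_flops(window: int, head_dim: int, num_heads: int) -> int:
    """Multiply-adds when only the current (last) time step attends to the window."""
    scores = head_dim * window            # only the last row of the score matrix
    nonzeros_in_a = window + (window - 1)  # last row plus diagonal of past steps
    weighted = head_dim * nonzeros_in_a    # roughly double the score cost
    return num_heads * (scores + weighted)


# Illustrative values only: a window of 50 steps, two heads of size 32.
dense = dense_attention_flops(window=50, head_dim=32, num_heads=2)
step_t = step_t_attention_flops(window=50, head_dim=32, num_heads=2)
pruned_linear = linear_flops(64, 256, keep_fraction=0.5)
```

Under these assumptions the dense count grows quadratically with the window size while the step-t count grows linearly, which is the effect the step-t mask is meant to achieve.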
A simplified equation for network FLOPs becomes 2 +  (2 + ), where  is a linear layer,  is the number of attention layers, and  is the multihead attention FLOPs (details described in Table 5). Several FLOP counts are omitted from this equation, which we include in our code, including positional encoding, Q-scaling, and layer and batch norm.
Storage Size. We measure the size of each model in total bits. Standard networks rely on weights optimized with the FP32 data type (32 bits). We consider each binarized module in our architecture to contain single bit weights with a single FP32  parameter for each layer. Anomaly detection and classification datasets contain 14 binarized modules, and forecasting contains 18 with the additional binarization of the layer normalization. We note that the binarized quantities are only theoretical because the PyTorch framework does not support a binary data type. Hardware limitations are also reported in other works [20].
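The storage accounting described above can be written out as a short calculation. The sketch below assumes that every binary weight costs exactly one bit and that each binarized module adds one 32-bit scale parameter; the function names are hypothetical, and the example reuses the ECL parameter counts quoted earlier purely as an illustration, so the resulting numbers are not the values reported in Table 6.

```python
FP32_BITS = 32


def dense_size_bits(num_params: int, bits_per_weight: int = FP32_BITS) -> int:
    """Size of a model whose weights all use the same bit width."""
    return num_params * bits_per_weight


def sbt_size_bits(num_binary_weights: int, num_binarized_modules: int) -> int:
    """Theoretical size: one bit per weight plus one FP32 scale per module."""
    return num_binary_weights + num_binarized_modules * FP32_BITS


# Illustrative inputs: 4.7 million FP32 parameters versus 1.5 million binary
# parameters spread over 18 binarized modules (the forecasting setting).
fp32_bits = dense_size_bits(4_700_000)
int8_bits = dense_size_bits(4_700_000, bits_per_weight=8)
binary_bits = sbt_size_bits(1_500_000, num_binarized_modules=18)
```

The per-module scale parameters contribute only a few hundred bits in total, so the binary weight count dominates the theoretical size.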

Model Size Selection
Important to our work is tuning the size of each model. We analyze whether we can create a Dense Transformer with a smaller number of parameters and still retain performance on par with a larger model. Our motivation for model size selection is two-fold: 1) previous research has found that neural networks need to be sufficiently overparameterized to be pruned and retain the same accuracy as the dense model, and 2) the time series datasets studied in this paper have a smaller number of dimensions than the vision datasets studied in most pruning and model compression papers. The effect of model overparameterization is that we need a dense model with enough initial parameters in order to prune it and still retain high performance. Theoretical estimates of the number of required parameters are proposed by the Strong Lottery Ticket Hypothesis [43,44] and are further explored in other pruning papers [10,17]. On the other hand, the limited features of some time series datasets (such as Weather with 7 features) lead us to ask whether we could simply create a smaller model.
To alter the model size, we vary the embedding dimension  of the model. To find the ideal size of the model, we start from a small embedding dimension (such as 8 or 16), and increase the value in the Dense Transformer until the model performance on the validation set stops increasing. With this value of , we test the SBT model.
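The search procedure described above can be sketched as a simple loop. The function `select_embedding_dim` and the callback `validation_score` are hypothetical names, the sketch assumes a score where higher is better, and the scores in the example are made-up placeholders rather than results from the paper.

```python
from typing import Callable, Iterable, Optional


def select_embedding_dim(
    candidates: Iterable[int],
    validation_score: Callable[[int], float],
) -> Optional[int]:
    """Grow the embedding dimension until the validation score stops increasing.

    validation_score is assumed to train a Dense Transformer with the given
    embedding dimension and return its validation metric (higher is better).
    """
    best_dim, best_score = None, float("-inf")
    for dim in sorted(candidates):
        score = validation_score(dim)
        if score <= best_score:
            break  # no further improvement: keep the previous size
        best_dim, best_score = dim, score
    return best_dim


# Placeholder scores standing in for real training runs.
made_up_scores = {8: 0.71, 16: 0.80, 32: 0.86, 64: 0.90, 128: 0.90, 256: 0.89}
chosen_dim = select_embedding_dim(made_up_scores, made_up_scores.__getitem__)
```

With the placeholder scores, the loop stops at the first size that fails to improve on its predecessor and returns the size before it; the SBT model would then be tested at that embedding dimension.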
Our results show that in each dataset, Dense Transformers with a smaller embedding dimension  either a) perform worse than the SBT at the optimized size, b) contain more parameters (as measured in total bits), c) have more FLOPs, or d) some combination of the above. In almost every dataset, the smaller Dense Transformer performs worse than the SBT while also requiring more size and FLOPs. The exception to this was Spoken Arabic Digits, where the smaller Dense Transformers ( = 16 and  = 32) performed slightly better than the SBT with  = 64. Additionally, these models had a lower FLOPs count. The advantage of the SBT model in this scenario was a substantially lower storage cost than both smaller Dense models. Even if both Dense Transformer models could be quantized to 8-bit weights, the storage of the SBT would still be many times lower. The ETTm1 dataset additionally had high-performance Dense Transformers with a smaller size ( = 16,  = 32). However, both models were substantially more costly in terms of storage and additionally had a higher FLOPs count. Detailed results are provided in the Appendix.

Analysis
Results in Table 6 highlight the large computational savings achieved by SBT. We find that layer pruning reduces the FLOPs count (because fewer nonzero computations remain), while binarization helps with the storage size.
Notably, all models have a FLOPs count at least two times lower than the original Dense model. FLOPs are dramatically reduced in the anomaly detection and forecasting datasets, largely due to the step-t masking. Classification datasets have a dense attention matrix, leading to a smaller FLOPs reduction due to the softmax operation and the AV calculation (where V is sparse). We note that using a higher prune rate can reduce the FLOPs further; however, we include results at a 50% prune rate for classification since these models achieved slightly better accuracy.
We highlight the storage savings of SBT models by measuring bit size and parameter count. Table 6 summarizes the substantial reduction in bit size for every model, with only two SBT models having a bit size greater than 1 million (Insect Wingbeats and ECL). The two models with a larger size also had the highest dimensionality , and consequently .
We note that SBT models contain a small number of FP32 values due to the single  parameter in each module. Additionally, we forego a learnable encoding layer in SBT classification models, leading to a smaller overall count. Finally, no bias term is added to the SBT modules, leading to a smaller number of overall parameters.
Compared to other efficient models, our model generally has a lower FLOPs count. For example, MobileNetV2 [50] has 16.4 million FLOPs when modeling CIFAR10, while EfficientNetV2 [55] has 18.1 million parameters.

DISCUSSION
We show that Sparse Binary Transformers attain similar accuracy to the Dense Transformer across three multivariate time series learning tasks: anomaly detection, forecasting, and classification. We estimate the computational savings of SBTs by counting FLOPs as well as the total size of the model.

Applications
SBTs retain high performance compared to dense models, coupled with a large reduction in computational cost. As a result, SBTs have the potential to impact a variety of new domains. For example, sensors and small embedded systems such as IoT devices could employ SBTs for intelligent and data-driven decisions, such as detecting a malicious actor or forecasting a weather event. Such devices could be extended into new areas of research such as environmental monitoring. Other small capacity applications include implantable devices, healthcare monitoring, and various industrial applications.
Finally, lightweight deep learning models can also benefit larger endeavors. For example, space and satellite applications, such as those behind the MSL and SMAP telemetry datasets, collect massive amounts of data that are difficult to monitor. Employing effective and intelligent algorithms such as the Transformer could help in the processing and auditing of such systems.

Limitations and Future Work
Although SBTs theoretically reduce computational costs, the method is not optimized for modern libraries and hardware. Python libraries do not store binarized weights as single bits, but as 8-bit values. Special hardware in IoT devices and satellites could additionally make implementation a burden. Additionally, while our implementation shows that sparse binarized Transformers exist, the Biprop algorithm requires backpropagation over a dense network with randomly initialized FP32 weights. Hence, finding accurate binary subnetworks requires more computational power during training than during deployment. This may be a key limitation in devices seeking autonomy. In addition to addressing these limitations, a logical step for future work would be to implement SBTs in state-of-the-art Transformer models such as the Pyraformer for forecasting and the Anomaly Transformer for time series anomaly detection.
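To illustrate the gap between theoretical and practical storage, the sketch below packs binary weight signs into bytes by hand, one bit per weight, using only the Python standard library. It is not part of the SBT implementation: the function names and the `alpha` argument (standing in for a per-layer FP32 scale) are hypothetical, and pruned (zero) weights would additionally need their own mask, which is not shown.

```python
from typing import List, Sequence


def pack_signs(signs: Sequence[int]) -> bytes:
    """Pack a sequence of +1/-1 weights into bytes, one bit per weight."""
    packed = bytearray((len(signs) + 7) // 8)
    for index, sign in enumerate(signs):
        if sign > 0:
            packed[index // 8] |= 1 << (index % 8)  # set bit for a positive sign
    return bytes(packed)


def unpack_signs(packed: bytes, count: int, alpha: float = 1.0) -> List[float]:
    """Recover `count` weights as +alpha or -alpha from the packed bits."""
    return [
        alpha if (packed[index // 8] >> (index % 8)) & 1 else -alpha
        for index in range(count)
    ]


weights = [1, -1, -1, 1, 1, 1, -1, 1, -1, 1]
packed_weights = pack_signs(weights)
restored = unpack_signs(packed_weights, len(weights))
```

Ten weights occupy two bytes in packed form, whereas storing each weight as an 8-bit value would take ten bytes; the weights must still be unpacked before an ordinary floating-point matrix multiply can use them.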
SBTs have the potential to enable widespread use of AI across new applications. The Transformer stands as one of the most powerful deep learning models in use today, and expanding this architecture into new domains provides promising directions for the future.

SUPPLEMENTAL MATERIALS

A ABLATION STUDIES
We conduct two ablation studies testing the effects of removing the individual pruning mechanisms from the attention computation. We note that the attention pruning methods complement Biprop: Biprop mainly reduces the model size, whereas attention pruning is more effective at reducing the FLOPs. Each ablation experiment is averaged over three experimental runs with different seeds.
Table 1 highlights the effects of removing random pruning from the time series classification models. Notably, Biprop plus random pruning performs comparably to, or better than, Biprop on its own. Adding random pruning even outperforms using only Biprop on the Japanese Vowels dataset.
Table 2 highlights the results of attention variations for both anomaly detection and forecasting tasks. Specifically, we look at our proposed approach (Biprop+Step-T Mask), Biprop plus an identity matrix mask in the attention layers, and finally Biprop only. We report results using mean squared error (MSE) loss averaged over three runs.
Results show that Biprop plus the Step-T mask performs comparably to using Biprop only. For anomaly detection tasks, the MSE is even lower compared to just using Biprop. Comparing both methods to Biprop plus the identity matrix attention mask, we can see a significant difference in the results: the identity matrix attention mask attains a higher loss in each case.
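The three attention variants compared in this ablation differ only in which entries of the attention matrix are allowed to be nonzero. The sketch below builds each mask as a nested list, following the description that past time steps attend only to themselves while the latest time step attends to the whole window. The function names and the convention that 1 means "attention allowed" are assumptions made for illustration.

```python
from typing import List

Mask = List[List[int]]


def full_mask(window: int) -> Mask:
    """Biprop only: every time step may attend to every other time step."""
    return [[1] * window for _ in range(window)]


def identity_mask(window: int) -> Mask:
    """Identity mask: every time step attends only to itself."""
    return [[1 if row == col else 0 for col in range(window)] for row in range(window)]


def step_t_mask(window: int) -> Mask:
    """Step-T mask: identity for past steps, a full row for the latest step."""
    mask = identity_mask(window)
    mask[-1] = [1] * window  # the current time step attends to the whole window
    return mask


def count_nonzeros(mask: Mask) -> int:
    """Number of attention entries that need to be computed or stored."""
    return sum(sum(row) for row in mask)


window = 6
nonzeros = {
    "full": count_nonzeros(full_mask(window)),
    "identity": count_nonzeros(identity_mask(window)),
    "step_t": count_nonzeros(step_t_mask(window)),
}
```

For a window of six steps the Step-T mask keeps 11 of the 36 entries, only five more than the identity mask, yet it preserves the one row through which the current prediction can see the past.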

B TRAINING DETAILS
Each model is trained with Adam optimization with a learning rate of 1e-3, except for InsectWingbeats, where we use a learning rate of 1e-4. For Dense Transformer classification models we use a learnable positional encoding, while in all other models we use a standard positional encoding. We found that SBT models sometimes take slightly longer to converge; hence, we train the models for more epochs in the forecasting and classification tasks. These numbers are specified in the configuration files in the code repository. Batch Normalization is used for classification tasks, Layer Normalization is used for forecasting tasks, and no normalization is used for anomaly detection.

C ANALYSIS

C.1 Attention Magnitude Pruning versus Random Pruning
As part of our attention pruning analysis, we also applied magnitude pruning to the attention layers. However, this method requires extra computation as a result of the sorting required to take the top activations for each input. Below we compare the results of magnitude pruning versus random pruning, finding that random pruning achieves similar accuracy to magnitude pruning at a lower computational cost.
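The difference in cost between the two pruning schemes can be seen in a small sketch. A fixed random mask is drawn once and reused for every input, whereas a magnitude mask has to rank the activations of each input before it can be applied. The function names, the `keep_fraction` parameter, and the use of plain Python lists are assumptions for illustration and do not mirror the repository code.

```python
import random
from typing import List, Sequence


def fixed_random_mask(num_features: int, keep_fraction: float, seed: int = 0) -> List[int]:
    """Draw one random mask up front; it is reused unchanged for every input."""
    rng = random.Random(seed)
    keep = round(num_features * keep_fraction)
    kept = set(rng.sample(range(num_features), keep))
    return [1 if index in kept else 0 for index in range(num_features)]


def magnitude_mask(activations: Sequence[float], keep_fraction: float) -> List[int]:
    """Keep the largest-magnitude activations; needs a sort for each input."""
    keep = round(len(activations) * keep_fraction)
    order = sorted(range(len(activations)), key=lambda i: abs(activations[i]), reverse=True)
    kept = set(order[:keep])
    return [1 if index in kept else 0 for index in range(len(activations))]


def apply_mask(activations: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Zero out the activations that the mask does not keep."""
    return [value * keep for value, keep in zip(activations, mask)]


example = [0.1, -3.0, 0.5, 2.0]
random_mask = fixed_random_mask(len(example), keep_fraction=0.5)
pruned_by_magnitude = apply_mask(example, magnitude_mask(example, keep_fraction=0.5))
pruned_at_random = apply_mask(example, random_mask)
```

The random mask costs nothing at inference time beyond the elementwise multiply, while the magnitude mask repeats a sort for every input, which is the extra computation referred to above.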

E MODEL SIZE SELECTION
We measure model performance as well as computational cost at varying sizes for each model. To vary the size, we increase the embedding dimension  for each model and dataset combination. Tables 8 and 9 show the results for each model size and dataset combination. Overall, we find that the SBT generally performs better than the smaller Dense Transformer, except in a few cases. In all scenarios, the SBT model has at least one computational advantage in terms of storage size or FLOPs count.
Additionally, we find that, consistent with our intuition, datasets with a higher dimensionality  need a higher embedding dimension, while simpler datasets are successful with a smaller embedding dimension. For example, Insect Wingbeats ( = 200), Face Detection ( = 144), and ECL ( = 321) require  ≥ 128 to achieve optimal performance.


Figure 1: A sparse binary linear layer (left) and various attention modules (right). a) An example of a sparse and binary linear module, with binary weights B scaled to {−,  }. b) A fully-connected attention module, where each point represents a time step ( = 6). c) The Step-T attention module, where each past time point attends to itself and the latest time point  attends to all past time points. d) An attention module with sparse Query (Q), Key (K), and Value (V) activations.

Figure 2: Step-t Attention Mask. Left: For the forecasting task we mask inputs during training in order to simulate unknown future time points. Right: The Step-T attention mask used to calculate attention only at the current timestep versus past values. Using this mask rather than setting our Query dimension to one enables us to pass time window vectors along multiple encoder layers.

Table 1: Accuracy of time series classification models on five datasets. Results are obtained from [19,67]. SBT models achieve higher accuracy than prior works (excluding the Dense Transformer) in each case, except for the Japanese Vowels dataset. Additionally, SBT models achieve accuracy within 2.7% of the Dense Transformer for each dataset.

Table 2: Anomaly detection results with benign sample windows. We evaluate Precision (P), Recall (R), and the F1 score using both a manual threshold and the POT threshold technique. We find that the single time step prediction window achieves high accuracy when each past time-step in  is benign.  = 200 for SMD and  = 50 for SMAP and MSL.

Table 3: F1 scores of various time series anomaly detection models.

Table 5: Non-zero FLOPs equations for various attention modules. These calculations assume Q, K and V are equal sized projections in R × , and  ′ = /ℎ. QK⊺ and AV are additionally multiplied by ℎ. Q-scaling and softmax FLOPs are excluded from this table.


Table 6: Computational savings for Dense Transformers compared to SBTs. SBT models achieve a substantial reduction in size and FLOPs count across all models. We denote parameters in thousands and size and FLOPs in millions, with savings calculated by dividing the Dense values by the SBT values.

Table 1: We compare Biprop with Biprop plus random pruning on classification tasks. We find that random pruning of the attention activations does not hurt classification accuracy, and in fact helps it in the case of the Japanese Vowels dataset.

Table 2: We compare Biprop plus the Step-T attention mask with two other methods. We find that Biprop with the Step-T mask performs similarly to using Biprop with full attention (Biprop Only). Biprop with an Identity Mask on the attention computation performs worse than the other two methods. We report results using MSE loss averaged across three runs.

Table 3: Random pruning versus activation magnitude pruning. We find that random pruning achieves similar accuracy to magnitude pruning with lower computational cost.

C.2 Model size savings of Biprop versus Pruning
In Table 4 we compare the model size savings of Biprop with those of 32-bit pruning as well as pruning plus quantization (8-bit). We show that, even compared to pruning plus 8-bit quantization, Biprop achieves a substantially lower model size.

Table 4: Comparison of the size between Biprop, 32-bit pruning, and 32-bit pruning + quantization. Biprop achieves the greatest model size compression by a large degree.

D DATASET DETAILS
We report the details of datasets used for each task below. For anomaly detection and forecasting tasks, we set the window size  to a fixed value, while in classification,  is predefined.

Table 5: A summary of classification datasets.

Table 6: A summary of forecasting datasets.

Table 7: A summary of anomaly detection datasets.

Table 8: Classification model size selection: Performance of various sized models on each classification dataset. We include the parameter count as well as FLOPs for both the dense and sparse binary Transformer models. Parameters are FP32 in the Dense Transformer and binary in the SBT.

Table 9: Forecasting model size selection: Performance of various sized models on each forecasting dataset. We include the parameter count as well as FLOPs for both the dense and sparse binary Transformer models. Parameters are FP32 in the Dense Transformer and binary in the SBT.