Learning Autoregressive Model in LSM-Tree based Store

Database-native machine learning operators are highly desired for efficient I/O and computation. While most existing machine learning algorithms assume that the time series data are fully available and readily ordered by timestamps, this is not the case in practice. Commodity time series databases store the data in pages with possibly overlapping time ranges, known as LSM-Tree based storage. Data points in a page could be incomplete, owing to either missing values or out-of-order arrivals, and may be complemented by imputed or delayed points in subsequent pages. Likewise, data points in a page could also be updated by points in another page, for dirty data repairing or re-transmission. A straightforward idea is thus to first merge and order the data points by timestamps, and then apply the existing learning algorithms. This is not only costly in I/O but also prevents pre-computation of model learning. In this paper, we propose to learn the AR models offline and locally in each page over the incomplete data, and to aggregate the stored models of different pages online, taking the aforesaid inserted and updated data points into consideration. Remarkably, the proposed method has been deployed and included as a function in an open-source time series database, Apache IoTDB. Extensive experiments in the system demonstrate that our proposal LSMAR shows up to one order-of-magnitude improvement in learning time cost. It needs only about tens of milliseconds for learning over 1 million data points.


INTRODUCTION
IoT data are often collected at a preset frequency, e.g., every second, leading to time series with regular intervals. However, data arrivals are often out-of-order, owing to various network delays in IoT scenarios [11, 16]. Moreover, some points could be missing and inserted back by a data imputation program [21, 25]. Furthermore, dirty values are also detected and updated later, by requesting point re-transmission or by a data repairing program [26].
To handle the aforesaid out-of-order arrivals, point insertions for imputation, value updates for repairing and so on [18], most commodity time series databases, such as Apache IoTDB [3], employ Log-Structured Merge-Tree (LSM-Tree) [20] based storage. As illustrated in Figure 1, data points are batched into disk pages according to their arrival times. Some points, such as the one at time 11:00:07, are delayed and batched with other data points in page 2. The corresponding position in page 1 is thus denoted by a hollow circle. Likewise, the missing point at time 11:00:02 is imputed, again in page 2. Moreover, the point at time 11:00:05 with a dirty value in page 1 is repaired later by the one at the same time in page 2, while the point at time 11:00:09 is updated by data re-transmission.
To learn models over a time series scattered in different pages with possibly overlapping time ranges, a straightforward idea is to first load the data from disk pages into memory. Then, the pages are merged, by inserting the delayed or imputed points and updating the re-transmitted or repaired points. Existing learning algorithms [8] can thus be applied over the merged time series ordered by time. This is obviously inefficient in I/O cost, since many stale data points are read and merged online. Moreover, the insertion and update of data points also prevent directly utilizing the model pre-computed for an individual page when it is written to disk.
In this paper, we propose to design efficient schemes for online aggregating the models pre-computed locally in each page. The autoregressive (AR) model is considered, since it is simple enough to learn during the flush of the corresponding page to disk in database ingestion. As illustrated in Figure 1, the model is pre-trained over the incomplete time series in page 1 and stored. Intuitively, when new data points appear, e.g., in page 2, the pre-computed models could be fine-tuned rather than learned from scratch.

Table 1: Notations

Symbol        Description
x             time series of n = ∥x∥ data points
(t_h, x_h)    time and value of the h-th data point in x
p             the autoregressive model order
γ_i           the i-th order auto-covariance
              validity of time series x
φ_i           the i-th autoregressive model coefficient
Our major contributions in this study are as follows.
(1) We devise the learning process of models in a single page in Section 3, including learning models with imputation, and updating models with modified points.
(2) We propose the efficient aggregation of models learned in different pages, with the consideration of inserted and updated values, in Section 4. For pages in various cases, i.e., adjacent, disjoint and overlapped time ranges, we derive the corresponding model aggregation strategies, respectively. The theoretical results in Propositions 4.1, 4.3, 4.5 and 4.7 guarantee the correctness of model aggregation, i.e., equivalence to the baseline of learning over the merged time series.
(3) We present the algorithm for learning models on time series scattered in different pages in Section 5. The complexity analysis shows that our proposal is more efficient than the baseline of learning from scratch over the online merged time series. Moreover, we provide the details of the system deployment, as a function in Apache IoTDB [3], an open-source time series database. The documentation is available on the product website [4], and the code is included in the product repository by system developers [1].
(4) We conduct extensive experiments for evaluation in Section 6. Our proposal LSMAR shows up to one order-of-magnitude improvement in learning time cost, compared to the aforesaid baseline of online merging data and learning from scratch. It needs only about tens of milliseconds for learning over 1 million data points, while the baseline takes hundreds. The experiment code and public data are available anonymously in [2] for reproducibility.
Finally, we discuss related work in Section 7 and conclude the paper in Section 8. Table 1 lists the notations used in the paper.

PRELIMINARY
For a better comprehension of our proposal, we first introduce the baseline autoregressive model in Section 2.1, and present the structure of LSM-Tree based storage with an example in Section 2.2. Section 2.3 introduces the problem of learning autoregressive models in LSM-Tree based storage.

Autoregressive Model
The autoregressive model fits a point x_l by utilizing its past few points, i.e., x_{l−1}, x_{l−2}, . . . , x_{l−p}, where the parameter p determines the number of past points used for fitting.

Definition 2.1 (Autoregressive Model [8]). An autoregressive model of order p, denoted AR(p), is defined as

x_l = φ_1 x_{l−1} + φ_2 x_{l−2} + · · · + φ_p x_{l−p} + ε_l,

where φ_1, . . . , φ_p are the parameters of the model, and ε_l denotes the white noise N(0, σ²) at timestamp l.
It is worth noting that not all time series can be modeled by autoregressive models. Generally, autoregressive models are applied to weak stationary time series.

Definition 2.2 (Weak Stationary Time Series [15]). Given a time series x, by considering the value x_l at timestamp l as a continuous random variable, x is called a weak stationary time series if

E(x_l) = μ,   Cov(x_l, x_{l+s}) = γ_s,   for all l,

where μ is a constant and γ_s depends only upon the lag s.
For simplicity, we consider all time series mentioned in the article as zero-mean time series. For a time series with mean μ not equal to 0, we replace the value x_l at timestamp l with x_l − μ.

To learn an AR(p) model over a zero-mean series x of n data points, the i-th order auto-covariance is estimated by

γ_i = (1 / (n − i)) Σ_{l=1}^{n−i} x_l x_{l+i},    (1)

and the model coefficients φ_1, . . . , φ_p are obtained by solving the Yule-Walker equations over the auto-covariances,

γ_i = Σ_{j=1}^{p} φ_j γ_{|i−j|},   i = 1, . . . , p.    (2)
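To make the estimation pipeline concrete, the following is a minimal sketch in Python of Formula (1) and Equation (2), assuming a zero-mean series; the function names are illustrative and not part of the system.

```python
# A minimal sketch of AR(p) estimation via Formula (1) and the Yule-Walker
# equations (2); assumes a zero-mean series, names are illustrative.
import numpy as np

def autocovariances(x: np.ndarray, p: int) -> np.ndarray:
    # gamma_i = sum_{l=1}^{n-i} x_l * x_{l+i} / (n - i), Formula (1)
    n = len(x)
    return np.array([x[: n - i] @ x[i:] / (n - i) for i in range(p + 1)])

def yule_walker(gamma: np.ndarray) -> np.ndarray:
    # Solve the p x p Toeplitz system of Equation (2) for phi_1..phi_p.
    p = len(gamma) - 1
    Gamma = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(Gamma, gamma[1:])

x = np.array([0.3, -0.1, 0.4, 0.2, -0.3, 0.1, 0.5, -0.2])
phi = yule_walker(autocovariances(x - x.mean(), p=2))
```

In practice, the Toeplitz system can be solved in O(p²) time by the Levinson-Durbin algorithm instead of a general solver, as discussed in Section 5.2.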

LSM-Tree Storage
To support frequent and extensive writing and reading of time series data, the Log-Structured Merge-Tree (LSM-Tree) [20] is often employed in time series databases. We follow the convention of Apache IoTDB, an LSM-Tree based time series database.
Figure 2 presents an overview of the LSM-Tree storage structure, where a time series is stored in multiple pages, i.e., P_1 to P_6, with possibly overlapped time intervals. Each page consists of a PageHeader (denoted by blue rectangles), recording metadata, e.g., StartTime and EndTime, and PageData (denoted by red rectangles), storing the batched data received in a time period. Note that each page is associated with a version, and the higher the version is, the later the batched data points are received. If pages have overlapped time intervals, e.g., P_1 and P_2, the page with the higher version overwrites the page with the lower version.
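For illustration, a sketch of the per-page metadata assumed throughout this paper is given below; besides StartTime and EndTime, the PageHeader additionally carries the pre-computed auto-covariances of Section 3. The field names are illustrative only, not IoTDB's actual on-disk layout.

```python
# Hypothetical per-page metadata; field names are illustrative only.
from dataclasses import dataclass

@dataclass
class PageHeader:
    start_time: int        # StartTime of the batched points
    end_time: int          # EndTime of the batched points
    version: int           # higher version overwrites lower on overlap
    count: int             # number of data points in the page
    gamma: list            # gamma_0..gamma_p, computed at flush time
```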
Figure 2: Aggregating models of pages in different cases

Example 2.3. Figure 1 presents a time series stored in two pages, page 1 and page 2, recording the oil temperatures in the tank of a sailing ship. The vertical axis denotes the page version, and the value of each point is suggested by the relative height in the dotted rectangle, with range from 320 to 327. Due to transmission issues, some points may be missing or delayed. For instance, the point at 11:00:02 in page 1 is not received on time, and the delayed point arrives after the data points in page 1 are flushed to disk. Thus, the delayed point is batched with other points in a page with a higher version 2, leading to overwriting. Besides, the point at 11:00:07 in page 1 is a missing point (denoted by hollow circles), imputed by linear interpolation for model learning when flushing to disk, referring to Section 3.1.
Moreover, from the perspective of the LSM-Tree storage structure, the pages in Figure 1 correspond to the overlapped pages P_1 and P_2 in Figure 2. In addition to the case where pages have overlapped time intervals, pages may also be disjoint (e.g., pages P_3 and P_4 in Figure 2) or adjacent (e.g., pages P_5 and P_6 in Figure 2), which will be further introduced in Section 4 in detail.

Learning Models in LSM-Tree based Storage
We are now ready to introduce the model learning process in LSM-Tree based storage. Considering the overwriting mentioned in Section 2.2, our aim is to efficiently aggregate the models of multiple pages by utilizing the page metadata.
To address the overlapped pages and out-of-order data points, a straightforward method is to merge all pages into one series, and then learn the model coefficients from scratch. However, by utilizing the properties of LSM-Tree storage, a more efficient method is to compute the aggregated coefficients from the metadata of each page.
Example 2.4 (Example 2.3 continued). Figure 3 shows the series stored in two pages, the same as illustrated in Figure 1, while with the vertical axis denoting the value. The red thick line denotes the merged series, considering that some points in page 2 may overwrite those in page 1. Notably, the overwriting only occurs from 11:00:02 to 11:00:09, that is, there is an efficient way to update the auto-covariances by only considering the influence of the points from 11:00:02 to 11:00:09. Specifically, given model order p = 1, we only need to update the auto-covariances over the segment from 11:00:01 to 11:00:13, referring to Formula (1) in Section 2.1. However, the baseline method merges the pages online, and then learns the model on the merged series from scratch, which could be costly when the page size is large.

LEARNING MODELS IN A PAGE
In this section, we consider two possible cases when learning models in a page. Section 3.1 elaborates the learning process with missing points. Section 3.2 introduces the updating process when some points are modified, which is further utilized to re-calculate the auto-covariances of updated segments in Section 4.3.

Learning Models with Imputation
Due to the harsh environment in which the sensors operate, a sensor sometimes goes off-line, leading to missing values in the sensor data stream [17]. This may affect the learning process of models, since timestamps are no longer consecutive. Moreover, transmission failures may also cause missing values. We follow a convention similar to [12] to learn model coefficients with imputation in a single page, that is, utilizing a simple imputation method, linear interpolation, to fill in each missing value in the page. Again, overly complicated imputation methods may not be affordable, since imputation is done during database ingestion. With imputation, the timestamps of data points in a page are consecutive, and we can follow the preliminary in Section 2.1 to learn models. That is, we calculate the auto-covariances of data points in a page, and store them in the metadata, e.g., the PageHeader in Figure 2. Then we solve Equation (2) to obtain the model coefficients.
Example 3.1 (Example 2.3 continued). Consider again the example in Figure 1. For simplicity, we use x_l to denote the value at timestamp 11:00:l hereinafter, e.g., x_3 denotes the value at timestamp 11:00:03. During the learning process of each individual page, the missing points at 11:00:02 and 11:00:07 in page 1 are linearly interpolated from their temporally nearest points, i.e., x̂_2 = (x_1 + x_3) / 2 and x̂_7 = (x_6 + x_8) / 2. Then the model is learnt on the imputed complete series in page 1, and the same process applies to page 2.
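As a minimal sketch (assuming a regular sampling interval; names are illustrative), the per-page learning at flush time can be written as follows, with np.interp performing the linear interpolation.

```python
# Sketch of per-page learning at flush time: impute missing timestamps by
# linear interpolation, then compute gamma_0..gamma_p (Formula (1)).
import numpy as np

def learn_page(times: np.ndarray, values: np.ndarray, interval: int, p: int):
    full = np.arange(times[0], times[-1] + interval, interval)
    x = np.interp(full, times, values)   # linear interpolation of gaps
    x = x - x.mean()                     # zero-mean convention, Section 2.1
    n = len(x)
    return [x[: n - i] @ x[i:] / (n - i) for i in range(p + 1)]
```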

Updating Models with Modified Points
Apart from missing values, due to sensor data re-transmission or low quality data repairing [26], some points may be modified by other points with the same timestamps. With one point modified, the auto-covariances γ_i change accordingly, referring to the following proposition.

Proposition 3.2. Consider a point h with value x_h in segment x[1 : n], with the auto-covariances γ_0, γ_1, . . . , γ_p of the segment obtained. If the value of the point is modified to x′_h, the auto-covariances of the segment can be updated by

γ′_0 = γ_0 + ((x′_h)² − (x_h)²) / n,
γ′_i = γ_i + (x′_h − x_h)(x_{h−i} + x_{h+i}) / (n − i),   1 ≤ i ≤ p,

where x_{h−i} is set to 0 if h − i < 1, and x_{h+i} is set to 0 if h + i > n.
Thereby, if some points are modified, the auto-covariances can be updated by applying Proposition 3.2 to each modified point in turn, instead of being re-computed from scratch.
Example 3.3. Consider again the merged series in Figure 3. The point at 11:00:09 is modified from x_9 = 323.7 to x′_9 = 324.1, owing to the re-transmission as aforesaid. Considering only the influence of modifying this point, the first order auto-covariance can be updated by γ′_1 = γ_1 + (x′_9 − x_9)(x_8 + x_{10}) / (n − 1), referring to Proposition 3.2.
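A sketch of the point update in Proposition 3.2 (0-based indexing; names are illustrative):

```python
# Update gamma_0..gamma_p in O(p) after x[h] is modified to x_new,
# instead of recomputing the sums from scratch (Proposition 3.2).
def update_gammas(gamma, x, h, x_new, p):
    n = len(x)
    delta = x_new - x[h]
    gamma[0] += (x_new**2 - x[h]**2) / n
    for i in range(1, p + 1):
        left = x[h - i] if h - i >= 0 else 0.0   # out-of-range terms are 0
        right = x[h + i] if h + i < n else 0.0
        gamma[i] += delta * (left + right) / (n - i)
    x[h] = x_new
    return gamma
```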

AGGREGATING MODELS OF TWO PAGES
In this section, we consider three different cases for aggregating the models of two pages x[1 : h] and x[g : n], i.e., adjacent pages in Section 4.1, disjoint pages in Section 4.2 and overlapped pages in Section 4.3. Overlapped pages refer to two pages with overlapped time intervals, i.e., the start time of the high-version page is earlier than the end time of the low-version page. Moreover, if the start time of the high-version page is later than the end time of the low-version page by more than 1 sampling interval, the two pages are considered as disjoint pages. Otherwise, if the gap between the start time of the high-version page and the end time of the low-version page is exactly 1 sampling interval, the two pages are adjacent pages. For instance, Figure 2 presents 6 pages P_1 to P_6 stored in the database, covering the three situations mentioned above. Pages P_1 and P_2 are overlapped pages with overlapped time intervals, pages P_3 and P_4 are disjoint pages, and pages P_5 and P_6 are adjacent pages.
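Under the assumption of a regular sampling interval, the case can be decided from the page metadata alone, e.g.:

```python
# Decide the aggregation case from the gap between the low-version page's
# EndTime and the high-version page's StartTime (illustrative sketch).
def classify(low_end: int, high_start: int, interval: int) -> str:
    gap = high_start - low_end
    if gap <= 0:
        return "overlapped"   # Section 4.3
    if gap == interval:
        return "adjacent"     # Section 4.1
    return "disjoint"         # Section 4.2
```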
Note that once the auto-covariances are obtained, the model coefficients can be estimated by solving Equation (2) in O(p²) time [10]. Thereby, we focus on the time-consuming part, i.e., the calculation of the auto-covariances γ_i in Formula (1). We show below how γ_i can be efficiently updated in various scenarios of pages, in Propositions 4.1, 4.3, 4.8 and so on.

Aggregation of Adjacent Pages
Two adjacent non-overlapped segments can be aggregated easily, by calculating the weighted sum of the auto-covariances of the two segments together with the cross terms between the tail of the former page and the head of the latter page. Proposition 4.1 below gives the expression of the aggregated auto-covariances.

Proposition 4.1. For two adjacent non-overlapped segments x[1 : h], x[h + 1 : n], the i-th order auto-covariance of x[1 : n] can be obtained by

γ_i = ((h − i) γ_i^{(1)} + (n − h − i) γ_i^{(2)} + Σ_{l=h−i+1}^{h} x_l x_{l+i}) / (n − i),

where γ_i^{(1)} and γ_i^{(2)} denote the i-th order auto-covariances of x[1 : h] and x[h + 1 : n], respectively.

Example 4.2. Consider the adjacent pages in Figure 4. With pre-computed metadata, the first order auto-covariances of both pages can be directly obtained, i.e., γ_1^{(1)} = 0.49 and γ_1^{(2)} = −0.052. Since the point at 11:00:05 has value x_5 = 0.2 and the point at 11:00:06 has value x_6 = 0.6, referring to Proposition 4.1, the aggregated γ_1 is computed from the weighted sum of γ_1^{(1)} and γ_1^{(2)} plus the cross term x_5 x_6.
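A sketch of Proposition 4.1 follows (0-based indexing; tail and head are the last and first p raw values of the two pages, kept alongside the metadata in this sketch):

```python
# Aggregate lag-0..lag-p auto-covariances of two adjacent segments; only the
# i cross products at the boundary touch raw data (Proposition 4.1).
def aggregate_adjacent(g1, g2, tail, head, n1, n2, p):
    n = n1 + n2
    gamma = [(n1 * g1[0] + n2 * g2[0]) / n]   # lag 0 has no cross terms
    for i in range(1, p + 1):
        cross = sum(tail[-(i - j)] * head[j] for j in range(i))
        gamma.append(((n1 - i) * g1[i] + (n2 - i) * g2[i] + cross) / (n - i))
    return gamma
```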

Aggregation of Disjoint Pages
In practice, not all pages are adjacent and can be aggregated by Proposition 4.1 in Section 4.1. Generally, due to machines or sensors going off-line for a short period, there may be a stretch of missing values between pages, i.e., the pages are disjoint. In this section, we thus consider the aggregation of two disjoint pages under different conditions: large disjoint length in Section 4.2.1, and small disjoint length in Section 4.2.2.
The aggregation result of two disjoint pages x[1 : h], x[g : n] actually depends upon the imputation method. For general imputation methods, it takes O(pd) extra time for aggregation, in addition to the imputation time, where d denotes the disjoint length, i.e., d = g − h. However, if we use linear interpolation for imputation, the aggregation process can be accelerated by utilizing the properties of linear interpolation.

Large Disjoint Length.
Consider two disjoint pages with disjoint length d larger than i + 1, where i is the order of the auto-covariance, ranging from 1 to p. In such a case, the disjoint length is large enough to ensure that the tail of x[1 : h] and the head of x[g : n] will not influence each other. Thus, we only need to calculate the i-th order auto-covariance of the imputed series, and aggregate x[1 : h], x[g : n] and the imputed series as adjacent pages. The following proposition formalizes the aggregation process and utilizes the property of linear interpolation for acceleration.

Proposition 4.3. For two disjoint segments x[1 : h], x[g : n], denoting the disjoint length by d = g − h, if d − i − 1 > 0, then the i-th order auto-covariance of x[1 : n] can be obtained by aggregating x[1 : h], the imputed segment x̂[h + 1 : g − 1] and x[g : n] as adjacent segments by Proposition 4.1, and γ′_i has the following form,

γ′_i = (1 / (d − 1 − i)) Σ_{j=1}^{d−1−i} (x_h + j Δx / d)(x_h + (j + i) Δx / d),

where γ′_i denotes the i-th order auto-covariance of the imputed series, and Δx = x_g − x_h denotes the difference between x_g and x_h. Expanding the sum with Σ_j j and Σ_j j², γ′_i can be evaluated in O(1) time.
Example 4.4. Consider the disjoint pages in Figure 5, with linear interpolation applied to the missing points between page 1 and page 2 in Figure 5(a). The imputed points are denoted by red hollow circles, and for simplicity, we denote the imputed point at 11:00:l by x̂_l. The disjoint length between the two pages is d = 7 − 3 = 4.
Given i = 1, d − i − 1 = 4 − 1 − 1 > 0 satisfies the condition in Proposition 4.3, that is, the disjoint length is large enough to ensure that the two pages will not influence each other. Since x_3 = −0.9 and x_7 = −0.1, we have Δx = 0.8, and the imputed points are x̂_4 = −0.7, x̂_5 = −0.5 and x̂_6 = −0.3. The first order auto-covariance of the imputed series is thus γ′_1 = ((−0.7)(−0.5) + (−0.5)(−0.3)) / 2 = 0.25, obtained in O(1) time by the closed form in Proposition 4.3.

Small Disjoint Length.
Proposition 4.3 considers two disjoint pages with a relatively large disjoint length, i.e., d > i + 1, while Proposition 4.5 considers its complement, i.e., d ≤ i + 1. In such a case, the disjoint length is so small that the i-th order auto-covariance of the imputed series is 0, since the imputed segment contains no pair of points at lag i. Besides, the tail of x[1 : h] and the head of x[g : n] will influence each other, unfortunately. Proposition 4.5 aggregates disjoint pages by considering the effect between the tail of x[1 : h], the imputed series and the head of x[g : n],

γ_i = ((h − i) γ_i^{(1)} + (n − g + 1 − i) γ_i^{(2)} + Σ_{l=h−i+1}^{g−1} x̂_l x̂_{l+i}) / (n − i),

where x̂_l = χ(l) x_l + (1 − χ(l)) (x_h + (l − h) Δx / d), and χ(l) is the characteristic function with the following form,

χ(l) = 1, if l ≤ h or l ≥ g; χ(l) = 0, otherwise.

Example 4.6. Consider the disjoint pages in Figure 5 again with disjoint length d = 4, while given i = 3. Since d − i − 1 = 4 − 3 − 1 = 0 satisfies the condition in Proposition 4.5, the tail of page 1 and the head of page 2 will influence each other. Given the pre-computed auto-covariances of the two pages, the aggregated third order auto-covariance is obtained by Proposition 4.5 with the normalizer n − i = 12 − 3 = 9, i.e., γ_3 = 0.17.
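The O(1) evaluation of γ′_i in Proposition 4.3 can be sketched as follows, expanding the sum with the identities Σj = m(m+1)/2 and Σj² = m(m+1)(2m+1)/6:

```python
# Lag-i auto-covariance of the d-1 linearly interpolated points between
# x_h and x_g, in O(1) time (sketch of the closed form in Proposition 4.3).
def gamma_imputed(x_h, x_g, d, i):
    dx = (x_g - x_h) / d                 # per-step increment of interpolation
    m = d - 1 - i                        # number of lag-i pairs
    if m <= 0:
        return 0.0                       # too short, Section 4.2.2
    s1 = m * (m + 1) / 2                 # sum of j
    s2 = m * (m + 1) * (2 * m + 1) / 6   # sum of j^2
    # sum_{j=1..m} (x_h + j*dx) * (x_h + (j+i)*dx), expanded
    return (m * x_h * x_h + dx * x_h * (2 * s1 + i * m)
            + dx * dx * (s2 + i * s1)) / m
```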

Aggregation of Overlapped Pages
We next consider the aggregation of overlapped segments. For a time series x[1 : n] stored in two segments x[1 : h], x[g : n] with overlapping time intervals, we denote by x′ the updated segments whose points are merged from multiple original segments. For the non-updated segments, whose points can be directly obtained from one original segment, we keep the notation x, with ending indexes different from the original segments. Intuitively, the overlapped segments x[1 : h], x[g : n] can be split into two non-overlapped segments x′[1 : h], x[u : n], where u = h + 1. The auto-covariances of x[1 : n] can then be derived by aggregating the non-overlapped segments x′[1 : h], x[u : n], referring to the following proposition.

Proposition 4.7. For two overlapped segments x[1 : h], x[g : n] with g ≤ h, let x′[1 : h] denote the updated segment whose points in [g, h] are overwritten by the high-version page, and let u = h + 1. Then the auto-covariances of x[1 : n] can be obtained by aggregating the non-overlapped segments x′[1 : h] and x[u : n] by Proposition 4.1.

In real unordered scenarios, only a few points are involved in the overlap and modified. Thus, we can utilize Proposition 3.2 to calculate the auto-covariances of the updated segment x′[1 : h], by considering each modified point in turn.
Moreover, since sensors usually re-transmit values within a short period, the overlap length is small in general. Therefore, there is an intuitive way to efficiently calculate the auto-covariances of the split non-updated segments, that is, eliminating the influence of the discarded points, referring to the following proposition.

Proposition 4.8. For a non-updated segment x[u : v] ⊆ x[g : n], with g ≤ u ≤ v ≤ n, its auto-covariances can be obtained by eliminating the influence of the discarded points, i.e.,

γ_i^{[u:v]} = ((n − g + 1 − i) γ_i^{(2)} − Σ_{l=g}^{u−1} x_l x_{l+i} − Σ_{l=v−i+1}^{n−i} x_l x_{l+i}) / (v − u + 1 − i),

where γ_i^{(2)} denotes the i-th order auto-covariance of x[g : n].

Example 4.9. Consider the overlapped pages in Figure 6, where the points from 11:00:04 to 11:00:07 in page 2 overwrite those in page 1. As aforesaid, the overlapped pages can be split into two non-overlapped segments, i.e., the updated segment x′[1 : h] and the non-updated segment x[u : n], split by the two dotted red lines in the figure. For the updated segment x′[1 : h], we only need to consider the influence of the points from 11:00:04 to 11:00:07 by Proposition 3.2, and update the first order auto-covariance accordingly. For the non-updated segment x[u : n], we only need to eliminate the influence of the points from 11:00:04 to 11:00:07 in page 2, referring to Proposition 4.8.
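A sketch of Proposition 4.8 (0-based indexing; names are illustrative, and the raw head values are read from the overlapped range only):

```python
# Shrink the auto-covariances of x[0..n-1] to the suffix x[u..n-1] by
# eliminating the lag products of the discarded head (Proposition 4.8).
def shrink_head(gamma, x, u, p):
    n = len(x)
    m = n - u                                            # kept suffix length
    out = [(n * gamma[0] - sum(v * v for v in x[:u])) / m]
    for i in range(1, p + 1):
        head = sum(x[l] * x[l + i] for l in range(u))    # discarded products
        out.append(((n - i) * gamma[i] - head) / (m - i))
    return out
```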

IMPLEMENTATION IN LSM-TREE STORE
In this section, we focus on the implementation of our proposal in an LSM-Tree based database. Section 5.1 proposes the algorithm for learning models on multi-segment time series. The corresponding complexity analysis of the algorithm is given in Section 5.2. Besides, we introduce the system deployment in Section 5.3.
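As a minimal sketch of the learning algorithm over multi-segment time series (Section 5.1), assuming pages sorted by StartTime; classify() follows Section 4, and the merge_* helpers are hypothetical wrappers around the propositions:

```python
# Overall flow of learning over pages in the LSM store (illustrative sketch;
# merge_adjacent / merge_disjoint / merge_overlapped wrap the propositions).
def learn_lsm_ar(pages, interval, p):
    agg = pages[0]                  # running aggregate: gammas + boundary data
    for page in pages[1:]:
        case = classify(agg.end_time, page.start_time, interval)
        if case == "adjacent":
            agg = merge_adjacent(agg, page, p)            # Proposition 4.1
        elif case == "disjoint":
            agg = merge_disjoint(agg, page, interval, p)  # Props. 4.3 / 4.5
        else:
            agg = merge_overlapped(agg, page, p)          # Props. 3.2, 4.7, 4.8
    return yule_walker(agg.gamma)   # O(p^2) solve, Equation (2)
```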

Performance Analysis
Consider M segments with N data points in each segment. Among the M segments, suppose there are Q segments overlapped with other segments, with average overlap length L_1, and R segments disjoint from other segments, with average disjoint length L_2.
For each pair of overlapped segments, our proposal takes O(p L_1) time for aggregation on average. For each pair of disjoint or adjacent segments, the i-th order auto-covariance γ_i can be aggregated in O(i) time. Thus, aggregating all the auto-covariances of a pair takes O(p²) time. Moreover, it takes O(p²) time to solve the Yule-Walker equations by utilizing the Levinson-Durbin algorithm. Thereby, the overall time cost is O(M p² + Q p L_1). The baseline learning process takes O(R L_2) time for imputation, and O(p (M N + R L_2)) for learning from scratch. The overall time cost of the baseline is thus O(R L_2 + p(M N + R L_2)) = O(p M N + p R L_2).

System Deployment
The autoregressive model learning method has been deployed and included as a function in Apache IoTDB [3], an open-source time series database management system. The documentation is available on the website [4]. By executing the following SQL statement, users can obtain the learned model coefficients of time series root.test.d0.s0.
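The exact function name and parameters are given in the documentation [4]; a statement of the following shape (the function name ar and the parameter key p here are illustrative, not necessarily the registered ones) returns the learned coefficients:

    select ar(s0, 'p'='2') from root.test.d0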

EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the efficiency of our proposal, including (1) scalability in data sizes in Section 6.2, (2) evaluation with different model orders, i.e., query parameter p, in Section 6.3, and (3) evaluation under various data loads, including different page sizes in Section 6.4, different disjoint lengths in Section 6.5, and different overlap lengths in Section 6.6.

Setup
We implement the baseline autoregressive model for comparison. The baseline loads all possibly overlapped pages, merges data online, and then learns the model parameters from scratch, i.e., without utilizing the pre-computed statistics. Since the result of our proposed LSMAR is exactly the same as that of the baseline, the evaluation mainly focuses on the learning efficiency. Table 2 lists the six datasets used in the evaluation, with the first three private datasets collected from our industrial partners, and the last three public datasets. The default model order p is also listed in Table 2, determined by the pattern of each dataset. The default page size is set to 1024 for the experiments in Sections 6.2 and 6.3, and 10240 for the experiments in Sections 6.5 and 6.6.
All the experiments run on a machine with an Intel Core CPU (8 cores, 2.3 GHz) and 16 GB of memory, with Apache IoTDB v0.13.3 installed. The algorithm code has been included in the system repository of Apache IoTDB [1]. The experiment-related code is available in [2].

Scalability in Data Sizes
Figure 7 reports the time cost under different data sizes. For each dataset, we linearly vary the number of data points involved in queries and measure the corresponding time costs. When the number of data points increases, the baseline needs to learn models from more data points, leading to higher time cost. However, LSMAR only needs to aggregate more pages. Since the number of pages is much smaller than the number of data points (1 page contains 1024 data points, as stated in Section 6.1), the time cost of LSMAR increases much more slowly than that of the baseline with the increase of data size. Our proposal shows great efficiency compared to the baseline, with 1-2 orders of magnitude improvement.

Evaluation with Different Model Orders

Figure 8 evaluates the algorithm performance under different model orders in the query. When the model order p increases, the baseline takes more time to calculate the auto-covariances γ_0, γ_1, . . . , γ_p, each in O(n) time. Thereby, the time cost of the baseline increases with the model order p. The time cost of our proposal also slightly increases with the model order p, while remaining much lower than that of the baseline. This is because LSMAR just needs to aggregate each pair of consecutive pages in O(p²) time, and the number of pages M is far smaller than the number of data points N, which is consistent with the analysis in Section 5.2.

Evaluation with Different Page Sizes
Figure 9 evaluates the performance under different page sizes. Page size is a configuration parameter of the LSM-Tree based storage, which determines the number of data points in a page. Note that for a fixed data size, the larger the page size, the smaller the number of pages. Therefore, with the increase of page size, the number of pages M decreases, and the time cost of LSMAR thus decreases. However, the baseline method merges data online and learns models from scratch regardless of the configuration. Thereby, its time complexity is not affected by the page size, and the time cost of the baseline method in Figure 9 remains constant.

Evaluation with Different Disjoint Length

Figure 10 varies the disjoint length between two consecutive pages. The baseline method needs to first impute all the missing values, and then learn on the imputed time series, so its time cost grows with the disjoint length. However, our proposed LSMAR can aggregate the disjoint pages directly by Propositions 4.3 and 4.5, utilizing the closed form of linear interpolation rather than materializing the imputed points.

Evaluation with Different Overlap Length
Figure 11 varies the overlap length between two consecutive pages. When the overlap length increases, the time cost of LSMAR increases, since it takes more time to calculate the auto-covariances of the updated and non-updated segments, as stated in Section 4.3. However, the baseline always merges all data points into one series regardless of the overlap length, leading to the constant time cost in Figure 11. Remarkably, though LSMAR takes more time to handle overlapped pages with the increase of overlap length, it is still much more efficient than the baseline method, owing to the utilization of the metadata.

RELATED WORK

Autoregressive Model
Autoregressive models [8] are widely applied in forecasting and detection. There are many extensions and variations of autoregressive models, including ARX [13], IMR [26], ARIMA [8], SARIMA [8], ARFIMA [14] and vector ARIMA [24]. On the basis of the AR model, ARX [13] further utilizes exogenous inputs to improve the performance. ARIMA [8] combines both the autoregressive process and the moving average process with integration. SARIMA [8] further considers the seasonal effect. ARFIMA [14] extends the integration order in ARIMA from integers to fractions, for a better forecasting performance on time series with long-range dependency. Vector ARIMA [24] can detect additive outliers, innovational outliers, level shifts, and temporary changes in multivariate time series. As a basic method in machine learning, AR is simple enough to learn when flushing batched data to disk, and we thus implement the autoregressive model in Apache IoTDB as a database-native machine learning operator. We do not implement ARX and IMR for evaluation, since they both require labeled data, which is difficult to maintain in the LSM-Tree based store. The learned coefficients of autoregressive models can be utilized for a variety of downstream tasks, such as forecasting, detection and clustering. STIFF [19] utilizes the autoregressive model for local model construction. Chakraborty et al. [9] consider extracted real-world events in ARIMA for forecasting. Toledano et al. [23] utilize ARIMA for anomaly detection in their anomaly detection system, Anodot. Bagnall and Janacek [7] propose to cluster time series based on the coefficients learned from ARIMA.

LSM-Tree based Storage
The Log-Structured Merge-Tree (LSM-Tree) [20] can handle extensive writing workloads for time series, especially for IoT data. Thus, it is often employed in time series databases, such as InfluxDB [5] and Apache IoTDB [3]. Database-native machine learning operators on LSM-Tree based storage aim to accelerate the learning process by utilizing the properties of the LSM-Tree. Absalyamov et al. [6] propose a novel lightweight approach for data synopses, including histograms and wavelets. LDI [22] learns the data distribution in LSM storage to improve insertion performance. However, machine learning operators for forecasting and detection are rarely investigated.

CONCLUSION
In this paper, we propose an efficient method LSMAR for learning AR models in an LSM-Tree based store. Data points are batched into different pages in such storage, and later received points may overwrite previously received points, owing to out-of-order transmission, missing value imputation, data repairing and so on. Thereby, the straightforward method is to merge the scattered data online and learn from scratch, while our LSMAR utilizes the pre-computed information in the LSM-Tree based store. We derive several propositions to guarantee the correctness of aggregating the pre-learned models in different pages. Remarkably, the algorithm for learning models on multi-segment time series has been deployed as a function of Apache IoTDB, an LSM-Tree based time series database. We conduct extensive experiments in the system, where our proposal shows high efficiency compared to the baseline in every evaluated aspect.

Figure 1: Example of imputed, repaired, delayed and retransmitted data in LSM-Tree store for learning AR models

Figure 3: Merging two pages in Figure 1 as one series

Figure 4: Example for aggregation of adjacent pages

Figure 5: Example for aggregation of disjoint pages

Aggregation process of two overlapped pages

Figure 6: Example for aggregation of overlapped pages

Figure 7: Time costs under different data sizes n

Figure 8: Time costs under different model orders p


Figure 9: Time costs under different page sizes

Figure 10: Time costs under different disjoint length

Table 2: Dataset statistics and learning settings