Time Series Representation for Visualization in Apache IoTDB

When analyzing time series, often interactively, analysts frequently demand instant visualization of large-scale data stored in databases. M4 visualization selects the first, last, bottom and top data points in each pixel column to ensure the pixel-perfectness of the two-color line chart visualization. While M4 precisely encases time series of different scales into a fixed number of pixels, efficient support for M4 representation in a native time series database is still absent. It is worth noting that, to enable fast writes, commodity time series database systems, such as Apache IoTDB or InfluxDB, employ LSM-Tree based storage. That is, a time series is segmented and stored in a number of chunks, with possibly out-of-order arrivals, i.e., disordered on timestamps. To implement M4, a natural idea is to merge the chunks online into a whole series, with a costly merge sort on timestamps, and then perform the M4 representation as in relational databases. In this study, we propose a novel chunk merge free approach called M4-LSM to accelerate M4 representation and visualization. In particular, we utilize the metadata of chunks to prune and avoid the costly merging of any chunk. Moreover, intra-chunk indexing and pruning are enabled for efficiently accessing the representation points, referring to the special properties of time series. Remarkably, the time series database native operator M4-LSM has been implemented in Apache IoTDB, an open-source time series database, and deployed in companies across various industries. In experiments over real-world datasets, the proposed M4-LSM operator demonstrates high efficiency without sacrificing preciseness.


INTRODUCTION
Time series representation is a typical data mining task [18] that reduces the dimensionality while still retaining essential characteristics such as shape in visualization. M4 representation [25] is known as an error-free method for visualizing time series in a two-color (binary) line chart. It encases time series of various scales into fixed-size pixels. For instance, Figure 1(a) shows the line chart of a time series of 1.2 million data points in 1000 × 500 pixels. The time series is divided into 1000 time spans, corresponding to 1000 pixel columns.

Motivation
Note that M4 is originally designed for visualizing time series data stored in relational databases, reading data points ordered by time. Different from RDBMS, to enable fast writes, commodity time series database systems, such as Apache IoTDB [6] or InfluxDB [9], employ Log-Structured Merge-Tree (LSM-Tree) [37] based storage. That is, a time series is segmented and stored in a number of chunks, denoted by rectangles in Figure 2(a). As shown, data points in the same time period may be stored in different chunks, owing to out-of-order arrivals [26]. Consequently, the data points read from chunks may not be in the order of time either. Efficient support for M4 representation in such native LSM-Tree based time series databases is still absent. A straightforward idea is to first merge all the chunks online into a whole series, and apply M4 representation over the data points ordered by time as in RDBMS [25]. This baseline implementation could still be costly, as it loads all chunks from disk, orders the data points by time, and scans the entire time series, as in Figure 2(b). Although M4 representation greatly reduces the time cost by avoiding transferring and rendering all the raw data points, as illustrated in Figure 3, the cost of computing the representation points in the database server becomes the bottleneck (34.31s out of 37.11s in total). In order to increase the productivity of a visualization system, it is always desirable to further reduce the response time [34].

M4-LSM Approach
In this study, we propose a novel chunk merge free approach M4-LSM in Section 3 to accelerate the M4 representation. It considers inter-chunk pruning in Section 4 and accelerates intra-chunk accessing in Section 5. (1) Note that the metadata of chunks can be used to avoid loading and merging chunks. Intuitively, if the candidate points for the first, last, bottom and top representations obtained from metadata are neither updated by other chunks nor deleted by delete operations, we can directly return them as results. For example, in Figure 2(c), the candidate point TP(C_1) for the TopPoint representation is obtained from chunk metadata and verified as the latest (i.e., neither updated nor deleted). Therefore, we can directly return TP(C_1) as the result of the TopPoint representation and do not need to load and merge any chunk data. (2) Nevertheless, for those chunks that cannot be pruned and whose raw data need to be accessed, we observe the regular intervals of timestamps in time series and introduce a step regression for efficient indexing. Moreover, we use a value regression function to prune the points that cannot be the top or bottom ones.
As shown in Figure 3, to visualize a time series of 100 million points, the database server computation time is reduced from 34.31s with M4 to only 2.75s with M4-LSM, comparable to the communication cost. It enables fast visualization of large-scale time series data, without sacrificing preciseness. More extensive experiments over real-world datasets are reported in Section 8 to demonstrate the efficiency of the chunk merge free operator M4-LSM.

System Deployment
The M4-LSM operator has been implemented in Apache IoTDB [42], an open-source LSM-Tree database. It becomes a built-in function of the system, with the documentation available on the product website [7]. The source code of M4-LSM has been committed to the GitHub repository of Apache IoTDB by the system developers [3]. The code and data of the experiments are available [4] for reproduction.
Remarkably, the system with the visualization function has been deployed and used in many companies across various industries, including rail transit, steel manufacturing, aviation, cloud services, and so on. For example, in fault diagnosis during train maintenance, our proposed solution is used to visualize vibration signals at a frequency of about 100Hz in Grafana. In steel manufacturing, domain experts inspect the visualized time series of temperatures to explore the potential gaps among different stages. In the aviation industry, our proposed solution is used to visualize and compare the performance metrics of different parts, with data collection frequencies as high as 20kHz to 400kHz. In cloud services, our solution is employed to visualize multiple metrics on the same dashboard for fault diagnosis of application performance. Please see Section 7 and [5] for more details on the use cases.

Contributions
We highlight the contributions in both research novelty and system deployment.
(1) We formalize the problem of accelerating M4 queries over LSM-Tree based storage (Section 3). The novel idea is to leverage metadata to prune chunks, with a candidate generation and verification mechanism (Section 4). Moreover, we devise time- and value-specific regression techniques to accelerate the access to chunks that cannot be pruned (Section 5).
(2) We present the deployment of the proposed M4-LSM approach in Apache IoTDB, without merging any chunk (Section 6). We also introduce a specific application to illustrate the challenges of visualizing large-scale time series, and how M4-LSM tackles the problem (Section 7).
(3) We conduct extensive experiments over real-world datasets (Section 8). The proposed M4-LSM operator demonstrates high efficiency without sacrificing preciseness. It takes about 4 seconds to represent a time series of 127 million points in 1000 pixel columns, enabling instant visualization of four years of data collected at a frequency of every second.

PRELIMINARIES
We first introduce the M4 representation in Section 2.1, and then present the LSM-Tree storage of time series in Section 2.2. Table 1 lists the frequently used notations.
Definition 1 (M4 representation functions). Given a time series T, the M4 representation functions FP, LP, BP and TP return the first, last, bottom and top points of T, respectively. According to [25], the inter-column pixels are determined by both the times and values of the first and last points, while the inner-column pixels rely only on the values of the bottom and top points. In this sense, any point with the minimal (or maximal) value may serve as BP (or TP) from the visualization-driven perspective.
Example 1. Given a time series T in Figure 4, the four representation functions FP(T), LP(T), BP(T) and TP(T) return four representation points P_first = (firstTime, firstValue), P_last = (lastTime, lastValue), P_bottom = (bottomTime, bottomValue) and P_top = (topTime, topValue), respectively, marked with bold red dots. The minimal bounding rectangle of T is also plotted in the figure. Note that the four representation points contain more information than the minimal bounding rectangle (i.e., firstValue and lastValue).
Below, we use G ∈ {FP, LP, BP, TP} as a general notation of the four representation functions.
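To make the four functions concrete, the following is a minimal sketch of computing the M4 representation over a series, assuming points are (time, value) tuples; the function name and grouping scheme are illustrative, not the IoTDB implementation.

```python
def m4_representation(points, t_start, t_end, k):
    """Split [t_start, t_end) into k equal time spans and return, per span,
    the (first, last, bottom, top) points, or None for an empty span."""
    span = (t_end - t_start) / k
    groups = [[] for _ in range(k)]
    for t, v in points:
        if t_start <= t < t_end:
            i = min(int((t - t_start) / span), k - 1)
            groups[i].append((t, v))
    result = []
    for g in groups:
        if not g:
            result.append(None)
            continue
        first = min(g, key=lambda p: p[0])    # earliest timestamp
        last = max(g, key=lambda p: p[0])     # latest timestamp
        bottom = min(g, key=lambda p: p[1])   # minimal value
        top = max(g, key=lambda p: p[1])      # maximal value
        result.append((first, last, bottom, top))
    return result
```

With k equal to the number of pixel columns, each span yields at most four points, which suffice to render the two-color line chart without error [25].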

LSM-Tree based Storage of Time Series
To enable fast writes, each time series is stored as a set of chunks (containing inserts and append-only updates) together with a set of deletes (containing append-only deletes) in the LSM-Tree store.
That is, time series data are not readily available without applying such updates and deletes.

Elements of LSM-Tree Storage.
A version number v is a global incremental number assigned to each chunk or delete to distinguish the append order of updates and deletes. The larger v is, the later the operation applies.
Definition 3 (Chunk). A chunk is a segment of the time series that is read-only on disk, denoted by C_v, where v is the version number.
When the memory component of the LSM-Tree meets the flush trigger condition (such as reaching a threshold size), the time series in memory is flushed to a new location on disk, i.e., a chunk. Each chunk maintains its own metadata, i.e., {G(C_v) | G ∈ {FP, LP, BP, TP}}.
We say a timestamp t is covered by a chunk C_v, denoted by t ⊨ C_v, if there exists a point p ∈ C_v such that t = p.t.
Definition 4 (Delete). A delete D_v specifies a time range to delete in the time series, where v is the version number.
By default, we denote [D_v.tl, D_v.tr] as the time range of delete D_v, where D_v.tl and D_v.tr are the left and right endpoints of the range, respectively. We say a timestamp t is covered by a delete D_v, denoted by t ⊨ D_v, if D_v.tl ≤ t ≤ D_v.tr. Example 2. Figure 5 gives two ways to better understand the relationship between chunks and deletes. In Figure 5(a), each version plane is plotted as a gray translucent rectangle, corresponding to a unique version number. Each chunk or delete is drawn on its own version plane, where C_1 and C_3 are represented by their minimum bounding rectangles and D_2 is represented by the slashed region covering its delete time range. Figure 5(b) collapses the version dimension, stacking chunks and deletes in the two-dimensional time-value space. From the version numbers of C_1, D_2, C_3 and the relationships between their time ranges, we know that (1) C_1, D_2 and C_3 are flushed to disk in turn; (2) C_3 might contain some updates to C_1; and (3) D_2 only works on C_1 but not C_3, because C_3 has a larger version number than D_2.

Merge Function.
To get the time series with only the latest points, we formally define the merge function as follows.
Definition 5 (Merge function). Given a set of chunks C and a set of deletes D, the merge function M(C, D) returns the time series containing only the latest point at each timestamp, where C_∞ denotes an empty chunk with the largest version number, and D_∞ an empty delete with the largest version number.
That is, a point p in the merged time series comes from some chunk C_v ∈ C such that p.t is not covered by any chunk or delete whose version number is higher than v.
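The semantics of Definition 5 can be sketched as follows, assuming each chunk is a (version, {timestamp: value}) pair and each delete a (version, (t_left, t_right)) pair; these data structures are illustrative only.

```python
def merge(chunks, deletes):
    """Return the latest value per timestamp: a point from a chunk with
    version v survives unless a chunk or delete with a larger version
    covers its timestamp."""
    latest = {}  # timestamp -> (version, value)
    for v, data in chunks:
        for t, val in data.items():
            # a later appended chunk overwrites the point at the same timestamp
            if t not in latest or latest[t][0] < v:
                latest[t] = (v, val)
    merged = {}
    for t, (v, val) in latest.items():
        # drop the point if a later appended delete covers its timestamp
        if not any(dv > v and lo <= t <= hi for dv, (lo, hi) in deletes):
            merged[t] = val
    return dict(sorted(merged.items()))
```

For instance, a delete with version 2 removes a point written by a chunk with version 1, but not a point at the same timestamp rewritten by a chunk with version 3.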

M4-LSM APPROACH
In this section, we give an overview of M4-LSM, an efficient approach to perform M4 representation on LSM-Tree storage.

Problem Statement
With the M4 representation query on time series and the LSM-Tree storage of time series introduced in Section 2, we now combine them to give the formal definition of the problem of performing M4 representation on LSM-Tree storage.
Definition 6 (M4 representation on LSM-Tree storage). For a time series T with a set of chunks C and a set of deletes D, given the query parameters of time range [tq_l, tq_r] and the number of time spans k, the problem is to compute {G(M(C, D')) | G ∈ {FP, LP, BP, TP}} for each time span I_i, i = 1, ..., k, where D' extends D with two deletes whose time ranges form the complement of I_i. Also note that the version number of these two deletes is infinity, larger than that of any chunk or delete; thus, as shown in Figure 2, the points outside the time span are excluded from the merge.

Solution Overview
Algorithm 1 shows an overview of M4-LSM. The key observation is that we do not need to load and merge chunks if a point can simply be retrieved from the chunk metadata. For example, in Figure 2(c), the representation result of TP(T) is TP(C_1), because TP(C_1) has the maximal value and is the latest, i.e., neither deleted nor updated. Thereby, for each time span I_i, the algorithm iteratively generates the candidate point P_G from chunk metadata (line 8) as in Section 4.1, and verifies whether P_G is the latest (line 9) as in Sections 4.2 and 4.3, for each representation function G. If the candidate point is non-latest, i.e., deleted or updated, we employ the chunk lazy loading strategy in line 11 to update chunk metadata before entering the next round of candidate generation and verification. That is, instead of eagerly loading the chunk to immediately recalculate the chunk metadata under deletes or updates, the idea is to bound chunk metadata by the delete time range as in Section 4.2, or to verify the candidates of other chunks first as in Section 4.3. Both candidate verification and generation may need to access specific data points in chunks, if they cannot be pruned by metadata. Simply scanning chunks is obviously costly. Note that when a chunk is loaded in memory, its points are sorted by timestamps. Hence, the data read operation (CT) in line 9 checks whether a data point exists at a given timestamp, while (GT) in line 15 gets the closest data point after/before a given timestamp, in an array of sorted timestamps, as described in Section 5.1. Moreover, (MV) in line 17 gets the undeleted data point with the minimal/maximal value, as described in Section 5.2. Figure 6 is an example of the algorithm steps. It shows an overview of the inter-chunk pruning process in Section 4. Moreover, we also illustrate the steps where the intra-chunk accessing operations (CT), (GT) and (MV) in Section 5 are applied.
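The generate-verify-reload loop of Algorithm 1 can be abstracted as follows; the three callables stand in for the metadata-based candidate generation (Section 4.1), the verification against later chunks and deletes (Sections 4.2 and 4.3), and the lazy loading that refreshes metadata. All three are placeholders rather than the actual operator code.

```python
def resolve_representation(generate, is_latest, lazy_load):
    """Iterate candidate generation and verification until a candidate
    is verified as the latest (or no candidate remains)."""
    while True:
        p = generate()        # best candidate from current chunk metadata
        if p is None:
            return None       # all candidates exhausted
        if is_latest(p):
            return p          # verified: output as the representation result
        lazy_load(p)          # update metadata lazily, then retry
```

Since updates of sensor readings rarely occur in IoT scenarios, most candidates are verified in the first round and the loop terminates quickly.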

INTER-CHUNK PRUNING
In this section, we introduce how M4-LSM utilizes chunk metadata to avoid merging chunks and to prune the chunks to load. To compute G(M(C, D')) for a representation function G, M4-LSM first generates the candidate point from the precomputed chunk metadata (introduced in Section 2.2.1). Next, it performs candidate verification to check whether the candidate point is the latest or not. If it is the latest, M4-LSM outputs it as the representation result; otherwise, M4-LSM applies a lazy loading strategy to load chunks and update chunk metadata, preparing for the next iteration of candidate generation and verification. It is worth noting that updates of sensor reading values rarely occur in IoT scenarios. That is, most data points should be the latest, and the iteration is expected to terminate shortly. The candidate generation is described in Section 4.1. The candidate verification for the FP/LP and BP/TP representation functions is presented in Sections 4.2 and 4.3, respectively, as they require different candidate verification rules and chunk loading strategies.

Candidate Generation
M4-LSM first retrieves the candidate point P_G for the representation function G from the metadata of all chunks in C. For each G, among the points suggested by the metadata of all chunks in C, we need to find the one satisfying the representation condition, e.g., the point with the maximal value for TP, as the candidate P_G.

Lazy Load. When P_G is verified to be non-latest, i.e., overlapped in time by some later appended delete, M4-LSM does not eagerly load the chunk C_v to which P_G belongs and recalculate its metadata under deletes. Instead, it updates FP(C_v).t or LP(C_v).t by the delete time range [D.tl, D.tr]. The updated time interval of C_v, [FP(C_v).t, LP(C_v).t], might not be tight but can be used to prune C_v from being loaded. For example, if any other chunk has its first point earlier than FP(C_v).t (or at FP(C_v).t with a larger version number than v), then C_v does not need to be loaded thus far. If no such chunk exists, the load of C_v happens in the next iteration of candidate generation. The chunk data are read by operation (GT) and accelerated by the time indexing in Section 5.1. Example 3. In Figure 7(a), the candidate point P_G is verified to be the latest and is directly returned as the result of G(M(C, D')).
To verify whether the candidate P_G is the latest, M4-LSM first checks whether the time of P_G overlaps with the later appended chunks or deletes (i.e., chunks and deletes with larger version numbers than P_G.v). Referring to Proposition 2, there are three cases to consider. (1) If P_G is not in the time interval of any later appended chunk or delete, i.e., (∀C_v ∈ C with v > P_G.v, P_G.t ∉ [FP(C_v).t, LP(C_v).t]) ∧ (∀D_v ∈ D' with v > P_G.v, P_G.t ⊭ D_v), then P_G is the latest and M4-LSM finishes the computation by returning P_G. (2) If P_G is indeed in the time range of some later appended delete, similar to the FP/LP verification in Section 4.2, P_G is non-latest as it is deleted. (3) If P_G is in the time interval of some later appended chunk, then P_G might still be the latest and needs further verification. The reason is that being within the chunk time interval does not necessarily mean the point is overwritten (i.e., updated). That is, P_G.t ∈ [FP(C_v).t, LP(C_v).t] does not necessarily mean P_G.t ⊨ C_v. The chunk data need to be read for verification by operation (CT) and accelerated by the time indexing technique in Section 5.1.
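The three verification cases can be sketched as a small predicate, assuming a candidate is a (timestamp, version) pair, a chunk's metadata a (version, first_time, last_time) triple, and a delete a (version, t_left, t_right) triple (all illustrative structures, not the operator's actual types):

```python
def verify_candidate(p, chunk_meta, deletes):
    """Return 'deleted' (case 2), 'verify' (case 3: a chunk read via (CT)
    is still needed), or 'latest' (case 1)."""
    t, version = p
    for dv, lo, hi in deletes:
        if dv > version and lo <= t <= hi:
            return 'deleted'      # covered by a later appended delete
    for cv, t_first, t_last in chunk_meta:
        if cv > version and t_first <= t <= t_last:
            return 'verify'       # inside a later chunk's time interval
    return 'latest'
```

Only the 'verify' outcome requires touching chunk data; the other two are decided from metadata alone.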
Lazy Load. If P_G is found non-latest owing to some later appended deletes or updates, the corresponding chunk does not need to be loaded eagerly, as we can further verify the remaining points in P'_G \ {P_G} for BP/TP. M4-LSM iterates this verification process until a candidate point P_G is verified to be the latest, or all points in P'_G are non-latest. In the latter case, M4-LSM loads all the corresponding chunks to which the points in P'_G belong and recalculates their metadata under deletes or updates. Afterwards, M4-LSM starts the next iteration of generating and verifying candidate points, as described in Sections 4.1 and 4.3. Again, the chunk data are read by operation (MV) and accelerated by the point pruning technique in Section 5.2. For example, chunks C_4 and C_5 are read to check whether they contain any point that overwrites P_G. Assume that the read of C_4 and C_5 does find a point that overwrites the current candidate point P_G = TP(C_3). With the lazy loading strategy, M4-LSM considers the remaining points in P'_G = {TP(C_1), TP(C_3)} except the non-latest TP(C_3), and assigns TP(C_1) as the new candidate point for verification. Because TP(C_1) is the latest, M4-LSM outputs TP(C_1) as the result of TP(M(C, D')).

INTRA-CHUNK ACCESSING
In this section, we introduce the intra-chunk indexing and pruning techniques to speed up the chunk data accessing operations used in M4-LSM. They accelerate the three types of chunk data read operations (CT), (GT) and (MV) for verifying and generating candidates in Section 4. Machine learning techniques [45] are incorporated into M4-LSM. For example, we learn a step regression for time indexing in Section 5.1. It predicts the position of the target timestamp in the sorted array. Moreover, we use error-bounded value regression for point pruning in Section 5.2. It bounds the prediction error and thus can be used to prune points that cannot be the top/bottom ones.

Time Index With Step Regression
In the following, we observe the regular intervals of timestamps and introduce a step regression for efficient indexing on timestamps, accelerating the data read operations (CT) and (GT). Different from the existing learned indexes on an arbitrary cumulative distribution function (CDF) [29, 30, 35], we notice the distinct step features in timestamp-position relationships. Figure 8 illustrates the timestamp-position maps extracted from four real-world datasets. The steep part of the step, e.g., [t_1, t_2) in Figure 8(c), stems from the fact that sensor devices usually collect data with a preset frequency, while the flat part, e.g., [t_2, t_3) in Figure 8(c), reflects occasional gaps due to issues such as transmission interruption [19]. Therefore, we introduce the step regression function to model such timestamp-position steps. Intuitively, a step regression function has two alternating kinds of segments, tilt and level, corresponding to a fixed positive slope and a slope of zero, respectively.
Definition 8 (Step regression). The step regression function f: [FP(C_v).t, LP(C_v).t] → [1, |C_v|] models the map from the timestamp of a data point to its relative position in the chunk, where K is the fixed positive slope, the intercepts are determined by the split timestamps S = {s_1, ..., s_m}, and an indicator function over the segment intervals selects the segment that applies to a given timestamp. The function is a variation of the canonical form K × t + b, with tilt and level segments alternating at the split timestamps. Note that the first segment is tilt by default. The first and last points in the chunk always have the minimal and maximal output positions, respectively. Proposition 3 (FP/LP position). The step regression function of a chunk C_v always has f(FP(C_v).t) = 1 and f(LP(C_v).t) = |C_v|.
Given K and S, the step regression function is fully determined.In the following, we provide a heuristic method to learn the parameters K and S of the step regression function.

Learning Slope K.
Referring to the regular data collection frequency, the slope K is estimated by the median of the slopes given by each pair of consecutive points, i.e., the reciprocal of the median timestamp delta. Example 6. Figure 9 plots the deltas of timestamps extracted from the data in Figure 8(c). The median timestamp delta is 50ms, denoted by the dotted line in Figure 9(b), i.e., collecting data in every 50ms. The slope is K = 1/50.
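The slope heuristic can be sketched as follows, assuming the timestamps are sorted (the function name is illustrative):

```python
import statistics

def learn_slope(timestamps):
    """Estimate the tilt slope K as the reciprocal of the median delta
    between consecutive timestamps; the median is robust to the
    occasional large gaps that form the level segments."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return 1 / statistics.median(deltas)
```

For the 50ms example above, a few large gaps do not disturb the median, and the estimate is K = 1/50 = 0.02.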

Learning Split Timestamps S.
The idea is to first select changing points in the chunk based on statistics, then calculate the intercept for each segment of the step regression function, and finally derive the split timestamps by intersecting two adjacent segments.
Select Changing Points P_c. Changing points are selected by applying the 3-sigma rule to the deltas of timestamps. As illustrated in Figure 9(a), whenever the delta changes from below the threshold to above the threshold or vice versa, the pivot data point is selected as a changing point.

Calculate Intercepts. According to Proposition 3, the first and the last segments of the step regression function should have f(FP(C_v).t) = 1 and f(LP(C_v).t) = |C_v|, respectively, which determines their intercepts as defined in Section 5.1.1. For the other m − 3 segments, let the j-th segment have f(p_j.t) = pos(p_j), where p_j is the (j − 1)-th point in P_c, 2 ≤ j ≤ m − 2; the intercept b_j of the j-th segment is then determined accordingly.

Derive Split Timestamps S. Finally, the split timestamps S = {s_1, ..., s_m} are derived by intersecting each pair of adjacent segments.
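Once K and S are learned, evaluating the step regression amounts to walking the alternating tilt/level segments. The sketch below assumes the first segment is tilt, positions start at 1, and t0 is the chunk's first timestamp (function and parameter names are illustrative):

```python
def step_regression(t, t0, K, splits):
    """Evaluate f(t) by accumulating K * (segment length) over the tilt
    segments up to t; level segments contribute nothing. splits must be
    sorted and mark where tilt and level segments alternate."""
    pos = 1.0
    prev = t0
    tilt = True
    for s in splits + [t]:
        seg_end = min(s, t)
        if tilt:
            pos += K * (seg_end - prev)   # position grows on tilt segments
        prev = seg_end
        tilt = not tilt
        if seg_end >= t:
            break
    return round(pos)
```

With t0 = 0, K = 1/50 and splits = [100, 300], the function climbs on [0, 100), stays level on [100, 300), and climbs again afterwards, matching the step shapes of Figure 8.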

Point Pruning with Value Regression
In the following, we introduce point pruning for obtaining the bottom and top points by the data read operation (MV). Referring to Proposition 2, for G ∈ {BP, TP}, there are two cases for a candidate point P_G to be verified as non-latest. That is, P_G may be deleted by some later appended deletes, or be overwritten by some later appended updates. We unify the two cases into deletes on P_G and formalize the data read operation (MV) as follows.
Definition 9 (BP/TP recalculation). For G ∈ {BP, TP}, given the set of chunks C, the set of deletes D', and the chunk C_v to which the non-latest candidate point P_G belongs, the recalculation of the chunk metadata is to compute G(M({C_v}, D'')), where D'' extends the later appended deletes in D' with D_∞^{P_G}, the delete with delete time range [P_G.t, P_G.t] and version number of infinity. A straightforward idea is to iterate over all the points in the chunk and apply the deletes along the way, to find the point with the min/max value. Instead, we propose to use value regression to prune the positions that are impossible for the minimum/maximum. The regression model should have a deterministic error bound guarantee [31, 48], i.e., |g(i) − v_i| ≤ ε for the value v_i of the i-th data point, where ε is the deterministic error bound.
For simplicity, we denote the lower and upper bounds for the value of the i-th data point as L(i) = g(i) − ε and U(i) = g(i) + ε, respectively. Then, we have the following for pruning.
Proposition 4 (Point pruning). Given a data point p_q ∈ C_v that satisfies p_q.t ⊭ D, ∀D ∈ D'', for any data point whose lower bound is larger than U(q), its value must be greater than that of BP(M({C_v}, D'')) and thus it can be pruned when recalculating BP. Likewise, for any data point whose upper bound is smaller than L(q), its value must be smaller than that of TP(M({C_v}, D'')) and thus it can be pruned when recalculating TP.
Example 8. Take Figure 10 as an example. The value regression function g(i) is composed of 18 linear segments and 19 segment points. For G = TP, suppose that the candidate point P_G at some position of the chunk is overwritten by some later appended chunks. Therefore, we have P_G.t ⊨ D_∞^{P_G}. To recalculate TP, we first find the segment point (q, v_q), which is the point with the maximal value among all non-deleted segment points. Then, according to Proposition 4, we take L(q), the lower bound of v_q, as the pruning threshold. Next, the pruning intervals of positions that satisfy U(i) < L(q) can be calculated analytically by comparing the linear segments with the threshold. The three regions indicating pruning intervals are colored in green in the figure. Finally, we only need to iterate over the non-pruned positions to recalculate TP as defined in Definition 9.
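The pruning for the TP case in Proposition 4 can be sketched as follows, assuming predict(i) is an error-bounded model with |predict(i) − v_i| ≤ eps and threshold is the lower bound L(q) of a known undeleted point (names illustrative):

```python
def candidate_positions_for_top(predict, eps, n, threshold):
    """Keep only the positions whose upper bound predict(i) + eps can
    still reach the threshold; all other positions satisfy U(i) < L(q)
    and are pruned from the TP recalculation."""
    return [i for i in range(n) if predict(i) + eps >= threshold]
```

In practice the pruned intervals are computed analytically per linear segment rather than position by position, but the retained set of positions is the same.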

SYSTEM DEPLOYMENT
This section describes the system deployment of M4-LSM in Apache IoTDB [6]. The documentation of the M4 function is available on the product website [7]. The source code has been committed to the GitHub repository of Apache IoTDB by the system developers [3]. An overview of the deployment is shown in Figure 11. Let us first introduce some interfaces of the system in Section 6.1, upon which the deployment is conducted in Section 6.2.

System Overview
As illustrated in Figure 11, the storage of data in Apache IoTDB consists of TsFiles, carrying ChunkData and ChunkMetadata, as well as TsFile.mods recording the delete operations. Note that these delete operations will not be applied to modify the read-only TsFiles until the files are compacted into a new one, which is a typical strategy in LSM-Tree stores.
The system's built-in SeriesReader contains MetadataReader, DataReader and MergeReader. MetadataReader and DataReader are responsible for loading chunk metadata, chunk data, and delete data from disk, while MergeReader merges the chunks with possibly overlapping time intervals and data overwrites, referring to the version numbers. The delete operations, if any, are also applied to form a whole time series.

Deployment Details
We first implement the original M4 method [25] in Apache IoTDB for comparison. It reads the assembled time series directly from the system built-in SeriesRawDataBatchReader, and performs the representation computation on the time series. Note that ChunkMetadata is not accessed in the original M4 design. We implement a new MFGroupByExecutor for M4-LSM. Rather than reading the system-assembled time series, the M4-LSM implementation directly uses MetadataReader and DataReader. In this way, ChunkMetadata helps in pruning ChunkData, which saves both I/O and computation costs. Note that MFGroupByExecutor does not use MergeReader, i.e., it is merge free.

APPLICATION STUDY
In the intelligent operation and maintenance system of urban rail vehicles, Apache IoTDB manages the sensor data collected from trains. Each train has tens of thousands of sensors installed, measuring current, voltage, pressure, speed, acceleration, temperature, and so on. Twenty trains on an urban rail line generate about 48TB of time series data each year. Figure 12(a) presents an overview of the system deployment in the company. Note that directly visualizing such a huge volume of data is impractical. As shown in Figure 12(b), Grafana fails to visualize a time series with more than 20 million points (with an out-of-memory error) for fault diagnosis by domain experts. Python takes more than 265 seconds for 100 million points, which is also unacceptable in exploratory data analysis.
Unfortunately, existing time series visualization techniques such as M4 [25] cannot meet the efficiency requirement either. In the following, we present four tasks in the application to illustrate the challenges and how our proposal M4-LSM tackles the problem. In the first task, M4 can visualize only the data of the past day within the acceptable processing time of 5s. In contrast, our more efficient M4-LSM visualizes the data of 15 days with the same processing time in Figure 13(b). As shown, there is an irregular shift in voltage found on 02/13 (caused by ad-hoc erroneous transmission from another train). It is missed by M4, which visualizes only the data of the past day, given the requirement of short processing time.
Task 2: Cutting Fluid Pressure Diagnosis. High-pressure jet cutting fluid is used in railway equipment and spare parts processing factories. When some anomaly occurs, the domain experts need to visualize and inspect the historical pressure data, to diagnose whether it is caused by normal tool changing or by cutting fluid leaking. Again, the diagnosis query needs to respond in about 5 seconds, to ensure the work efficiency of domain experts. With such a time limit, as shown in Figure 14(b), our proposal M4-LSM can present the historical data of the past 7 days, and successfully illustrate that the spike on 09/28 is very different from those on 09/22 and 09/23. Domain experts highly suspect that it is caused by cutting fluid leaking rather than normal tool changing. Unfortunately, given the 5-second query response time, the original M4 can only visualize the historical data of 12 hours, as illustrated in Figure 14(a). With such limited information, the domain experts are not able to diagnose whether it is a true anomaly occurring at noon on 09/28. Task 3: here, M4 can visualize only a limited period of data in 5 seconds. While regular pulses are observed, there is a large pulse at around 172s. With such limited data, domain experts cannot diagnose whether it is normal. Given a similar query time limit, our more efficient M4-LSM visualizes the data of 1.6 hours, and finds that the large pulse occurs regularly, about every 2 minutes. The regular pulses in Figure 15(a) and (b) are owing to different types of rail joints, at about every 100 and 1k meters, respectively, in the railway.
Task 4: Track Crack Width Analysis. In railway track health monitoring, the track crack width is analyzed. Domain experts need to visualize multiple time series of monitored cracks together, in order to find patterns. Figure 16(a) plots the DenseLines [36, 49] of 10 time series returned by M4 at a time, within the required response time of 1 minute. Unfortunately, no clear pattern is observed with such a limited number of crack time series. In contrast, Figure 16(b) visualizes 50 time series by our M4-LSM, with a similar response time. As shown, most cracks (in dark green) become larger in width from 2021-12 (winter). However, the crack widths do not drop significantly even in 2022-04, which needs further investigation of other factors like local temperature, track material types, etc.

EXPERIMENTS
In the experiments, we compare M4-LSM with the original M4 algorithm [25] implemented in Apache IoTDB. The experiments are conducted on a machine running Ubuntu 20.04.3 with 32GB DDR memory and an Intel Core i7 CPU @ 2.50GHz.
Table 2 gives a summary of the four real-world datasets used in the experiments. The BallSpeed dataset is 71 minutes of soccer monitoring data collected by a speed sensor in a soccer ball at the frequency of 2000Hz.

8.1.2 Varying Query Time Range.
We now consider another M4 query parameter, the query time range length. While different datasets have various data collection frequencies and total time ranges as shown in Table 2, a typical time series of 1 million points contains the data collected in two weeks with a data collection frequency of every second, often competent for visual analysis. The M4 representation query latencies of M4 and M4-LSM under different query time range lengths on the four datasets are shown in Figure 18.
With the increase of the query range length, the time costs of M4 and M4-LSM increase to varying degrees. The increase of M4 is significant: as more chunks are involved in the longer query time range, more disk I/O and CPU costs are spent to load and merge these chunks. The query latency of M4-LSM also increases, but much more slowly. The reason is that as the query time range length increases, the proportion of chunks split by M4 time spans decreases. While the chunks split by M4 time spans still need to be loaded, most other chunks can be pruned by the candidate generation and verification framework.
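The pruning idea above can be illustrated with a minimal sketch. Assuming each chunk's metadata carries its time interval and a precomputed `max` statistic (as TsFile-style chunk metadata does), a chunk that lies fully inside a time span and whose overall maximum is below the best attainable maximum can never contribute the span's top point. The function name and metadata layout here are illustrative assumptions, and the sketch ignores overwrites and deletes:

```python
def prune_for_top(chunks_meta, span_start, span_end):
    """Keep only chunks that could contribute the span's top point,
    using metadata alone (simplified sketch; ignores deletes/overwrites).
    Each metadata entry: {"minTime": ..., "maxTime": ..., "max": ...}."""
    # chunks overlapping the span at all
    overlapping = [c for c in chunks_meta
                   if c["maxTime"] >= span_start and c["minTime"] < span_end]
    # chunks fully contained in the span: their max statistic is attainable
    inside = [c for c in overlapping
              if span_start <= c["minTime"] and c["maxTime"] < span_end]
    if not inside:
        return overlapping
    best = max(c["max"] for c in inside)
    # any chunk whose overall max is below `best` cannot hold the span's top
    return [c for c in overlapping if c["max"] >= best]
```

Chunks split by a span boundary (partially overlapping) survive this filter unless their overall maximum is already dominated, which matches the observation that only boundary chunks still need loading.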
8.1.3 Varying Chunk Overlap Percentage. In addition to the M4 representation query parameters, how the data are written (updated and deleted) affects the LSM-Tree storage, and thus the query performance. One of the key issues is chunks overlapping in time intervals, incurring costly chunk loading and merging. In this experiment, we write the points in different orders, leading to various chunk overlap rates. The M4 representation query latencies of M4 and M4-LSM under different percentages of overlapping chunks are illustrated in Figure 19.
The latency of M4 increases as the overlap percentage increases. This is because merging more overlapping chunks incurs more CPU cost, although the I/O cost does not change. The query latency of M4-LSM is almost constant, owing to the merge-free strategy. No chunks need loading as long as the candidate point does not fall in the time interval of any later appended chunks or deletes. The CPU cost of candidate verification for BP/TP is saved with the time index.

8.1.4 Varying Delete Percentage. How the data are deleted also affects the LSM-Tree storage and thus the M4 representation query. In this experiment, we vary the frequency of delete operations. Figure 20 shows the M4 representation query latencies of M4 and M4-LSM under different delete percentages on the four datasets.
The query latency of M4 is almost constant despite the increasing number of deletes, thanks to the CPU-efficient delete sort operation [1] inherent in IoTDB. The overall time cost of M4-LSM is small. This is because the number of deleted candidate points is limited, given that the time range of each delete is small compared to the chunk time interval length.

8.1.5 Varying Update Percentage. We vary the frequency of update operations to evaluate the impact on query performance. An update is implemented by adding a normally distributed random value with a mean of 0 to the original value. As shown in Figure 21, M4-LSM still performs much better than M4 under a large number of overwrites, which mean more overlapping chunks as illustrated in Figure 7 in Example 4. The improvement is thus not surprising, referring to Figure 19 on chunk overlap.

8.1.6 Varying Query Time Range on the Dataset with Real Updates. We consider a dataset CQD1 with updates from real usage. The time series records the average length of all the observations in every 500 milliseconds. Note that 16% of the observations are delayed [43]. Therefore, the computed average length of the corresponding time slot needs to be updated when the delayed observations finally arrive. Figure 22(a) shows the number of updated points in various query time ranges. Our M4-LSM still has significantly better time performance than the original M4, as shown in Figure 22(b), even when there are more updated points. The reason is that M4-LSM is chunk-merge free, without merging the chunks with updates.
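The update workload described in 8.1.5 can be sketched as follows. The function name, the noise standard deviation, and the random choice of updated positions are assumptions for illustration; the experiment only specifies zero-mean normally distributed noise added to the original values:

```python
import random

def simulate_updates(points, update_fraction, sigma=1.0, seed=42):
    """Overwrite a fraction of points by adding zero-mean Gaussian noise,
    mimicking the update workload (sketch: `sigma` and which positions are
    updated are illustrative assumptions, not the paper's exact setup)."""
    rng = random.Random(seed)
    updated = list(points)
    k = int(len(points) * update_fraction)
    for idx in rng.sample(range(len(points)), k):
        t, v = updated[idx]
        # same timestamp, perturbed value => an overwrite in LSM-Tree terms
        updated[idx] = (t, v + rng.gauss(0.0, sigma))
    return updated
```

Writing such points with existing timestamps produces later chunks that overlap earlier ones, which is why Figure 21 behaves like the overlap experiment in Figure 19.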

Apply to MinMax Representation.
According to [17,41], we further implement LTTB [40] and MinMax [25] in Apache IoTDB, and compare M4-LSM with them. Moreover, since M4 returns the first/last/bottom/top points, our approach can be naturally applied to MinMax visualization by returning only the bottom/top points. Thus, we also conduct experiments to demonstrate how the proposed approach improves MinMax visualization, namely MinMax-LSM.
The visualization quality (i.e., DSSIM) comparison results in Figure 23(a) show that M4-LSM is as precise as M4, with DSSIM always equal to 1, meaning perfect (error-free) visualization. This is not the case for the lossy baselines LTTB and MinMax.

8.2.2 Apply to DenseLines Visualization. In addition to line charts, M4 can also be used to maintain the visual integrity of density-based visualizations such as DenseLines [36,49], which use the same rendering mechanism. Thereby, M4-LSM can naturally accelerate DenseLines visualization by enabling faster M4 query processing. In this experiment, we evaluate the efficiency improvement of DenseLines visualization on large-scale time series stored in Apache IoTDB. Figure 24(a) presents an example of DenseLines, visualizing 45 stock time series together. The results in Figure 24(b) are generally similar to those in Figure 3(a). Without applying M4, the original DenseLines is extremely costly. By integrating our proposal, DenseLines-M4-LSM shows significantly lower time costs.
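The density-based rendering that DenseLines relies on can be sketched minimally: each series contributes at most once per pixel, and the per-pixel counts form the density map. This is a simplified illustration under assumed parameters (real DenseLines rasterizes the connecting line segments and normalizes counts per column; here only the raw points are binned):

```python
def density_map(series_list, width, height, t_range, v_range):
    """Count, per pixel, how many series touch it (simplified DenseLines
    sketch: bins raw points only, no line rasterization/normalization)."""
    (t0, t1) = t_range
    (v0, v1) = v_range
    grid = [[0] * width for _ in range(height)]
    for series in series_list:
        seen = set()  # each series counts at most once per pixel
        for t, v in series:
            if t0 <= t < t1 and v0 <= v < v1:
                x = int((t - t0) / (t1 - t0) * width)
                y = int((v - v0) / (v1 - v0) * height)
                seen.add((x, y))
        for x, y in seen:
            grid[y][x] += 1
    return grid
```

Because the rendered pixels are determined by the same first/last/bottom/top points per column, feeding each series through M4 first (as DenseLines-M4-LSM does) shrinks the input without changing the touched pixels.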

RELATED WORK
While visualizing time series is highly demanded, e.g., for finding interesting patterns [33,47], a time series database native representation operator for visualization is surprisingly absent.

Representing Time Series for Visualization
A number of time series representations have been proposed after decades of research, including sampling [13], Discrete Fourier Transform (DFT) [11], Discrete Wavelet Transform (DWT) [12], Singular Value Decomposition (SVD) [28], Piecewise Aggregate Approximation (PAA) [27], Symbolic Aggregate approXimation (SAX) [32,39], piecewise polynomials [31], and shapelet-based representations [20,46]. In terms of visualization tasks, Park et al. [38] develop a visualization-aware sampling layer between the visualization client and the database backend to speed up queries for scatterplots and map plots. In contrast, M4 [24,25], as an in-DB data reduction method, is designed for the line chart, which is more suitable for the visualization of time series. Since M4 achieves zero pixel error in two-color (binary) line visualization, which is impossible with other data reduction techniques such as MinMax, we focus on M4 representation in time series databases.

LSM-Tree based Storage
Log-Structured Merge Tree (LSM-Tree) [37] is widely adopted as a storage backend by state-of-the-art key-value stores [21], including time series databases. This is because the LSM-Tree meets the performance requirements of both high-throughput writes and fast point reads in key-value stores. Research on LSM-Tree storage has flourished in recent years. For example, Idreos et al. [22] propose a unified design space spanning LSM-Trees, B-trees, logs, etc., and optimize the design of these data structures to improve the performance of NoSQL storage systems [14-16]. These lines of work are orthogonal to ours, as our focus is on the optimization of the (M4) query execution algorithm in LSM-Tree systems.

CONCLUSIONS
M4 representation [25] has been shown to be error-free in two-color line visualization of time series data. The method, however, is originally designed for relational databases, without considering the disordered points in LSM-Tree storage, which is widely adopted in commodity time series database systems. In this paper, we present M4-LSM, which merges no chunk in the LSM-Tree store. Metadata are utilized to prune chunks that surely do not contain representation points. To access data points in chunks that cannot be pruned, we observe the regular intervals of timestamps and introduce a step regression for efficient indexing. Moreover, we use a value regression function to prune the points that cannot be the top or bottom ones. The method has been deployed in Apache IoTDB, an open-source LSM-Tree time series database [42], and used in many companies across various industries, including rail transit, steel manufacturing, aviation, and cloud services. Extensive experiments over real-world datasets demonstrate that M4-LSM takes about 4 seconds to represent a time series of 127 million points in 1000 pixel columns, enabling instant visualization of four years of data collected at a frequency of one point per second.
However, our approach cannot directly accelerate time series visualizations like Largest-Triangle-Three-Buckets (LTTB) [40,41]. This is because LTTB selects in each group the point with the largest effective triangle area. Such triangle representations therefore require maintaining different statistics to avoid merges. We leave this extension as future work. As discussed in [24], visualization-driven data aggregation queries for scatter plots select the last record per pixel. Since M4 aggregation for line charts is at the pixel column level, M4-LSM cannot be directly used for scatter plots. We leave the extension of our techniques to support scatter plots as future work as well.
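To see why LTTB needs different statistics, consider its per-bucket selection criterion: the effective triangle area formed with the previously selected point and the average of the next bucket. The sketch below uses the standard shoelace formula; unlike first/last/min/max, this quantity depends on neighboring buckets, so per-chunk min/max metadata cannot answer it:

```python
def triangle_area(a, b, c):
    """Area of the triangle (a, b, c) via the shoelace formula."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    return abs((ax - cx) * (by - ay) - (ax - bx) * (cy - ay)) / 2.0

def lttb_pick(prev_pt, bucket, next_avg):
    """LTTB bucket selection: the point maximizing the effective triangle
    area with the previously selected point and the next bucket's average."""
    return max(bucket, key=lambda p: triangle_area(prev_pt, p, next_avg))
```

Because `prev_pt` and `next_avg` are only known at query time, the winning point per bucket cannot be precomputed into chunk metadata the way bottom/top points can.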

Fig. 1. M4 representation for time series visualization. Figure 1(b) illustrates 3 time spans (pixel columns) out of 1000. For each time span, M4 selects the first, last, bottom and top data points, denoted by red dots in Figure 1(b). Pixels covered by the connecting lines of consecutive representation points are colored black for line chart visualization.
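The per-span selection described in the caption can be written down in a few lines. This is a minimal reference sketch of plain M4 (not the M4-LSM operator), assuming in-order points and a hypothetical `Point(t, v)` record:

```python
from collections import namedtuple

Point = namedtuple("Point", ["t", "v"])

def m4(points, t_start, t_end, width):
    """Select the first/last/bottom/top point per pixel column
    (minimal M4 sketch; `points` assumed sorted by timestamp)."""
    spans = [[] for _ in range(width)]
    length = (t_end - t_start) / width
    for p in points:
        if t_start <= p.t < t_end:
            spans[int((p.t - t_start) / length)].append(p)
    result = []
    for span in spans:
        if not span:
            continue
        first = min(span, key=lambda p: p.t)   # FP
        last = max(span, key=lambda p: p.t)    # LP
        bottom = min(span, key=lambda p: p.v)  # BP
        top = max(span, key=lambda p: p.v)     # TP
        result.append((first, last, bottom, top))
    return result
```

At most 4 points per pixel column survive, which is why the reduced series renders pixel-identically to the full one.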

Fig. 5. Schematic diagrams of chunks and deletes. (a) shows a three-dimensional space composed of time, value and version number.

Proc. ACM Manag. Data, Vol. 2, No. 1 (SIGMOD), Article 35. Publication date: February 2024.

Fig. 6. Example of Algorithm 1 steps. (a) Input of each iteration includes chunk metadata, deletes and the time span. (b) Candidates (red dots) are generated from the chunk metadata in Section 4.1. (c) Candidate verification is performed to verify whether the candidate points are invalid (hollow dots) in Sections 4.2 and 4.3. (d) If the candidate point is invalid, new candidates are generated. Chunk access operations: (CT) checks if a data point exists at a given timestamp (Section 5.1); (GT) gets the closest data point after/before a given timestamp (Section 5.1); (MV) gets the undeleted data point with the minimal/maximal value (Section 5.2).

Fig. 10. Example of pruning the points in green by the error bounds of value regression

Fig. 14. Cutting fluid pressure diagnosis of a true anomaly at 11:00 fails in (a) by comparing limited data, but succeeds with M4-LSM in (b) by contrasting with other spikes

Fig. 16. Track crack width analysis fails to find a clear pattern by visualizing only 10 time series with M4 in a minute, but succeeds with M4-LSM showing 50 time series at a time

Fig. 23. Comparing M4-LSM with baselines in terms of DSSIM and query time. A fair comparison is achieved by ensuring that all methods return the same number of points.

Table 1. Notations and explanations
- FP, LP, BP, TP: the FirstPoint, LastPoint, BottomPoint and TopPoint representation functions
- [t_s, t_e): the time range of M4 representation
- w: the number of time spans in M4 representation
- I_i: the i-th time span of M4 representation, corresponding to the i-th pixel column of the line chart
- T_i: the subsequence of T that falls in the time span I_i
- v: the version number
- C_v: the chunk with version number v
- D_v: the delete operation with version number v
- [D_v.t_s, D_v.t_e]: the time range of the delete
- C: the set of all chunks for the given time series
- D: the set of all deletes for the given time series
- M(C, D): the merge function

(4) TopPoint representation function, denoted as TP: T → P, returns any one of the points with the maximal value, i.e., P_top ∈ {P* ∈ T | P.value ≤ P*.value, ∀P ∈ T}.
Given the set of chunks C, the set of deletes D and the M4 time span I_i as input, M4-LSM deals with the problem of computing f(M(C, D'_i)) for f ∈ {FP, LP, BP, TP}. For simplicity, we omit the subscript i in D'_i in the rest of the paper when no ambiguity arises.

Algorithm 1:
Input: time series T, query range [t_s, t_e), the number of time spans w
Output: {f(I_i) | f ∈ {FP, LP, BP, TP}}, i = 1, ..., w
1: determine all time spans I_i by [t_s, t_e) and w
2: read the metadata of all chunks C of time series T in the time range [t_s, t_e)
3: read all deletes D of time series T in the time range [t_s, t_e)
4: for each time span I_i do
5:   union the virtual deletes of I_i with D into D'
6:   for f ∈ {FP, LP, BP, TP} do
7:     while f(I_i) is not computed do
8:       generate the candidate point P_c in C for f (Section 4.1)
9:       verify candidate P_c (Sections 4.2 and 4.3), with (CT) checking if P_c is overwritten (Section 5.1)
10:      if P_c is not the latest (Propositions 1 and 2) then
11:        if chunk lazy loading applies then
12:          update chunk metadata without loading chunk data (Sections 4.2 and 4.3)
13:        else
14:          if f ∈ {FP, LP} then
15:            update chunk metadata for delete using (GT) in Section 5.1
16:          else
17:            update chunk metadata for delete or update using (MV) in Section 5.2
18:      else
19:        set f(I_i) = P_c
20: return {f(I_i) | f ∈ {FP, LP, BP, TP}}, i = 1, ..., w

Finally, the point with the largest version number in P'_c is the candidate point P_c, i.e., P_c = arg max_{P ∈ P'_c} P.v, where P.v is the version number of the chunk C_v that P belongs to. To sum up, the candidate point is the one with the largest version number among the metadata points satisfying the representation condition.

4.2 FP/LP Candidate Verification
We now verify whether the candidate point P_c is valid as the result of f(M(C, D')) for f ∈ {FP, LP}.

Proposition 1 (Latest candidate point for FP/LP). For f ∈ {FP, LP}, if P_c.t is not covered by any delete D_v in D' with a version number v larger than P_c.v, i.e., ∀D_v ∈ D' with v > P_c.v, P_c.t ⊭ D_v, then the candidate point P_c is the latest, and thus the result of f(M(C, D')).
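The candidate generation and verification loop of Algorithm 1 can be sketched for FP within one span. This is a deliberately simplified illustration: it loads chunk points instead of working purely on metadata (the real operator's key optimization), assumes integer timestamps, and uses hypothetical data layouts (`{"points": ..., "version": ...}` chunks and `(t_start, t_end, version)` deletes):

```python
def first_point(chunks, deletes, span_start, span_end):
    """Candidate generation/verification sketch for FP in one time span.
    A candidate invalidated by a later-versioned delete triggers a new
    iteration over the shrunk span, mirroring Algorithm 1's while-loop."""
    while True:
        # candidate generation: the earliest in-span point of each chunk
        cands = []
        for c in chunks:
            pts = [p for p in c["points"] if span_start <= p[0] < span_end]
            if pts:
                cands.append((min(pts), c["version"]))
        if not cands:
            return None
        # among equally-early candidates, the largest version wins
        t_min = min(pt[0] for pt, _ in cands)
        point, version = max((pc for pc in cands if pc[0][0] == t_min),
                             key=lambda pc: pc[1])
        # verification: covered by a later delete? then shrink span, retry
        hit = [d for d in deletes
               if d[2] > version and d[0] <= point[0] <= d[1]]
        if not hit:
            return point
        span_start = max(d[1] for d in hit) + 1
```

The tie-break by largest version number implements the arg max over P'_c, and the span shrinking corresponds to the metadata update in lines 11-17 of Algorithm 1.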
P'_c = {P* ∈ P_c | P*.value ≤ P.value, ∀P ∈ P_c}, for f = BP;
P'_c = {P* ∈ P_c | P*.value ≥ P.value, ∀P ∈ P_c}, for f = TP.

Firstly, M4-LSM retrieves the candidate point P_c = FP(C_2) (denoted by the red dot) from P'_c = {FP(C_1), FP(C_2)}. Next, M4-LSM verifies P_c as non-latest, because P_c.t is covered by D_3. With the lazy loading strategy, M4-LSM updates the time interval of C_2 to be [D_3.t_e, LP(C_2).t] without eagerly loading the chunk data. Likewise, the time interval of C_1 is updated as [D_3.t_e, LP(C_1).t]. The next iteration of candidate generation and verification starts by finding FP(C_4) as the new candidate point and ends by outputting the verified latest FP(C_4) as the representation result, without loading C_1 and C_2.

4.3 BP/TP Candidate Verification
Next, we verify whether the candidate point P_c is valid as the result of f(M(C, D')) for f ∈ {BP, TP}. Note that FP/LP can be verified by only checking the deletes in Proposition 1. The reason is that for FP/LP, all candidates in P'_c are at the same time, and thus the candidate point P_c with the largest version number from P'_c will never be updated. However, this is not the case for BP/TP candidate verification. In addition to deletes, we need to further consider whether BP/TP candidates are updated by other chunks.

Proposition 2 (Latest candidate point for BP/TP). For f ∈ {BP, TP}, if P_c.t is not covered by any chunk in C with a version number larger than P_c.v, and P_c.t is not covered by any delete in D' with a version number larger than P_c.v, i.e., (∀C_v ∈ C with v > P_c.v, P_c.t ⊭ C_v) ∧ (∀D_v ∈ D' with v > P_c.v, P_c.t ⊭ D_v), then the candidate point P_c is the latest, and thus the result of f(M(C, D')).

Definition 7 (Time index). Given a chunk C_v = {P_1, ..., P_|C_v|} in increasing order of time, and a lookup timestamp t*: (CT) to check if a data point exists at t*, the time index returns TRUE if t* ∈ {P_1.t, ..., P_|C_v|.t}, and FALSE otherwise; (GT-1) to get the position of the closest data point after t*, the time index returns arg min_j {P_j.t | P_j.t > t*, P_j ∈ C_v}; (GT-2) to get the position of the closest data point before t*, the time index returns arg max_j {P_j.t | P_j.t < t*, P_j ∈ C_v}.
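The CT/GT contract of Definition 7 can be demonstrated with plain binary search over the sorted timestamps. Note this is only a behavioral sketch of the interface: the paper's actual time index answers these lookups via step regression over the (nearly regular) timestamps rather than via per-chunk search:

```python
import bisect

def ct(times, t_star):
    """(CT) Does a data point exist at timestamp t_star?"""
    i = bisect.bisect_left(times, t_star)
    return i < len(times) and times[i] == t_star

def gt_after(times, t_star):
    """(GT-1) Index of the closest point strictly after t_star, or None."""
    i = bisect.bisect_right(times, t_star)
    return i if i < len(times) else None

def gt_before(times, t_star):
    """(GT-2) Index of the closest point strictly before t_star, or None."""
    i = bisect.bisect_left(times, t_star)
    return i - 1 if i > 0 else None
```

Any implementation satisfying this contract can serve candidate verification; the step regression simply answers the same lookups in O(1) for regularly spaced timestamps.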
The timestamp delta for the i-th data point P_i is P_{i+1}.t − P_i.t, where 1 ≤ i ≤ 999. The median of the timestamp deltas {P_{i+1}.t − P_i.t | P_i, P_{i+1} ∈ C_v} is used to set the threshold. Only two data points, P_223 and P_224, have their timestamp deltas larger than the threshold. As a result, the set of changing points is P_s = {P_223, P_225}.

Calculate Intercepts. Next, we calculate the intercept for each segment. With |P_s| changing points, we know that the step regression function has |P_s| + 1 segments. In other words, m = |P_s| + 2.
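Detecting changing points from timestamp deltas can be sketched as follows. The threshold rule here (median delta plus k times the median absolute deviation) is an illustrative assumption standing in for the paper's exact threshold, which is only partially recoverable from the text:

```python
def changing_points(times, k=3.0):
    """Positions whose preceding timestamp delta deviates from the typical
    interval (sketch: threshold = median delta + k * MAD, an assumption;
    a changing point starts a new segment of the step regression)."""
    deltas = [times[i + 1] - times[i] for i in range(len(times) - 1)]
    med = sorted(deltas)[len(deltas) // 2]
    mad = sorted(abs(d - med) for d in deltas)[len(deltas) // 2]
    thr = med + k * max(mad, 1e-9)  # guard against zero deviation
    # delta i sits between positions i and i+1; the segment starts at i+1
    return [i + 1 for i, d in enumerate(deltas) if d > thr]
```

With the changing points found, one intercept per segment (plus the shared slope) yields the step regression used for O(1) timestamp lookups.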