PreVision: An Out-of-Core Matrix Computation System with Optimal Buffer Replacement

Large-scale matrix computations have become indispensable in artificial intelligence and scientific applications. It is thus of paramount importance to efficiently perform out-of-core computations, which often entail an excessive amount of disk I/O. Unfortunately, most existing systems do not focus on disk I/O and are vulnerable to performance degradation when the scale of input matrices and intermediate data grows large. To address this problem, we present a new out-of-core matrix computation system called PreVision. PreVision achieves optimal buffer replacement by leveraging the deterministic characteristics of data access patterns, and it avoids redundant I/O operations by proactively evicting pages that are no longer referenced. Through extensive evaluations, we demonstrate that PreVision outperforms existing out-of-core matrix computation systems and significantly reduces disk I/O operations.


INTRODUCTION
The ability to handle large-scale matrix computations has become increasingly crucial for artificial intelligence and scientific applications. Many solutions have already been proposed for matrix computations, ranging from easy-to-use packages to sophisticated distributed systems. While easy-to-use packages such as NumPy [16] and SciPy [44] are widely used for simple computational tasks on a standalone workstation, distributed systems such as SystemDS [9] and SciDB [39] focus mainly on parallelizing large-scale workloads over many computing nodes.
Large-scale matrix computation tasks such as PageRank [30] may suffer from an excessive amount of disk I/O when large input matrices must be loaded into memory in each update iteration. The problem becomes dire if a large quantity of intermediate data is produced, spilled to disk, and reloaded into memory for further processing repeatedly. A vicious pattern like this happens more often than not in matrix computations (e.g., non-negative matrix factorization (NMF) [23]). Most of the existing distributed systems, however, do not address disk I/O challenges and are hardly optimized for out-of-core matrix computations, let alone the easy-to-use packages with limited support for out-of-core computations. For example, SystemDS and MLlib [28] rely on Spark [46], which spills the data produced by one stage to disk for the next stage. NumPy, one of the popular easy-to-use packages, uses a naive data layout to store matrices on disk, contributing to inefficient data access. Disk I/O remains a costly operation and accounts for a large portion of the processing time of matrix computation tasks.
A new trend in query planning further aggravates the disk I/O problem. Traditionally, relational database management systems adopt a tree-structured query plan for the simplicity of expressing a relational algebra query. On the other hand, queries are now modeled as a directed acyclic graph (DAG) more often than before, particularly for tasks requiring multiple accesses to the same data source [9, 46]. In a DAG query plan, the output produced by an operator can be fed into one or more operators as input. Consequently, the same output may have to be reproduced repeatedly for multiple consuming operators. A common approach to avoiding such redundant computation is to materialize the output data at least partially. For instance, Spark provides a function persist() to avoid the repeated computation of a Resilient Distributed Dataset (RDD). The materialized data inevitably add to the working set of a DAG query plan, which lowers the buffer hit ratio and increases disk I/O operations.
Buffer replacement algorithms have been studied extensively for the past decades because they play a critical role in reducing disk I/O operations [21, 26, 27, 29]. Among those, the optimal buffer replacement algorithm OPT [26] is known to be infeasible to implement in most practical scenarios due to its reliance on knowledge of future events. When it comes to out-of-core matrix computation, however, it is indeed possible to accurately predict the entire sequence of data chunks to be accessed. Consider a large disk-resident matrix stored as a group of tiles of varying shapes and sizes [38]. With those tiles as the unit of paging between memory and disk, the entire sequence of tile accesses can be predicted at the planning stage of a given query by taking advantage of the deterministic nature of matrix computation algorithms [42].
This paper presents PreVision, a data processing system designed for large out-of-core matrix computations. For a given matrix computation task modeled as a DAG query plan, PreVision can apply the optimal buffer replacement algorithm by leveraging the entire page access pattern, which can be accurately predicted for the DAG query plan. In addition, PreVision takes a proactive approach to buffer replacement by preemptively evicting data that are no longer referenced. By evicting data immediately once they are known to have no future references, PreVision can effectively increase the utilization and the hit ratio of the buffer pool.
The contributions of this work are as follows:
• We have developed an optimal buffer replacement algorithm for a DAG-structured query plan. Optimal buffer replacement becomes feasible at the nominal cost of a run-time analysis that produces the future log of references for a given DAG query plan.
• We propose a preemptive eviction strategy that eagerly victimizes data that are no longer referenced. The future log is used to detect such data so that they can be evicted from the buffer pool immediately.
• We present the implementation details of PreVision, focusing on buffer management issues such as variable-length page allocation, a few design decisions for the buffer manager, and a disk-resident file format.
• We demonstrate experimentally that PreVision outperforms existing open-source solutions for large-scale matrix computation tasks. The experimental results corroborate the efficacy of the I/O mitigation methods of PreVision.
This paper is organized as follows. In Section 2, we describe in detail how a DAG query plan is executed and introduce the future log of references that can be created by analyzing a given query plan. Section 3 and Section 4, respectively, describe how the OPT buffer replacement algorithm can be applied and how data with no future references can be evicted preemptively using the future log. The implementation details of PreVision are presented in Section 5, and an extensive evaluation with detailed analysis is presented in Section 6. Previous studies related to this paper are summarized in Section 7. Lastly, we conclude this work in Section 8.

LOOKING INTO THE FUTURE
In a matrix computation, the shapes of intermediate and output matrices can be predicted accurately. Besides, block matrix algorithms [42] generally have deterministic data access patterns. Consider, for example, a matrix multiplication A × B performed by a block matrix multiplication algorithm. The shape of the output matrix is immediately determined by the number of rows in matrix A and the number of columns in matrix B. The matrix multiplication algorithm accesses the row tiles of matrix A and the column tiles of matrix B to produce output tiles. The order of tile pairs to multiply is fixed, and the entire sequence of tile accesses by this algorithm can be predicted accurately. In comparison with a relational join operation, the data access pattern of a block matrix algorithm can be more deterministic for input, intermediate, and output matrices.
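To make this determinism concrete, the following Python sketch (purely illustrative; the function name and tile encoding are our own, not PreVision's) enumerates the tile reference sequence of a blocked matrix multiplication. Because the loop order is fixed, the whole sequence is known before any tile is touched:

```python
# Sketch: enumerating the tile reference sequence of a blocked matrix
# multiplication C = A x B. The loop order is fixed, so the sequence is
# fully determined before execution begins.

def tile_reference_sequence(m, k, n):
    """Yield (matrix, row, col) tile references for C[i,j] += A[i,p] * B[p,j],
    where A is m x k tiles, B is k x n tiles, and C is m x n tiles."""
    refs = []
    for i in range(m):
        for j in range(n):
            for p in range(k):
                refs.append(("A", i, p))
                refs.append(("B", p, j))
                refs.append(("C", i, j))   # partial result accumulated
    return refs

refs = tile_reference_sequence(2, 2, 2)
print(len(refs))   # 2*2*2 tile products, 3 tile references each -> 24
print(refs[:3])    # [('A', 0, 0), ('B', 0, 0), ('C', 0, 0)]
```

The same enumeration applies to transposition and element-wise operators, whose access orders are likewise fixed by the algorithm.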
A matrix computation task may have more than one operator in its DAG query plan. By analyzing the dependency among the operators in the query plan, PreVision selects an order of operators that will be carried out by the executor. By leveraging the fact that both the tile access pattern of an individual operator and the execution order of the operators are deterministic, the query planner of PreVision constructs a tile reference sequence by simulating the query plan. The constructed reference sequence is stored in a future log and passed to the buffer manager before the execution of the query plan begins.
This section presents the overview of query execution steps taken by PreVision for a matrix computation task (in Section 2.1). It then describes how individual operators in a DAG-structured query plan are executed and how a future log is constructed by simulating a given query plan in detail with concrete examples (in Section 2.2 and Section 2.3).

Overview of PreVision
The overall steps of query processing by PreVision are illustrated in Fig. 1. PreVision supports the Array data type and provides a set of Application Programming Interfaces (APIs) for linear algebra functions. The Array data type represents disk-resident arrays and temporary arrays produced by the linear algebra functions. A user task is typically expressed as a series of linear algebra functions and is activated by invoking the compute() function. The activated user task is submitted to PreVision as a query, for which the query planner constructs a DAG-structured query plan. The query planner performs query rewrites if necessary, builds a future log, and passes it to the buffer manager.
The query plan is then executed by the query executor, which works with the buffer manager to access array data and perform computations on them. The buffer manager adopts the OPT buffer replacement algorithm by leveraging the future log, enables an O(1)-time victim selection when the buffer pool is full, and evicts obsolete data preemptively so that the buffer pool is not occupied by data that will never be referenced again.
PreVision splits an array into a group of tiles so that it can take block matrix approaches [42] with those tiles to perform matrix computations. The buffer manager treats the tiles as the unit of buffering, like pages. Unlike fixed-length pages, however, tiles can be of different shapes and lengths. Tiles of the same shape may differ in size due to different densities of the tiles.

Executing a DAG Query Plan
In PreVision, a query plan is represented as a directed acyclic graph (DAG). A DAG query plan is composed of nodes and edges; each node represents an operation, and each edge indicates the direction of data flow from one node to another. The source and destination of an edge are called a producer node and a consumer node, respectively. A node designated by a user with the compute() function is referred to as an output node. For example, in Fig. 1, the query plan of a matrix computation task AA^T involves three nodes, namely OPEN, TRANS, and MATMUL. The OPEN node represents an operation that fetches a disk-resident array A. The TRANS and MATMUL nodes are responsible for a transposition operation and a matrix multiplication operation, respectively. The query plan also involves three directed edges, namely OPEN → TRANS, OPEN → MATMUL, and TRANS → MATMUL. The first two edges indicate that the same producer node OPEN feeds the two consumer nodes TRANS and MATMUL with the array it fetches from disk. MATMUL is designated as the output node as it returns the result of the task to the user.
The query executor of PreVision creates an iterator for each node in the query plan. Each iterator includes a common function called getPos() to fetch a tile. A tile request is made by invoking the getPos() function with the coordinates of the tile. While a query is being processed, each iterator maintains a Boolean state for each tile it is responsible for, indicating whether the tile has been computed or not. When a request is made for a tile already computed and marked as true, the iterator returns the computed tile without redoing the computation. Redundant computations can be avoided this way.

Fig. 2. The tile access pattern of AA^T for 2 × 2 tiled matrices
PreVision initiates the processing of a query by requesting tiles from the iterator of the output node in the query plan. The iterator of the output node then sends requests for all the necessary tiles to the iterators of the corresponding producer nodes. This operation propagates recursively down to all the relevant descendant nodes of the query plan.
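The pull-based iterator protocol above can be sketched as follows. This is a minimal Python model, not PreVision's implementation; the `Node` class is our own, and a per-position cache stands in for the Boolean per-tile state:

```python
# Minimal sketch of the pull-based iterator model: each node memoizes the
# tiles it has computed so that a tile requested by several consumers is
# built only once. Names (getPos) follow the text; internals are illustrative.

class Node:
    def __init__(self, name, compute_fn, inputs=()):
        self.name = name
        self.compute_fn = compute_fn   # builds one tile from input tiles
        self.inputs = list(inputs)     # producer nodes
        self.done = {}                 # pos -> computed tile (state + cache)
        self.calls = 0                 # counts actual computations

    def getPos(self, pos):
        if pos in self.done:           # already computed: no recomputation
            return self.done[pos]
        self.calls += 1
        child_tiles = [child.getPos(pos) for child in self.inputs]
        tile = self.compute_fn(pos, child_tiles)
        self.done[pos] = tile
        return tile

# OPEN -> TRANS: requesting the same transposed tile twice computes it once.
open_a = Node("OPEN", lambda pos, _: ("A", pos))
trans = Node("TRANS", lambda pos, tiles: ("A^T", pos), inputs=[open_a])
trans.getPos((0, 0))
trans.getPos((0, 0))
print(trans.calls)   # 1
```

A real transposition would map coordinates (i, j) to (j, i) when requesting from its producer; the sketch omits this to keep the memoization logic in focus.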
Example 1. Fig. 2 illustrates how PreVision performs a matrix computation task C = AA^T, where A, A^T, and C are 2 × 2 tiled matrices. In the figure, each arrow shows the flow of data between nodes and is annotated with a circled number to indicate the order of execution.
When the first output tile C_{0,0} is requested by the executor of PreVision, the iterator of the output node makes tile requests to initiate a matrix multiplication, as the requested tile C_{0,0} is not computed yet. Since C_{0,0} = A_{0,0} A^T_{0,0} + A_{0,1} A^T_{1,0}, the iterator first attempts to fetch A_{0,0} and A^T_{0,0}. Fetching A_{0,0} and A^T_{0,0} is done by sending a tile request to the iterator of A and sending another tile request to the iterator of A^T. While fetching A_{0,0} can be done immediately by its iterator (①), the iterator of A^T has to send yet another tile request to the iterator of A (②) to initiate a matrix transposition. Upon completion of the matrix transposition, the iterator of A^T returns the A^T_{0,0} tile to the iterator of C (③). The iterator of C then carries out a matrix multiplication with the A_{0,0} and A^T_{0,0} tiles. To perform the next matrix multiplication A_{0,1} A^T_{1,0}, the iterator of C requests the A_{0,1} and A^T_{1,0} tiles from its descendant iterators (④, ⑤, and ⑥) in a similar way. Finally, when both A_{0,0} A^T_{0,0} and A_{0,1} A^T_{1,0} become available, the iterator of C computes the C_{0,0} tile and returns it to the user (⑦). All these steps are summarized in Fig. 2a.
The iterator of C repeats the same steps to compute C_{1,0}. Note, however, that the iterator of A may not have to be involved in producing the transposed tiles for C_{1,0}, because A^T_{0,0} and A^T_{1,0}, which are necessary for C_{1,0}, have already been computed by the iterator of A^T. The two transposed tiles can simply be passed from it to the iterator of C without making any new request to the iterator of A (② and ④). These steps are summarized in Fig. 2b. □

Constructing a Future Log
The future log is implemented as a hash table. A hash element is created for each tile referenced by a given query. Each hash element contains (1) the metadata of a tile, (2) a list of reference events, and (3) a pointer to the next reference event. The metadata of a tile is used as a hash key in the hash table. Since a tile can be referenced more than once, a list of reference events is stored in the hash element, and each of the reference events is essentially a logical timestamp of the corresponding reference. See Fig. 3 for an example of a future log.

Fig. 3. An example of a future log (hash elements for the A^T_{0,0} tile with events [1, 3, 21] and the C_{0,0} tile with events [4, 9])
The reference events are obtained by having the query planner of PreVision simulate the given query. The query planner initializes a logical clock to zero when it begins traversing the query plan. Following the prescribed execution procedure, when the query planner detects a buffer request for a tile, it appends the current timestamp of the logical clock to the list of events in the corresponding hash element. If the tile has not been referenced before and the hash element is not found in the future log, the planner creates a new hash element and adds it to the future log, using the metadata of the tile as the hash key. It then increments the logical clock by one. Since the logical clock never decrements, the list of events of a tile is sorted in ascending order of timestamps.
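A minimal sketch of this construction follows, with a plain dict standing in for the hash table (the class and field names are ours; PreVision's actual structures may differ):

```python
# Sketch of future-log construction: a logical clock ticks once per tile
# reference during plan simulation; each tile key maps to its list of
# reference timestamps, appended in ascending order.

class FutureLog:
    def __init__(self):
        self.entries = {}   # tile key -> {"events": [...], "ptr": index}
        self.clock = 0

    def record(self, key):
        entry = self.entries.setdefault(key, {"events": [], "ptr": 0})
        entry["events"].append(self.clock)   # clock never decrements
        self.clock += 1

    def next_ts(self, key):
        """Timestamp of the next reference, or infinity when none is left."""
        entry = self.entries[key]
        if entry["ptr"] >= len(entry["events"]):
            return float("inf")
        return entry["events"][entry["ptr"]]

    def advance(self, key):
        self.entries[key]["ptr"] += 1   # consume one reference event

log = FutureLog()
for key in ["A00", "AT00", "A00", "AT00", "C00"]:   # simulated references
    log.record(key)
print(log.entries["AT00"]["events"])   # [1, 3]
```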
Example 2. Fig. 3 illustrates the future log entries of the A^T_{0,0} and C_{0,0} tiles described in Example 1. Suppose that the query planner begins simulating the query plan by applying a matrix transposition operation to the A_{0,0} tile. The A_{0,0} tile will be the first one to be requested, followed by an empty tile for A^T_{0,0} that will store the transposed tile. Thus, a hash element for A_{0,0} is created and inserted into the future log with timestamp zero, and another hash element for A^T_{0,0} is created and inserted into the future log with timestamp one. When A_{0,0}, A^T_{0,0}, and C_{0,0} are referenced next by the matrix multiplication in Example 1, a new hash element is created for C_{0,0}, and three new reference events with timestamps two, three, and four, respectively, are added to their corresponding hash elements in the future log. The simulation continues in this manner, adding more hash elements and reference events to the future log until the query plan is traversed completely. Fig. 3 shows just a few of the reference events of the A^T_{0,0} and C_{0,0} tiles. □

The future log is usually small enough to fit in memory. For example, when PreVision performed a logistic regression (LR) task on a 100 × 1 tall-skinny tiled matrix with three iterations, the future log was no more than 300 kilobytes in size. (See Section 6 for the details of the task.)

OPTIMAL TILE REPLACEMENT
The buffer manager of PreVision runs the OPT buffer replacement algorithm with a future log provided by the query planner. Optimal buffer replacement is achieved by choosing for replacement the buffer frame whose next reference is the farthest in the future. Each tile cached in the buffer pool is associated with the timestamp of its next reference (denoted by next_ts), which can be retrieved from the future log.
For the fast selection of a victim, all the cached tiles are maintained in a doubly linked skip list [32] in increasing order of their next_ts values. A victim selection can be done in O(1) time because an unpinned tile with the greatest next_ts value is always found at the tail end of the skip list.

Fig. 4. An example of the list update process for a buffer hit

In general, if a large number of pinned tiles had to be skipped, the victim selection time could indeed exceed O(1). However, this does not happen with PreVision, which executes only a single operation at a time; the number of pinned tiles is always a constant, namely the number of operand and output tiles of the operator being executed. The timestamps of future references are determined by a logical clock that ticks once every tile reference. Thus, the next_ts values of all cached tiles are greater than or equal to the current clock at any moment in time. This also implies that, on a buffer hit, the next tile to reference is always at the head of the skip list. When this tile is actually referenced, it is removed from the skip list and inserted back with a renewed next_ts value. On a buffer miss, the buffer manager loads the referenced tile from disk, sets the next_ts value of the tile by accessing the future log, and inserts the tile into the skip list. Fig. 4 illustrates the steps of a list update operation when a buffer hit occurs.
In the future log, there is a hash element for each tile to be referenced. The hash element of a tile stores its list of reference events in chronological order. To determine the next_ts value of a tile quickly, the hash element of the tile includes a pointer ptr to the next reference event. The pointer ptr advances by one slot within the list after each reference. When there is no event left in the hash element, ptr is set to NULL, and the next_ts is set to infinity.
Whenever a tile is referenced, the buffer manager updates the next_ts value for the referenced tile, and the new next_ts value can be obtained in O(1) time by following the ptr in the event list of the hash element. The buffer manager then inserts the tile into the skip list, which takes time logarithmic in the size of the skip list. Overall, each tile reference requires O(log n) time for maintaining the skip list, where n is the number of tiles in the skip list. Although the overhead may seem high, it is negligible considering the dominant I/O cost of out-of-core matrix computations and the fact that n is typically quite small. In our experiments, n did not exceed 100 while an LR task was carried out. (See Section 6 for details.)

Example 3. Fig. 4 shows how the skip list and the future log are updated when a buffer hit occurs by a tile reference. The tile reference can be considered a replay of the query simulation preserved in the future log. Thus, for example, when the logical clock of the buffer pool strikes 760, a tile with a next_ts value of 760 is referenced. Assuming that the tile is cached in the buffer pool, the buffer manager temporarily removes the tile from the skip list and searches for the next reference event in the future log by advancing the ptr in the tile's hash element. Since the timestamp of the next event is 3053, the ptr advances from the current event 760 to the next event 3053. Then, the buffer manager updates the next_ts value of the tile to 3053 and inserts the tile back into the skip list. In the case of a buffer miss, where the tile with next_ts value 760 does not exist in the skip list, the procedure of removing the tile from the skip list is replaced with the procedure of loading the tile from disk. □

Optimality. The OPT buffer replacement algorithm is known to be optimal for a fixed number of fixed-length buffer frames [6]. However, the optimality of the algorithm is not guaranteed when dealing with variable-length buffer frames. Since the tiles of PreVision are not of the same length, adopting the OPT algorithm does not always guarantee optimal buffer replacement. Nonetheless, the OPT algorithm still yields higher buffer hit ratios than the LRU-K [29] and Most Recently Used (MRU) algorithms do. The impact of the OPT algorithm will be discussed further in Section 6.4.
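The replacement procedure of this section can be sketched as follows. This simplified Python stand-in replaces the doubly linked skip list with a `bisect`-maintained sorted list, ignores pinned tiles, and uses our own class names; it is illustrative only, not PreVision's implementation:

```python
import bisect

# Stand-in for the next_ts-ordered skip list: a sorted list of
# (next_ts, tile) pairs. The victim is the tile with the greatest
# next_ts (tail); each reference consumes one future-log event and
# reinserts the tile with its renewed next_ts.

class OptBuffer:
    def __init__(self, capacity, future):
        self.capacity = capacity
        self.future = dict(future)   # tile -> remaining reference timestamps
        self.order = []              # sorted (next_ts, tile) pairs
        self.cached = set()
        self.misses = 0

    def _next_ts(self, tile):
        return self.future[tile][0] if self.future.get(tile) else float("inf")

    def reference(self, tile):
        if tile in self.cached:      # hit: pull the tile out of the list
            i = next(i for i, (_, t) in enumerate(self.order) if t == tile)
            self.order.pop(i)
        else:                        # miss: load, evicting if necessary
            self.misses += 1
            if len(self.cached) >= self.capacity:
                _, victim = self.order.pop()   # farthest next reference
                self.cached.discard(victim)
            self.cached.add(tile)
        self.future[tile] = self.future.get(tile, [])[1:]  # consume event
        bisect.insort(self.order, (self._next_ts(tile), tile))

refs = ["a", "b", "c", "a", "b", "a"]
future = {}
for ts, t in enumerate(refs):       # build the future log for this trace
    future.setdefault(t, []).append(ts)
buf = OptBuffer(2, future)
for t in refs:
    buf.reference(t)
print(buf.misses)   # 4
```

On this trace with two frames, the policy evicts "b" when "c" arrives (its next reference is the farthest away) and later evicts "c" (never referenced again), yielding four misses, the minimum possible.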
Conditional Control Flow. One of the challenges arising in query processing is potential data dependency. For instance, the termination condition of an iterative algorithm needs to be evaluated in every iteration, and the subsequent operations remain undecided until the evaluation of the termination condition is complete. PreVision breaks such a data dependency by separating a query into two subqueries: one for the computation of prerequisite values and the other for the rest of the computation. While this strategy may lead to increased query planning overhead, the increase has only a negligible effect on query performance. We will discuss the performance impact of query planning in Section 6.4.
Applicability. The proposed method for tile replacement is based on the future log and tailored for processing a single query at a time. Following the sequential execution model described in Section 2, the tile access pattern is predicted at the planning stage of an individual query. If multiple queries are processed concurrently, the optimality of tile replacement is not guaranteed because the chronological order of the reference events of a given tile is no longer deterministic. Consequently, the list of reference events saved in the future log may not match the sequence of actual references, and the optimal selection protocol can be violated. Note, however, that out-of-core matrix computations can still take advantage of the optimal tile replacement of PreVision, because they are often adopted in a long-running data analytic application administered as a sole task.

PREEMPTIVE EVICTION
In a DAG-structured query plan, intermediate data produced by an operator may be consumed by one or more operators. Unless all the consuming operators are perfectly synchronized with the producing operator, some or all of the intermediate data will have to be materialized and preserved on disk until they are no longer needed. Once the intermediate data are materialized, they can be accessed by any operator, and they can be cached in the buffer pool in the same way as disk-resident matrices are. In the presence of multiple consuming operators accessing the same intermediate data, it is difficult for the buffer manager to predict when the data will no longer be accessed and need not be cached any further. Consequently, the intermediate data tend to stay in the buffer pool longer than they should and lower the buffer hit ratio.
To address this problem of obsolete intermediate data lingering in the buffer pool, PreVision adopts a preemptive eviction strategy. As tiles are the unit of paging, the buffer manager evicts tiles eagerly as soon as it determines that they are of no more use. The buffer manager can identify obsolete tiles in the buffer pool by analyzing the reference events stored in the future log. When a tile is referenced by its last event in the future log, the next_ts of the tile is set to infinity. When such a tile is unpinned, it can be immediately evicted from the buffer pool (without being flushed to disk).

Example 4. Consider the computation of C_{1,0} = A_{1,0} A^T_{0,0} + A_{1,1} A^T_{1,0} for the matrix computation task described in Example 1. Since the A^T_{0,0} and A^T_{1,0} tiles were obtained while C_{0,0} and C_{0,1} were computed, the executor does not need to access the A matrix again to produce the transposed tiles for C_{1,0}. Thus, the computation of C_{1,0} only requires four tiles, namely, A_{1,0}, A^T_{0,0}, A_{1,1}, and A^T_{1,0}. The left and right sides of Fig. 5, respectively, show the future log records of the A_{1,0}, A^T_{0,0}, and C_{1,0} tiles before and after these tiles are referenced.

Fig. 5. Changes of future log entries after buffer requests

When A_{1,0} is referenced at logical clock 20, the ptr field of its future log record advances to the next slot in the list of reference events so that ptr points to 26 instead of 20. The next_ts value of the tile is also updated from 20 to 26. When A^T_{0,0} is referenced at logical clock 21, the ptr field of its future log record is set to NULL because there is no future reference left in the list. Its next_ts value is set to infinity to indicate that A^T_{0,0} will no longer be referenced. Similarly to A_{1,0}, when C_{1,0} is referenced at logical clock 22, its ptr field advances from 22 to 25, and its next_ts value is set to 25.
The A_{1,0} and A^T_{0,0} tiles are unpinned when the matrix multiplication operation is completed. Since the next_ts value of A^T_{0,0} is infinity, the buffer manager immediately evicts the tile without flushing its content to disk. As a result, A^T_{0,0} causes a disk write only if it is evicted (or materialized) after the transposition of A_{0,0}. □
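The preemptive eviction rule illustrated in Example 4 can be sketched with a toy model (the class and field names are ours, not PreVision's):

```python
# Sketch of preemptive eviction: when a tile is unpinned and its next_ts
# is infinity (no future reference), it is dropped immediately and, being
# obsolete, never flushed to disk. A normal eviction flushes dirty data.

class Frame:
    def __init__(self, tile, next_ts, dirty=False):
        self.tile, self.next_ts, self.dirty = tile, next_ts, dirty

class BufferPool:
    def __init__(self):
        self.frames = {}    # tile key -> Frame
        self.flushed = []   # tiles written back to disk

    def unpin(self, tile):
        frame = self.frames[tile]
        if frame.next_ts == float("inf"):
            del self.frames[tile]   # evict eagerly, skip the write-back
        # otherwise keep the tile cached for its future reference

    def evict_for_space(self, tile):
        frame = self.frames.pop(tile)
        if frame.dirty:
            self.flushed.append(frame.tile)   # normal eviction flushes

pool = BufferPool()
pool.frames["AT00"] = Frame("AT00", float("inf"), dirty=True)
pool.frames["A10"] = Frame("A10", 26, dirty=False)
pool.unpin("AT00")   # last reference already consumed: evict, no flush
print("AT00" in pool.frames, pool.flushed)   # False []
```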

IMPLEMENTATION DETAILS
This section presents the rationale behind the design choices and the implementation details of PreVision. We begin with memory management in Section 5.1 and then discuss I/O management in Section 5.2.

Dynamic Memory Allocator.
The buffer manager of PreVision deals with tiles of varying shapes and sizes, each of which is loaded into a contiguous memory chunk for higher memory bandwidth. The buffer manager relies on its dedicated dynamic memory allocator to manage the system memory effectively for these variable-length memory chunks. The allocator is capable of allocating memory chunks of different sizes and supports a few primitives similar to the standard malloc() and free() functions for the buffer manager. The memory allocator sets aside all the memory available to it at initialization. On a request from the buffer manager, the memory allocator attempts to find a contiguous memory chunk in its memory pool by applying the best-fit allocation algorithm. The O(n) time complexity of the best-fit search, where n is the number of free chunks, is relatively high, but it helps minimize fragmentation by finding the most suitable chunk in the memory pool. Note that the operating system could be entrusted with the task of handling memory fragmentation as well as memory allocation. However, that approach would incur a large number of system call invocations and hence degrade performance. Thus, we opt for a dedicated memory allocator for the PreVision buffer manager.
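A best-fit allocator over a free list can be sketched as follows. This is an illustrative model only; coalescing of adjacent free chunks, which a production allocator would need, is omitted for brevity:

```python
# Sketch of best-fit allocation over a free list: pick the smallest free
# chunk that fits, splitting off the remainder. The O(n) scan trades
# speed for low fragmentation.

class BestFitAllocator:
    def __init__(self, pool_size):
        self.free = [(0, pool_size)]   # (offset, length) free chunks

    def alloc(self, size):
        fits = [c for c in self.free if c[1] >= size]
        if not fits:
            return None                # caller must evict tiles and retry
        off, length = min(fits, key=lambda c: c[1])   # tightest fit
        self.free.remove((off, length))
        if length > size:              # keep the leftover as a free chunk
            self.free.append((off + size, length - size))
        return off

    def release(self, off, size):
        self.free.append((off, size))  # note: no coalescing in this sketch

a = BestFitAllocator(100)
x = a.alloc(40)        # takes the front of the 100-byte pool -> offset 0
a.release(x, 40)
y = a.alloc(30)        # best fit is the released 40-byte chunk, not the 60
print(x, y)            # 0 0
```

Returning None from alloc() models the case where the buffer manager must evict tiles and retry, as described in the next paragraph.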
The memory allocator is tightly integrated with the buffer manager. A tile eviction from the buffer pool is triggered by the memory allocator when a chunk request from the buffer manager cannot be fulfilled because there is not enough free space in the memory pool. This eviction process may have to be repeated by the buffer manager until the allocator secures a contiguous memory chunk large enough to accommodate the request.

Fig. 6. Processing a tile request on a buffer miss
Unified Buffer Pool.
Most database systems manage buffer frames in multiple pools. For example, Oracle keeps cached pages in a few separate buffer pools according to page size, and SciDB maintains a buffer pool for persistent arrays separately from another for temporary arrays [39]. In contrast, PreVision uses a single unified buffer pool to manage tiles of all sizes for higher utilization of memory. This is particularly relevant in matrix computation tasks, where the volume of intermediate arrays cached in the buffer pool tends to fluctuate rapidly over time. A separate buffer pool for the intermediate arrays would therefore be underutilized, and the overall utilization of memory would be degraded.

Tile Identification.
The buffer manager uses tile keys as a means of identifying tiles. A tile key consists of the name of an array and the coordinates of a tile, and it is included in each tile as metadata. When the buffer manager receives a request for a tile, it can retrieve the future log record of the tile using its tile key. Note that tile keys are created not only for persistent arrays but also for temporary (or intermediate) arrays. The names of temporary arrays are coined internally and used in their tile keys.

Handling a Tile Request.
When it receives a tile request, the buffer manager searches the buffer pool index for the tile key to determine whether the requested tile is cached in the buffer pool. If the tile exists in the buffer pool, the buffer manager applies the update procedure to the skip list as described in Section 3. Otherwise, the buffer manager determines whether the request is for a new tile or for a persistent copy of the requested tile on disk. In both cases, the buffer manager passes the request to the memory allocator so that an empty memory chunk is allocated for the tile. Additionally, in the latter case, the buffer manager works with the I/O manager to load the disk-resident copy into the empty memory chunk. When the copy of the tile is loaded, the buffer manager adds the memory chunk with the loaded data into the buffer pool, updates the buffer pool index, and applies the update procedure to the skip list. Fig. 6 illustrates processing a tile request on a buffer miss.
Depending on its sparsity, a tile, whether memory-resident or disk-resident, is represented by either a single vector or a group of three vectors. A dense tile includes a single vector that stores the cell values sequentially in row-major order. A sparse tile includes three vectors, namely, a data vector, an index vector, and a pointer vector, following the compressed sparse row (CSR) format [37]. The data vector stores the non-zero cell values, and the index vector stores the column indices of these cell values. The pointer vector stores the offsets of the rows into the data vector and the index vector.
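The two layouts can be illustrated with a small pure-Python sketch (a real implementation would use packed binary vectors; the function names are ours):

```python
# Sketch of the two tile layouts: a dense tile as one row-major value
# vector, a sparse tile as CSR (data, index, pointer) vectors.

def to_dense_tile(matrix):
    return [v for row in matrix for v in row]   # row-major value vector

def to_csr_tile(matrix):
    data, index, ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)     # non-zero cell values
                index.append(j)    # their column indices
        ptr.append(len(data))      # offset where the next row starts
    return data, index, ptr

tile = [[5, 0, 0],
        [0, 0, 7],
        [0, 3, 0]]
print(to_csr_tile(tile))   # ([5, 7, 3], [0, 2, 1], [0, 1, 2, 3])
```

Row i of the tile occupies slots ptr[i] through ptr[i+1] of the data and index vectors, which is what makes row-wise access cheap in the CSR format.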

Array File Format.
A disk-resident array is composed of a schema file, a data file, and an index file. The schema file stores metadata of an array, such as the dimensions and sizes of the array itself and of the individual tiles in the array. The data file is segmented such that each tile is stored separately and contiguously in a segment. By storing tiles contiguously in segments, PreVision can take advantage of sequential disk I/O and achieve higher disk throughput. The index file contains a hash table that stores an element for each tile; the hash element of a tile contains the offset and length of the corresponding segment. The index file makes it possible to retrieve the metadata of a given tile quickly.
The data file is prone to having a bloated size because it may suffer from fragmentation caused by variable-length segments. PreVision performs a file-collapsing operation to return fragmented disk space to the operating system when the fragmentation ratio of a file exceeds a certain threshold. The operation keeps the physical size of an array at an appropriate level without copying the tile data.

I/O Operations.
The I/O manager supports simple read, write, and file-collapsing operations that are performed as follows. Fig. 7 shows the overview of the I/O operations as well as the array file format of PreVision.
Read. A read operation is initiated by checking the index file to determine whether the requested tile is in the array. If the tile exists in the array, the I/O manager obtains the offset and length of the tile from the metadata stored in the index file and loads the tile data from the segment into the buffer pool.
Write. When a new tile is added to an array, the I/O manager simply appends a new segment created for the new tile to the end of the data file. A new hash element is also created for the new tile and added to the index file. When an existing tile is updated, the I/O manager compares the size of the existing tile with that of the updated one. If the updated tile is larger than the existing one, then the segment of the existing tile is invalidated and a new segment for the updated tile is appended to the data file. If the updated tile is no larger than the existing one, then the updated tile is overwritten to the existing segment. The index is updated accordingly to reflect the changes. When an existing tile is updated, file fragmentation may occur due to the invalidated segment in the former case and the unused portion of the segment in the latter case. Thus, the file-collapsing operations need to be performed occasionally. Note that the addresses of buffer frames are aligned to 512-byte offsets for all the tiles so that every disk I/O operation can bypass the operating system page cache via Direct I/O.
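The append-versus-overwrite decision described above can be sketched as follows, with a bytearray standing in for the data file; `write_tile` and the tile keys are illustrative, not PreVision's actual API:

```python
def write_tile(data_file, index, key, tile_bytes):
    """Sketch of the write path: append new or grown tiles, overwrite in
    place otherwise. data_file is a mutable bytearray standing in for the
    on-disk data file; index maps tile keys to (offset, length)."""
    entry = index.get(key)
    if entry is None or len(tile_bytes) > entry[1]:
        # New tile, or updated tile larger than its segment: append a new
        # segment at the end of the data file (the old segment, if any,
        # becomes fragmentation to be reclaimed by file collapsing).
        offset = len(data_file)
        data_file.extend(tile_bytes)
    else:
        # Updated tile fits: overwrite the existing segment in place
        # (any unused tail of the segment becomes fragmentation).
        offset = entry[0]
        data_file[offset:offset + len(tile_bytes)] = tile_bytes
    index[key] = (offset, len(tile_bytes))

data_file, index = bytearray(), {}
write_tile(data_file, index, (0, 0), b"aaaa")  # new tile: appended
write_tile(data_file, index, (0, 1), b"bbbb")  # new tile: appended
write_tile(data_file, index, (0, 0), b"cc")    # smaller: overwritten in place
print(index[(0, 0)])     # (0, 2)
print(bytes(data_file))  # b'ccaabbbb'
```

The two leftover bytes of the first segment illustrate the in-place fragmentation that the file-collapsing operation later reclaims.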
File Collapsing. When the degree of file fragmentation exceeds a certain threshold, the I/O manager initiates a file-collapsing operation. The I/O manager examines the index file to locate the fragmented regions within each array file. If fragmentation is detected, the fragmented region is returned to the operating system through a file space allocation interface (e.g., fallocate() on Linux with the FALLOC_FL_COLLAPSE_RANGE flag). A file-collapsing operation may change the offsets of segments, so the I/O manager updates the index entries of the affected segments if necessary.
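The index maintenance that follows a collapse can be sketched as pure bookkeeping; the `collapse` helper below only shifts offsets and deliberately omits the actual fallocate() call:

```python
def collapse(index, hole_offset, hole_length):
    """After a hole [hole_offset, hole_offset + hole_length) is returned to
    the OS (e.g., via fallocate() with FALLOC_FL_COLLAPSE_RANGE), every
    segment located after the hole shifts left by hole_length. The index
    maps tile keys to (offset, length); this is an illustrative sketch."""
    for key, (offset, length) in index.items():
        if offset >= hole_offset + hole_length:
            index[key] = (offset - hole_length, length)

# Two live segments with a 4096-byte hole between them at [4096, 8192).
index = {(0, 0): (0, 4096), (0, 1): (8192, 2048)}
collapse(index, 4096, 4096)
print(index[(0, 1)])  # (4096, 2048)
```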

EVALUATION
To demonstrate the effectiveness of PreVision, we conducted extensive experiments that were focused on answering the following questions.
• How well does PreVision perform matrix computation workloads in comparison with other solutions? (Section 6.2)
• How effective is the preemptive eviction in reducing disk I/O? (Section 6.3)
• How well does the OPT buffer replacement algorithm work? What is the overhead associated with it? (Section 6.4)

We first outline the workloads and the computing platform used for the evaluation in Section 6.1. Then, we present the evaluation results and address the questions in Section 6.2 through Section 6.4.

Experimental Settings
To assess the target systems for matrix computation, three representative workloads were chosen: logistic regression (LR) with batch gradient descent [41], non-negative matrix factorization (NMF) [23], and PageRank [30]. We used the SLAB linear algebra benchmark [41] for the LR and NMF tasks. We also generated several synthetic datasets for both in-memory and out-of-core computation scenarios to better understand the effect of disk I/O on the overall performance of the target systems. Specifically, all the synthetic matrices were tall-and-skinny with 100 columns (i.e., of shape rows × 100). For sparse matrices, the number of rows was fixed to 400 million with varying densities. The population of non-zero cells in the sparse matrices follows a uniform distribution. Table 1a and Table 1b summarize the specifics of the dense and sparse synthetic matrices, respectively. For the PageRank task, we selected four real-world graph datasets: Enron, Epinions, and LiveJournal from the Stanford Network Analysis Project [24], and Twitter [22]. Each dataset was transformed into a sparse adjacency matrix. Table 1c shows the specifics of these sparse matrices. The first three real-world datasets are relatively small, and the target systems can perform the PageRank task with the entire dataset loaded in memory. On the other hand, the Twitter dataset is too large to fit in memory; the memory consumed by the task slightly exceeds the memory budget, which results in disk spills during computation. Note that the target systems may adopt different metadata or formats for sparse matrices (e.g., the Compressed Sparse Column format for MLlib and the Coordinate List format for MADlib), so they may differ from one another in actual memory consumption.
All experiments were carried out with native binary input data for each target system. For the systems using tiled arrays, the SLAB and PageRank matrices were partitioned into 100 × 1 tiles and 10 × 10 tiles, respectively. In the case of SciDB, each side of a chunk for each array was configured to 1000 since SciDB did not allow a chunk side greater than 1024 for matrix multiplication. When executing the LR and NMF tasks, the same synthetic matrices were fed to the target systems so that interference from using input matrices with different random values could be eliminated.
We used a machine running Ubuntu 18.04.5 equipped with an Intel i7-9700K CPU, 32GB of DRAM, and a 1TB Samsung 860 Pro SSD. We conducted every run for evaluation following the steps below.
(1) Clean the OS page cache and the buffer pools of each target system to ensure a cold start.
(2) Run a task with disk-resident arrays on each target system; input arrays are loaded to memory and part or all of them may be spilled to disk.
(3) Write the output to disk.
Each run was allowed up to ten hours and was forcefully terminated at timeout.

Comparison with Existing Systems
We selected the following six target systems for comparison with PreVision: SystemDS [9], MLlib [28], MADlib [17], SciDB [39], NumPy [16], and Dask [36]. We did not include SciPy [44], TensorFlow [1], and PyTorch [31] because they do not support out-of-core computation. The versions of the target systems used in the evaluation are provided in Table 2. Every system capable of leveraging OpenBLAS routines was configured to make use of them. Except for the parallelism evaluation described in Section 6.2.3, each target system was configured to be single-threaded with a single worker for a fair evaluation. The memory budget for each system was configured to 30GB, which was sufficient to keep many tiles in memory simultaneously. For SystemDS and MLlib, Spark was configured to run with a driver process and a single executor process. We did not pin the JVM processes to a specific core, in order to prevent performance degradation caused by garbage collection or just-in-time compilation. We tested a few memory configurations to discover the specific memory budgets for the driver and executor that deliver the best performance. PostgreSQL, which MADlib runs on, recommends configuring the shared buffer size to 40% of the total memory [14]. So, we allocated 12GB to the shared buffer and reserved 18GB for MADlib and the OS page cache. We also created indexes for all the tables used by MADlib.
NumPy was set up to utilize memory-mapped arrays to enable out-of-core computation. Since a NumPy operation returns an in-memory array by default, we created a memory-mapped array manually (with the out keyword argument) for each intermediate array and returned it as output from each operation.
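This setup can be reproduced with a few lines of standard NumPy; np.memmap and the out keyword are the actual NumPy API, while the file names and shapes below are arbitrary:

```python
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()

# Two small memory-mapped input arrays backed by files on disk.
a = np.memmap(os.path.join(tmp, "a.dat"), dtype=np.float64, mode="w+", shape=(4, 4))
b = np.memmap(os.path.join(tmp, "b.dat"), dtype=np.float64, mode="w+", shape=(4, 4))
a[:] = 1.0
b[:] = 2.0

# np.add would return an in-memory ndarray by default; passing a
# memory-mapped array via the `out` keyword keeps the result on disk.
out = np.memmap(os.path.join(tmp, "out.dat"), dtype=np.float64, mode="w+", shape=(4, 4))
np.add(a, b, out=out)
out.flush()  # push dirty pages to the backing file
print(float(out[0, 0]))  # 3.0
```

Calling flush() writes dirty pages back to the backing file, which is how intermediate results spill to disk in this configuration.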

Varying Data Size.
The scalability of the target systems was evaluated with both synthetic and real-world datasets. All three matrix computation tasks are based on iterative algorithms, and the number of iterations was fixed to three in this experiment. This is a very small number of iterations but is still sufficient to observe meaningful differences among the target systems. Sparse NMF tasks were omitted in this experiment because the intermediate matrices of NMF tasks are dense, and the dense matrix computation becomes the dominant factor in the processing time.
Dense LR and NMF. The dense LR and NMF results are presented in Fig. 8a and Fig. 8b, respectively. PreVision outperformed the other systems in most cases, except for the LR results with the 10M and 20M datasets, where PreVision was slightly slower than NumPy. This was due to the small intermediate arrays generated in each iteration of the LR task, which prevented any disk spills; consequently, PreVision did not have a significant advantage over NumPy in this scenario. Meanwhile, PreVision incurred some overheads, such as future log generation and skip list updates, contributing to the slower execution times.
SystemDS presented comparable performance in both LR and NMF with the 10M and 20M datasets. Although PreVision performed better, both SystemDS and PreVision exhibited similar tendencies with respect to the matrix size. However, the elapsed time of SystemDS spiked in both the LR and NMF tasks with the 40M and 80M datasets. This was because SystemDS started to use Spark instructions when the dataset size exceeded the memory budget. The system inserted a checkpoint instruction that persists an RDD for the input matrix, which consumed a significant amount of time.
MLlib exhibited poor performance since it incurred significant disk I/O. Whenever a Spark stage was completed, MLlib spilled processed data to disk and loaded the spilled data to proceed with the next stage. Dask was slower than NumPy in the LR experiments, while it slightly outperformed NumPy in the NMF experiments. The main reason was Dask's scheduling: Dask employs its own scheduling method that tends to reuse recently used chunks, giving it a better buffer hit ratio. Since the NMF tasks produced a huge volume of intermediate data, the better hit ratio saved a large amount of disk I/O for that data. In the LR experiments, on the other hand, Dask performed worse because of its scheduling overhead as well as the much smaller volume of intermediate data.
A notable observation was that both SciDB and MADlib failed with the 40M and 80M datasets. SciDB's gemm() operator relies on ScaLAPACK [7], which assumes that all data are loaded in distributed memory. When gemm() started, SciDB attempted to convert the operand matrices to the ScaLAPACK data format. In the failed cases, the sizes of such data were larger than the memory budget, causing an out-of-memory error. In the case of MADlib, the PostgreSQL executor crashed when attempting matrix multiplication. The MADlib operations consumed so much memory that the operating system killed the executor. We observed the same result even when we reduced the buffer size of PostgreSQL.
Sparse LR. As shown in Fig. 8c, PreVision outperformed the other systems in the LR experiments with sparse datasets. MADlib timed out with the 0.025, 0.05, and 0.1 matrices since its running times exceeded ten hours. SciDB was killed by the operating system in both the 0.05 and 0.1 experiments due to its memory overuse. The sparse matrix multiplication operator of SciDB (i.e., spgemm()) attempts to load all the row chunks of the left-hand side array and all the column chunks of the right-hand side array to compute a single result chunk. When SciDB computed the product of a wide-short matrix and a tall-skinny matrix in the LR tasks, the operator used so much memory that it activated the out-of-memory killer.
PageRank. Fig. 8d presents the elapsed times for running PageRank on each system. PreVision outperformed the others in every case. SystemDS encountered an out-of-memory error on the Twitter dataset due to its overuse of memory. MADlib showed performance comparable to MLlib in the Enron and Epinions experiments, while it was slower with the LiveJournal and Twitter datasets.

Varying Number of Iterations.
To observe the performance trends on a long-term basis, we ran the tasks on the target systems with the number of iterations varying from one to 32. The elapsed times taken by each system for the NMF (10M) and PageRank (Twitter) tasks are shown in Fig. 9. As shown in the figure, the elapsed times of all systems increased linearly with the number of iterations. The differences among them were in the rate (or slope) of change as well as the absolute elapsed times. PreVision was the best performer with respect to both. SystemDS experienced the same out-of-memory error as before in the PageRank experiment. MADlib timed out and was terminated in the PageRank experiment with nine or more iterations. Similar trends were observed in experiments with other datasets.

Varying Degree of Parallelism.
We evaluated the target systems with different degrees of parallelism. The NMF 10M and LR 0.0125 tasks were chosen for this experiment because they were more CPU-intensive than the other tasks. The parallel execution of the target systems was configured as follows. We ran SciDB and MADlib with a varying number of workers. For NumPy and PreVision, we changed the number of threads for a single operation because these systems run in the operator-at-a-time mode. For SystemDS and MLlib, we changed the number of threads for a single worker because this setup yielded the best performance. Fig. 10 shows the elapsed times of the target systems. Although PreVision did not achieve much speedup after the degree of parallelism reached four, it still outperformed the other target systems consistently in all test cases. SciDB experienced an out-of-memory error while processing some of the LR 0.0125 tasks.

Effects of Preemptive Eviction
We conducted another experiment to demonstrate the effectiveness of preemptive eviction. PreVision with the getPos function executes a matrix computation task on a per-tile basis. For the matrix computation task given in Example 1, a matrix multiplication for an output tile can be interleaved with a matrix transposition for another output tile. To accurately evaluate the effect of preemptive eviction on the I/O performance, we ran PreVision with and without preemptive eviction in both the interleaved execution mode and the non-interleaved (or blocking) execution mode. In the blocking execution mode, an operator waits until all the child operators complete their execution and produce the entire sets of output tiles. Note that NumPy works this way in the blocking mode. Fig. 11 presents the I/O volumes of each version when running the LR and NMF tasks with the 80M dense matrix. The getPos mode with preemptive eviction showed the minimum disk I/O volumes among all the versions. We observed two interesting points. First, the versions with preemptive eviction incurred significantly fewer disk writes than the versions without it. In particular, in all cases except the blocking mode with preemptive eviction in NMF, disk writes occurred only for the output arrays. Second, the getPos versions incurred far lower disk read volumes than the blocking versions. This was mainly due to the better temporal locality of the getPos execution mode.
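The essence of preemptive eviction can be sketched in a few lines; the tile names and list-based buffer pool below are illustrative stand-ins for PreVision's internal structures:

```python
def preemptive_evict(buffer_pool, future):
    """Preemptive eviction sketch: drop every buffered tile that has no
    remaining reference in the future log. An obsolete intermediate tile is
    discarded without being written back, saving disk writes (tiles of
    output arrays would still be persisted separately)."""
    evicted = [tile for tile in buffer_pool if tile not in future]
    for tile in evicted:
        buffer_pool.remove(tile)  # dropped eagerly, with no write-back
    return evicted

pool = ["T1", "T2", "T3"]
future_log = ["T2", "T2"]  # only T2 will ever be referenced again
print(preemptive_evict(pool, future_log))  # ['T1', 'T3']
print(pool)                                # ['T2']
```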
Fig. 12 illustrates the tile access patterns of the input matrix when running LR in the getPos and blocking modes. Fig. 12b and Fig. 12d show zoomed-in views within the black rectangles on the left. The blocking mode accessed tiles in sequential order, whereas the getPos mode repeatedly accessed the same tile coordinates before moving on to the other coordinates. Since the experiment with the 80M input matrix was an out-of-core computation, this temporal locality provided the buffer manager with more opportunities for buffer hits, resulting in a reduced read I/O volume for the getPos mode. Fig. 13 illustrates the situation in which an input tile was accessed repeatedly. To compute the final result of the LR equation, the root iterator requested the (0,0) tile of the input matrix. Next, the root iterator requested its right-hand operand. Since this tile had not been computed yet, its iterator requested its own operand tile, and these recursive requests continued until an iterator at the bottom of the chain requested the (0,0) tile of the input matrix again. That is, a buffer request for the (0,0) input tile was made at the top of the chain, and it was made again once the recursive requests were completed. The time between these tile requests was short, increasing the likelihood that the (0,0) tile would remain in the buffer pool.
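The recursive request pattern can be illustrated with a toy pull-based iterator tree; the operator classes and the access log below are our own simplification, not PreVision's getPos implementation:

```python
# A toy pull-based iterator tree mimicking the getPos execution mode: each
# operator requests one tile at a time from its children, so the same input
# tile coordinate is requested again shortly afterwards, while it is likely
# still cached in the buffer pool.
access_log = []

class Scan:
    def __init__(self, name):
        self.name = name
    def get(self, pos):
        access_log.append((self.name, pos))  # a buffer request for this tile
        return 1  # stand-in tile payload

class Multiply:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def get(self, pos):
        return self.left.get(pos) * self.right.get(pos)

x = Scan("X")
# X appears on both sides of the plan (as in the LR task), so computing one
# output tile triggers two requests for the same coordinate back to back.
plan = Multiply(x, Multiply(x, Scan("W")))
plan.get((0, 0))
print(access_log)  # [('X', (0, 0)), ('X', (0, 0)), ('W', (0, 0))]
```

The two adjacent requests for the same X tile are exactly the short-interval re-references that keep the tile resident in the buffer pool.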

Effects of Buffer Replacement
To evaluate the effectiveness of the OPT replacement algorithm and the overheads associated with the skip list and future log, we compared OPT with two additional replacement algorithms, LRU-K [29] and MRU. For LRU-K, K was set to two for all tasks (i.e., LRU-2). Since the number of buffer frames in the pool was relatively small compared with traditional relational systems adopting small fixed-length pages (e.g., an 8KB page size), the retained information period of LRU-2 was ignored. Preemptive eviction was enabled for each replacement algorithm. Fig. 14 shows the elapsed times of executing the LR and NMF tasks on the 80M dense matrix with each replacement algorithm. I/O, List Maintenance, Query Planning, and CPU denote the sum of read and write times, the time for skip list updates and future log retrieval, the query planning time including future log creation, and the computation time, respectively.
PreVision operating under the OPT replacement algorithm took the shortest elapsed times in both the LR and NMF computations. Notably, the MRU algorithm showed almost the same performance as OPT in the LR experiment. In this case, the input matrix required the most time to read. As the input matrix was accessed sequentially, as shown in Fig. 12a, sequential flooding of the buffer pool occurred. In this situation, the best buffer replacement algorithm is MRU [34], which explains its near-comparable performance to OPT. In the NMF experiment, the elapsed time of LRU-2 followed that of OPT, whereas MRU showed poor performance. The time for list maintenance and query planning was negligible in every case. In the LR task under OPT, for instance, these overheads accounted for only 13 and 14 milliseconds, respectively. Table 3 presents the buffer hit ratios for each task under the different replacement algorithms. In the LR experiment, OPT achieved a lower hit ratio than MRU. This discrepancy arose because PreVision uses variable-length buffers to hold tiles, so a buffer hit for a small tile and one for a large tile were each counted equally as one hit. In the NMF experiment, OPT demonstrated the best hit ratio among the replacement algorithms. We also investigated the impact of the number of tiles and its associated overheads. The overheads incurred by the future log and skip list are influenced by the number of tiles. Fig. 15 shows the relationship between smaller tile sizes and the resulting elapsed times. The elapsed times of the second-best performers mentioned in Section 6.2 are also displayed with black lines. The numbers in parentheses on the x-axis indicate the size of a single tile in the input matrix. The results indicate that increasing the number of tiles led to longer elapsed times. However, PreVision consistently outperformed the second-best performers.
Our experiments show that the overheads of the future log and skip list had a trivial impact on the total elapsed times, regardless of the number of tiles. We argue that these overheads are negligible since the default tile size of many data processing systems is typically larger than 20MB. Notably, Spark defaults to a block size of 128 megabytes [13], and SciDB recommends that the block size be at least several megabytes [39].

RELATED WORK
Buffer Replacement Algorithm. The MIN buffer replacement algorithm (or Belady's algorithm) [5] and the OPT buffer replacement algorithm [26] are theoretically optimal, but they are known to be infeasible to implement because they require knowledge of the future. Jain and Lin show how a cache replacement algorithm can learn from Belady's algorithm by applying it to past references to inform future replacement decisions [20]. Simulating the OPT buffer replacement algorithm has also been studied to determine its performance measures and to characterize buffer misses [40]. LRU, which selects the least recently used buffer frame as a victim for replacement, has been the de facto standard buffer replacement algorithm for many database systems. In contrast, MRU selects the most recently used buffer frame as a victim. MRU may not be effective for random references, but it can perform better than LRU for certain database workloads by preventing sequential flooding [34]. The LRU-K algorithm selects the buffer frame having the greatest backward K-distance as a victim for replacement [29]. The distance metric allows the algorithm to consider the frequency of references rather than the recency of references. Buffer replacement algorithms such as 2Q [21], ARC [27], and CAR [3] have also been proposed to overcome the disadvantages of LRU.
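For reference, Belady's victim-selection rule can be stated in a few lines; the sketch below assumes the full future reference string is available, which is exactly what makes the algorithm infeasible in general:

```python
def opt_victim(buffered, future):
    """Belady's rule: evict the buffered page whose next reference lies
    furthest in the future; a page never referenced again wins outright.
    `future` is the remaining reference string. Illustrative sketch."""
    def forward_distance(page):
        try:
            return future.index(page)
        except ValueError:
            return float("inf")  # never referenced again
    return max(buffered, key=forward_distance)

# Pages A and C are referenced again soon; B never is, so B is evicted.
print(opt_victim(["A", "B", "C"], ["C", "A", "C"]))  # B
```

PreVision's future log plays the role of `future` here: because matrix access patterns are deterministic, the "infeasible" knowledge becomes available at planning time.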
While the studies mentioned above assume that all buffer frames in the buffer pool have the same size, another branch of research has considered variable-length buffer frames. The OPT algorithm achieves optimal buffer replacement for fixed-length buffer frames by adopting a simple eviction strategy that selects the buffer frame with the greatest forward distance. For variable-length buffer frames, however, optimal caching is proven to be NP-hard [12]. Many heuristics have been suggested to improve buffer replacement for variable-length buffer frames. One of the popular heuristics is Belady-Size, which evicts the buffer frame with the greatest distance weighted by its size [6]. The GreedyDual-Size-Frequency [2], Hyperbolic [8], and LHD [4] replacement algorithms resort to multiple metrics, variable decaying priorities, and hit density, respectively, for victim selection.
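The Belady-Size heuristic can be sketched similarly; the exact weighting in [6] may differ in detail, so the score below is only a plausible rendition:

```python
def belady_size_victim(buffered_sizes, future):
    """Belady-Size heuristic for variable-length frames: evict the frame
    whose forward distance, weighted by its size, is greatest. A frame
    never referenced again is always the first choice. Illustrative."""
    def score(page):
        try:
            dist = future.index(page) + 1
        except ValueError:
            return float("inf")  # never referenced again
        return dist * buffered_sizes[page]
    return max(buffered_sizes, key=score)

# B is referenced later than A or C and is also large, so B is evicted.
sizes = {"A": 10, "B": 40, "C": 5}
print(belady_size_victim(sizes, ["A", "C", "B", "A"]))  # B
```

Evicting one large, distantly needed frame frees more room than evicting several small ones, which is the intuition behind the size weighting.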
Systems for Matrix Computation. Spark [46] is one of the most widely used data processing frameworks. A Spark query plan is split into stages, each of which is executed by Spark executors. Spark spills every output produced from a stage to disk and reads the spilled data back when it starts processing another stage. MLlib [28] is a machine learning library built on top of Spark, and operations in MLlib are composed of Spark operations. SystemDS [9] is another machine learning framework that leverages various back-ends. Some tasks are delegated to Spark when the standalone mode of SystemDS cannot deal with the data volume. The buffer pool of SystemDS is enhanced to link its host program and Spark. This feature poses the challenge of detecting a buffered object that is no longer needed due to the lazy evaluation of Spark. To work around it, SystemDS recursively checks the descendant RDDs or broadcasts in the lineage of an object upon cleanup. It determines whether the object is still referenced and cleans up the unreferenced object [10].
SciDB [39] is a database system for large-scale arrays. It is known for using tile pipelining in its query execution, which means a tile computed by one operator can subsequently be used by another operator. MADlib [17] is a machine learning system running on a relational database system. A tuple in the system consists of a row vector or a cell value with coordinates, depending on the type of the matrix.
NumPy [43] is a numerical computation package focusing on ease of use. It can access disk-resident arrays via memory-mapped I/O. Likewise, MATLAB [19] supports memory-mapped files to access arrays on disk. Dask [36] supports parallel matrix computations by splitting an array into tiles, each of which is a NumPy matrix. SciPy [44] offers scientific computing functionalities such as sparse matrix computations, complex linear algebra functions, and signal processing. TensorFlow [1] and PyTorch [31] are among the most popular machine learning libraries in diverse fields. TensorFlow aims at distributing large-scale machine learning tasks, while PyTorch enables a high level of flexibility in modeling. Unfortunately, these libraries do not support out-of-core computation for large matrices. Marques et al. exploit cache memory to enable out-of-core computations, but they focus on exploiting parallelism rather than minimizing I/O [25].
Tiling. Tiling is a common method for partitioning an array into smaller subarrays such that data locality is preserved [38]. It has been adopted by numerous systems such as SciDB [39], SystemDS [9], MLlib [28], and Dask [36]. In addition, many studies have focused on reducing disk I/O for processing tiled arrays. For instance, RIOT [47] examines the I/O patterns present in a given program, identifying transformations that minimize I/O operations.
Reducing the Memory Pressure. SuperNeurons introduces liveness analysis to evict obsolete variables from memory [45]. It analyzes the input and output variables of each layer in deep learning tasks to identify objects that will not be used in the next layer, and evicts those obsolete objects after processing the current layer. The liveness analysis is similar to analyzing the future log of PreVision, although the eviction unit of PreVision is a tile, whereas that of SuperNeurons is a tensor object. Operator fusion [11,15] is a method for enhancing performance by merging one operator into another, enabling their joint execution. This fusion technique reduces intermediate data by avoiding unnecessary materialization, and it improves query performance by eliminating redundant scans and computations through sparsity exploitation.
GPU Memory Management. GPU memory management for deep learning tasks has become a prominent subject in the GPU research community as the tasks require ever larger model states. One strategy is GPU-CPU data swapping, which offloads GPU variables to CPU memory when the training states of a deep learning task overrun the memory capacity of the GPU. Many studies focus primarily on overlapping the communication costs between the GPU and other components [33,35,45].
It is worth mentioning that SwapAdvisor [18] takes an approach similar to PreVision: both select a victim for replacement based on future references. However, a key difference lies in the timing of victim selection. While SwapAdvisor selects a victim during an optimization phase performed before runtime, PreVision makes this decision dynamically at runtime. PreVision predicts access patterns from the future log of a given plan and focuses on managing variable-length buffer frames; variable-length tiles and memory fragmentation make the buffer pool content inherently unpredictable and eviction planning more challenging. In contrast, SwapAdvisor uses fixed-sized pages, which keeps the buffer pool content predictable and allows eviction plans to be generated in advance.

CONCLUSION
We present an out-of-core matrix computation system called PreVision with optimal buffer replacement. A matrix larger than the available memory budget is inevitably split into smaller chunks or tiles so that they can be loaded into memory separately for further computations or query processing. Such chunk-by-chunk matrix computation can incur a large number of I/O operations, which would become the dominant cost of any task involving an out-of-core matrix computation. To minimize the I/O overhead, PreVision predicts the entire pattern of tile accesses for a given task by exploiting the deterministic nature of matrix computation algorithms. By referring to the predicted access patterns collected in the future log, the buffer manager can handle buffer replacement optimally. It also adopts the preemptive eviction strategy to eagerly remove obsolete tiles from the buffer pool without causing unnecessary flushing to disk.
We demonstrate that PreVision outperforms the other competing systems in all the out-of-core tasks tested in our experimental evaluations. The significant performance gain of PreVision is attributed to the following factors. First, PreVision enables optimal buffer replacement and thereby reduces the overall I/O overhead. Second, the preemptive eviction strategy of PreVision avoids unnecessary disk write operations. Third, PreVision further reduces I/O operations by taking advantage of access locality. By leveraging the approaches introduced by PreVision, we believe that the disk I/O and memory pressure challenges arising from many large-scale matrix computations can be addressed more effectively.
The buffer replacement of PreVision is optimal only under the assumption that all tiles are given the same weight. Thus, if a larger tile is given a heavier weight than a smaller one, the buffer replacement of PreVision is no longer optimal. For wider applications of PreVision, we also need to address problems such as synchronizing the timestamps of tiles across multiple concurrent tasks, processing operators in parallel, and planning multiple queries globally. We leave these challenges as future work.

Fig. 7. An overview of the I/O management

Fig. 8. Performance of the target systems

Fig. 11. The I/O volumes with execution modes and preemptive eviction (PE)

Fig. 12. The tile access patterns of the input matrix in the getPos and blocking modes

Fig. 13. Temporal locality in the LR task

Table 3. Buffer hit ratios with different replacement algorithms

Mattson et al. introduce a stack algorithm that simulates the OPT buffer replacement with two-pass buffer trace scans [26]. Sugumar et al. propose a more efficient MIN simulation scheme employing a limited look-ahead scan and tree-based stack maintenance [40].