Research Article (Open Access)

TriCache: A User-Transparent Block Cache Enabling High-Performance Out-of-Core Processing with In-Memory Programs

Published: 22 March 2023


Abstract

Out-of-core systems rely on high-performance cache sub-systems to reduce the number of I/O operations. Although the page cache in modern operating systems enables transparent access to memory and storage devices, it suffers from efficiency and scalability issues on cache misses, forcing out-of-core systems to design and implement their own cache components, which is a non-trivial task.

This study proposes TriCache, a cache mechanism that enables in-memory programs to efficiently process out-of-core datasets without requiring any code rewrite. It provides a virtual memory interface on top of the conventional block interface to simultaneously achieve user transparency and sufficient out-of-core performance. A multi-level block cache design is proposed to address the challenge of per-access address translations required by a memory interface. It can exploit spatial and temporal localities in memory or storage accesses to render storage-to-memory address translation and page-level concurrency control adequately efficient for the virtual memory interface.

Our evaluation shows that in-memory systems operating on top of TriCache can outperform Linux OS page cache by more than one order of magnitude, and can deliver performance comparable to or even better than that of corresponding counterparts designed specifically for out-of-core scenarios.


1 INTRODUCTION

NVMe [45] Solid-State Drives (SSDs) have drawn a wide range of interest because of their high I/O performance. The U.2 interface [48] and PCIe 4.0 standard [47] have also increased the storage density of NVMe SSDs in recent years. For instance, a dual-socket commodity server can mount an array of more than 16 NVMe SSDs to provide tens of terabytes of storage capacity, tens of millions of random IOPS, and dozens of gigabytes per second of bandwidth while being 20 to 40 times cheaper than Dynamic Random Access Memory (DRAM).

Although NVMe SSD arrays can improve the aggregated performance and capacity of the system, they still suffer from block-wise I/O accesses and have latencies at least 100× longer than those of DRAM. To efficiently process datasets that are significantly larger than available memory, out-of-core systems rely on cache sub-systems to maintain frequently operated data in memory. I/O operations can be merged or skipped on cache hits, bridging the performance gap between DRAM and SSDs.

Page cache [46] is a cache sub-system in modern Operating Systems (OS) that manages data on the granularity of pages (typically 4KB) across DRAM and SSDs. It enables in-memory applications to support out-of-core processing on SSDs without requiring any rewrite through swapping [44] or memory mapping [43] based on virtual memory.

However, current implementations of page cache encounter issues related to scalability and performance on cache misses owing to global locking on internal data structures [27]. Recent literature [25, 32, 55] indicates that the heavy I/O stack, page faults, and context switching overheads also limit kernel swapping and I/O performance on fast storage devices such as NVMe SSD arrays.

Therefore, data-intensive applications such as databases and data processing systems [4, 10, 13, 18, 19, 20, 52, 54] usually design and implement their own user-space block caches (also known as buffer managers) that manage data by blocks (typically of a fixed size that is a multiple of the physical sector size). In contrast to OS page cache, block cache reduces context switching overhead by running mainly in the user space, and supports customization in terms of tuning block sizes and replacement policies to further improve performance.

Nevertheless, designing and implementing block caches and upper-level components imposes expensive development costs. Existing block caches in the user space usually ask users to explicitly acquire/release blocks [10, 18] or manipulate data through an asynchronous interface [54]. Developers often have to redesign and re-implement the entire system according to the API requirements of the block cache, which is non-trivial. To fill the gap between out-of-core performance and development costs, we investigate a new general cache mechanism that can transparently extend in-memory systems for efficient out-of-core processing on NVMe SSDs without requiring any manual modification.

Efficient kernel-bypass I/O stacks, such as SPDK [50], can achieve good out-of-core performance in the user space by avoiding expensive kernel I/O operations, taking advantage of the high IOPS of the NVMe SSD array. This inspires us to explore a user-space solution that eliminates the overhead of page faults and context switching. A user-space solution is cross-platform, easy to deploy and customize, and avoids introducing the potential security vulnerabilities that kernel modifications can cause.

To ensure user transparency, a virtual memory interface is needed to bridge the semantic gap between the fine-grained memory accesses of existing in-memory programs and block-wise I/O operations on physical block devices, much as the virtual memory provided by the OS does. Such an interface makes it possible for in-memory software to run on NVMe SSDs without modification, provided that memory accesses can be automatically redirected to the block cache.

Several challenges need to be addressed to implement a user-space block cache with a virtual memory interface. First, the cache system requires good scalability to achieve high out-of-core performance, so that it can exploit the hundreds of CPU cores and tens of millions of SSD IOPS available today. Second, it requires an efficient address translation mechanism that looks up the in-memory addresses of cached blocks, to provide a virtual memory interface with fine-grained accesses; such fine-grained accesses and per-access address translations pose a much greater challenge than block lookups in current user-space block caches. Third, it requires a scheme that redirects the memory accesses of existing in-memory systems to the cache system without manual modification.

To address the preceding challenges, we make the following contributions:

  • We build a scalable block cache based on a concurrency mechanism named Hybrid Lock-free Delegation, which combines message-passing-based delegation with lock-free hash tables. It can saturate the NVMe SSD array with only a few server threads.

  • We design a two-level Software Address Translation Cache (SATC) to support fast address translation in the user space, replacing the human effort of writing block-aware code with an automatic mechanism that exploits locality at runtime. SATC can accelerate software address translation by orders of magnitude.

  • We propose a pure software-based scheme to supervise memory accesses based on compile-time instrumentation and library hooking techniques. Existing in-memory applications can efficiently run on NVMe SSDs through the block cache without requiring code modification.

Based on these techniques, we design and implement a user-transparent block cache providing a virtual memory interface, named TriCache. Our results with applications from various domains show that TriCache enables in-memory programs to efficiently process out-of-core datasets without requiring manual code rewrites. TriCache can outperform the OS page cache by more than an order of magnitude, and can often match or even exceed the performance of specialized out-of-core systems.


2 BACKGROUND AND MOTIVATION

In this section, we briefly introduce the two types of general caches that can be used for out-of-core processing, OS page cache and user-space block cache, and use a motivating example to show the benefits as well as the challenges of a new approach that combines the advantages of both.

The page cache is a transparent cache for pages originating from storage devices [46]. Modern OSes keep the page cache in unused portions of main memory. Some accesses to storage devices can be handled by the page cache to improve performance. The page cache is implemented in kernels through virtual memory management and is mostly transparent to applications. Users can use a memory-mapping system call [43] to map a file to a segment of virtual memory, or rely on swapping [44] to swap pages out to and in from disks on demand, thus accessing storage just like memory.

The memory interface of the page cache provides maximal user transparency for developing out-of-core applications [7, 24]; however, its use can lead to severe performance bottlenecks, especially on cache misses when the backing storage is an array of high-performance NVMe SSDs. This results from various factors, including but not limited to its global locking in the kernel, the heavy I/O stack, page faults, and context switching overheads [25, 27, 32, 55]. Although some studies have attempted to modify the kernel to improve the performance of the page cache [25, 26, 27, 37, 41], it is challenging to apply such modifications in the kernel space, which may introduce portability and security issues.

To this end, most out-of-core systems design and implement their own block caching components in the user space to mitigate and even eliminate the preceding issues. Like the OS page cache, a block cache manages a pool of pages in memory, and loads/evicts pages from/to disks upon user requests. The major difference is that the block cache runs mostly in user space and provides a block interface. Users first pin the blocks to be accessed in memory, then read/write data in corresponding blocks, and finally invoke unpin to mark the blocks that can be evicted or flushed to storage later, when needed according to the replacement policy [13, 18]. There are also some other forms of the block interface, such as asynchronous read/write routines with user-defined callbacks [54]. Block cache may be further customized for better performance according to the needs of the application. For example, it is unnecessary to support writing blocks back to the storage if cached contents are known to be read-only [10].

Although an efficient and scalable block cache can make full use of storage devices in terms of performance, its block interface requires a considerable amount of work to be put into use. Figure 1 illustrates this with a concrete example: calculating the length of a string. The left part presents the implementation by using a memory interface, and the right part shows an alternative version with a block interface. It is evident that the block version is far more complex than the memory version because system developers have to take care of more details, such as checking the block boundaries and making pin/unpin calls manually, whereas the memory version only needs to perform memory accesses. For example, FlashGraph [54], a semi-external memory graph processing framework designed for SSD arrays, contains more than 40K lines of code. In contrast, Ligra [35], an in-memory graph processing framework, takes less than 9K lines. If out-of-core systems with high performance could be programmed like in-memory code, developers could explore more data structures and algorithms suitable for out-of-core processing instead of spending time on implementing block-aware systems. Meanwhile, with longer and more complex code, there are more maintenance overheads and also more potential bugs.

Fig. 1.

Fig. 1. Out-of-core implementations of strlen with memory (left) and block (right) interfaces.
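To make the contrast concrete without reproducing Figure 1, the following is a minimal sketch of what the block-interface version of strlen might look like, assuming a hypothetical pin/unpin API (here backed by a plain in-memory array so the example is self-contained):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical block-cache API (illustrative, not TriCache's actual API).
// A real block cache would load blocks from SSDs; here "storage" is a
// plain array so the sketch compiles and runs on its own.
constexpr std::size_t BLOCK_SIZE = 16;     // tiny blocks to force crossings
static char storage[4 * BLOCK_SIZE];

const char* pin(std::size_t block_id) {    // return in-memory block address
    return storage + block_id * BLOCK_SIZE;
}
void unpin(std::size_t /*block_id*/) {     // would decrement a reference count
}

// strlen via the block interface: the caller must track block boundaries
// and pair every pin with an unpin -- bookkeeping the memory interface hides.
std::size_t strlen_block(std::size_t addr) {
    std::size_t len = 0;
    for (;;) {
        std::size_t block_id = addr / BLOCK_SIZE;
        std::size_t offset = addr % BLOCK_SIZE;
        const char* base = pin(block_id);
        for (std::size_t i = offset; i < BLOCK_SIZE; ++i) {
            if (base[i] == '\0') { unpin(block_id); return len; }
            ++len;
        }
        unpin(block_id);
        addr += BLOCK_SIZE - offset;       // continue in the next block
    }
}
```

The memory-interface version, by contrast, is just a loop over a pointer; all of the boundary checks and pin/unpin pairing above disappear.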

This motivates us to explore a user-space block cache providing a virtual memory interface, which can combine the advantages of both types of caches: high out-of-core performance and high user transparency. The user-space approach drops some functional capabilities of the OS page cache, such as sharing memory across processes with consistency guarantees. However, it allows us to redesign the cache sub-system for new high-performance storage. Although the virtual memory interface forces applications to manipulate the cache synchronously and manage data in fixed-size blocks (rather than objects or rows), such an interface enables user transparency and saves developers considerable effort.

However, a user-space block cache with a virtual memory interface is not as easy to build as it might appear. Every memory access now involves a pair of pin and unpin calls to ensure that the accessed data reside in memory, and pin and unpin imply storage-to-memory address translation and concurrency control operations.1 We therefore need optimizations beyond those in current block cache designs to make pin/unpin as fast as possible.


3 DESIGN AND IMPLEMENTATION OF TRICACHE

In this section, we first present an overview of the system design of TriCache, then describe its efficient multi-level block cache runtime in a bottom-up manner, including how to build a scalable block cache and reduce the cost of cache accesses in the user space to support transparent usage. Finally, we introduce how to automatically apply TriCache to in-memory applications via compiler techniques.

3.1 Overview of TriCache

Figure 2 shows the high-level architecture of TriCache. It consists of an LLVM compiler plugin and a runtime module.

Fig. 2.

Fig. 2. High-level architecture of TriCache.

The TriCache LLVM Compiler Plugin first instruments each memory instruction, such as load and store, in the user application code, inserting a software address translation call (named get_raw_ptr) before the memory instruction. Upon execution, every time the instrumented binary accesses a storage address, it calls this interface to retrieve a memory address pointing to the data cached in memory. The translated address is then used as usual by the memory instruction.
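The source-level effect of this instrumentation can be sketched as follows (the real plugin rewrites LLVM IR, not C++; get_raw_ptr here is a stand-in backed by a toy translation table, not TriCache's runtime):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy stand-in for the runtime's translation: storage address -> cached
// in-memory address. The real get_raw_ptr consults the SATC and pins blocks.
static std::unordered_map<std::uintptr_t, char*> toy_translation;

char* get_raw_ptr(std::uintptr_t storage_addr) {
    return toy_translation.at(storage_addr);
}

// Original code:      int v = *p;
// Instrumented code:  int v = *(int*)get_raw_ptr((uintptr_t)p);
int read_int_instrumented(std::uintptr_t p) {
    return *reinterpret_cast<int*>(get_raw_ptr(p));
}
```

The load itself is unchanged; only its address operand is routed through the translation call, which is what lets unmodified in-memory code address storage-resident data.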

TriCache Runtime is the core of TriCache (the dashed box in Figure 2). It is a multi-level block cache that supports fast address translation and provides a virtual memory interface. It implements get_raw_ptr to translate blocks to their corresponding cached memory addresses, manages the in-memory data cache for recently accessed blocks, handles I/O operations when the cache misses, and evicts blocks when the cache is full. Table 1 lists the multi-level caches of TriCache, including their APIs, managed data, cache mapping scheme, and access patterns for threads.

Table 1. Multi-level Cache Design of TriCache

Name         | API         | Data Managed | Cache Mapping   | Access Pattern
-------------|-------------|--------------|-----------------|----------------------------
Direct SATC  | get_raw_ptr | Addresses    | Direct-mapped   | Thread-local / fiber-local
Private SATC | pin / unpin | Addresses    | Set-associative | Thread-local
Shared Cache | pin / unpin | Block data   | Set-associative | Shared by multiple threads

In the implementation of get_raw_ptr, TriCache Runtime introduces a two-level SATC on top of the conventional block cache (Shared Cache in Figure 2). The first level is a direct-mapped Direct SATC, and the second level is a set-associative Private SATC (under the three User Threads in Figure 2). They serve purposes similar to those of the hardware TLB and accelerate address translations for hot blocks. We implement them as thread-local metadata caches for storage-to-memory address mappings. Direct SATC is responsible for efficient translation when operating on the most recently used entries, whereas Private SATC aims to provide sufficient entry caching capacity and merge inter-thread operations. Meanwhile, the SATC employs a pin/unpin protocol (as mentioned in Section 2) to implement an inclusive two-level metadata cache. A pin operation performs address translation and prevents the block from being selected as a victim before the pairing unpin. TriCache deploys SATC to automatically exploit localities in running programs for address translation and reduce the cost of runtime API calls, rather than relying on manual programming against blocks to reduce the number of API calls and amortize the runtime overheads. Direct SATC and Private SATC operate only on cache metadata, for example, modifying the reference count of blocks and translating block IDs to memory addresses, leaving cache data untouched and thus avoiding additional memory copies.

Below SATC, TriCache Runtime manages data with Shared Cache (in the middle of Figure 2, gray background). Shared Cache is a full-featured block cache shared by multiple threads that maintains an in-memory cache pool for reading and writing the underlying storage. It manages a block table for all in-memory blocks and serves address translations when SATC misses. The block table exposes a pin/unpin interface to SATC as well, with the guarantee that recently used data pinned by SATC are not swapped out to external storage. To prevent scaling bottlenecks introduced by locking, the block space is partitioned, and each partition is owned by a single thread. Message passing based delegation is used to render critical operations (including block replacements and I/O accesses) single-threaded and lock-free. Moreover, Shared Cache can use kernel-bypass I/O stacks to eliminate context switching for I/O operations.

For Shared Cache, we propose a Hybrid Lock-free Delegation based concurrency control scheme. First, we distinguish between address translations and data accesses. Only address translations call pin/unpin remotely through message passing, whereas data accesses directly manipulate memory and rely on the CPU cache to ensure data consistency. The cached data are thus stored only in the Shared Cache and directly accessed by threads without any redundant memory copies. Second, we design and implement the per-partition block table as a concurrent lock-free hash table to further reduce inter-thread message passes. With this concurrent block table, only pinning operations that are missed in Shared Cache require a synchronous remote call.

In Figure 3, we present an example of a user program, a follow-up of Figure 1, instrumented by and then running with TriCache. The C program is first compiled to LLVM IR (Intermediate Representation) with Clang, with the memory read compiled to a load instruction. TriCache LLVM Compiler Plugin instruments the load instruction into two operations: one calls get_raw_ptr to retrieve the translated memory address, and the other loads the cached data. Upon execution, the get_raw_ptr call of TriCache Runtime results in an address translation operation sequence. If Direct SATC hits, the result is returned; otherwise, it pins the corresponding block in Private SATC. If the block is found in Private SATC, the result is returned; otherwise, it pins the corresponding block in Shared Cache. If Shared Cache is holding the block, the pin operation finds the memory address of the cached block in the concurrent block table. Otherwise, as invoked by a remote call, Shared Cache reads the block from storage and loads it into memory.

Fig. 3.

Fig. 3. An example of a user program running on TriCache.
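The lookup cascade that get_raw_ptr performs can be sketched as a three-level inclusive cache (a simplified illustration: reference counting, eviction, and message passing are omitted, and an array allocation stands in for the SSD read):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy three-level translation cascade mirroring Direct SATC -> Private SATC
// -> Shared Cache -> storage. Names and structure are illustrative only.
struct ToyTriCache {
    std::unordered_map<std::uint64_t, char*> direct, priv, shared;
    int storage_reads = 0;

    char* load_from_storage(std::uint64_t bid) {
        ++storage_reads;                  // stands in for an SSD read
        return new char[4096]();          // freshly cached block
    }

    char* get_raw_ptr(std::uint64_t bid) {
        if (auto it = direct.find(bid); it != direct.end())
            return it->second;            // Direct SATC hit: fastest path
        char* p;
        if (auto it = priv.find(bid); it != priv.end()) {
            p = it->second;               // Private SATC hit
        } else if (auto it = shared.find(bid); it != shared.end()) {
            p = it->second;               // Shared Cache hit (block table)
        } else {
            p = load_from_storage(bid);   // miss everywhere: read the block
            shared[bid] = p;
        }
        priv[bid] = p;                    // inclusive: promote upward
        direct[bid] = p;
        return p;
    }
};
```

A second access to the same block returns from the first level without touching the lower levels or storage, which is exactly the locality the SATC is designed to capture.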

3.2 Shared Cache

As the core module of TriCache, Shared Cache determines TriCache’s throughput, especially its I/O performance. Therefore, good scalability is the primary design goal of Shared Cache for the effective use of hundreds of CPU cores, tens of NVMe SSDs, and millions of IOPS.

Design Decisions. Figure 4(a) shows a straightforward design used by the current Linux kernel. It uses a global lock to protect the block table (or page table) and the cache. However, the single lock leads to heavy lock contention and is difficult to scale for high-performance storage devices [27]. The sharding technique can help mitigate the scalability issue, as shown in Figure 4(b). The block cache [10, 27] can use a predefined function (usually hashing) to partition the blocks into several shards and then assign a lock to each shard. In addition, recent work proposes that well-designed delegation based on message passing can provide better scalability and hotspot tolerance than locks [21, 31, 53] on NUMA (Non-Uniform Memory Access) architectures.

Fig. 4.

Fig. 4. Different designs and concurrent mechanisms for the shared block cache.

Therefore, we propose a Hybrid Lock-free Delegation for Shared Cache of TriCache, as shown in Figure 4(c). The Shared Cache adopts a client-server model based on message passing (solid lines in Figure 4(c)). Each client-server pair shares a lightweight message queue with a size of two cache lines, similar to ffwd [31]. Each user thread corresponds to a client, and several dedicated servers handle requests from clients. Each server is single-threaded, lock-free, and only responsible for managing a part of the blocks (e.g., partitioned by hashing block IDs). Multiple partitions and servers can achieve concurrency and scalability, and more servers can be added when a higher throughput is desired. In addition to message passing based delegation, clients can directly access per-partition block tables on cache hits to reduce server-side CPU consumptions (dashed lines in Figure 4(c); more details are provided in the following Client-side Fast Paths on Cache Hits discussion).

Metadata-only Delegation. When a user thread accesses block data, TriCache divides the block access into a metadata operation and a data operation. Metadata operations include address translations, reference count management, and evict policy enforcement. Data operations are memory accesses, such as load and store. In TriCache, only metadata operations are processed by servers through delegation, whereas clients issue data operations by themselves.

A block is accessed in three stages. In the first stage, the client asks the server to cache the block in memory (pin) and translate the storage address to its address in memory. The server updates the metadata of the requested block, reads uncached blocks, and evicts unused blocks, without touching the actual data. In the second stage, after receiving the response, the client directly performs its memory access on the translated memory address. In the last stage, the client notifies the server that the block has been released and can be evicted later by the server (unpin).

This design eliminates redundant memory copies between servers and clients. Servers focus on metadata operations, so a few server threads can achieve good performance. Meanwhile, it helps TriCache provide the same consistency and atomicity guarantees as memory, which is necessary for user transparency and compatibility with in-memory applications. The CPU directly executes data operations on the client side via memory instructions, and cache coherence is ensured by hardware. TriCache only needs a memory fence to ensure that updates are visible before evicting modified blocks.
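The three stages, from the client's perspective, can be sketched as follows (single-threaded and greatly simplified; server_pin stands in for the synchronous message to the server, and only metadata crosses that boundary):

```cpp
#include <cassert>
#include <cstring>
#include <unordered_map>

// Illustrative metadata-only delegation: the "server" owns the block table
// (translation + reference counts); the client touches the cached bytes
// directly. Real TriCache runs these on separate threads via message passing.
struct Entry { char* mem = nullptr; int refcnt = 0; };
static std::unordered_map<long, Entry> server_block_table;

char* server_pin(long bid) {                 // stage 1: metadata only
    Entry& e = server_block_table[bid];
    if (!e.mem) e.mem = new char[4096]();    // miss: load block into memory
    ++e.refcnt;                              // block cannot be evicted now
    return e.mem;                            // translated memory address
}
void server_unpin(long bid) {                // stage 3: release for eviction
    --server_block_table[bid].refcnt;
}

int client_read_int(long bid, int offset) {
    char* mem = server_pin(bid);             // stage 1: delegated pin
    int v;
    std::memcpy(&v, mem + offset, sizeof v); // stage 2: direct data access
    server_unpin(bid);                       // stage 3: delegated unpin
    return v;
}
```

Note that the block's bytes never pass through the server: the only values exchanged are a block ID and a translated address.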

Client-side Fast Paths on Cache Hits. We propose using concurrent block tables to avoid server-side synchronizations on cache hits. A client first tries to directly find a block in the block table and update the block reference counts (number of clients in use) by using atomic operations. If it succeeds, the client can translate the address from the concurrent block table by itself, thus skipping synchronous message passing. The client then sends an asynchronous message if this direct operation changes the reference count from 0 to 1 or conversely, to notify the server to update the evict policy for the block. Multiple asynchronous messages can be batched and processed together to amortize message passing overheads.
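The fast path can be sketched as an atomic bump of the reference count with fallback to the server; names and the two-field layout are illustrative (a production version would likely pack validity and the count into one atomic word to close the window between the two operations):

```cpp
#include <atomic>
#include <cassert>

// Illustrative client-side fast path for pin. On success the client has
// pinned the block without synchronous message passing; notify_server tells
// it whether an asynchronous 0 -> 1 message must still be sent so the server
// can remove the block from the evictable set.
struct BlockEntry {
    std::atomic<bool> valid{false};   // false while being swapped in/out
    std::atomic<int> refcnt{0};       // number of clients using the block
};

bool try_fast_pin(BlockEntry& e, bool& notify_server) {
    if (!e.valid.load(std::memory_order_acquire))
        return false;                 // miss or in flight: take the slow,
                                      // synchronous remote-pin path instead
    int old = e.refcnt.fetch_add(1, std::memory_order_acq_rel);
    notify_server = (old == 0);       // 0 -> 1 transition: async notification
    return true;
}
```

Symmetrically, a fast unpin would decrement the count and notify the server asynchronously on the 1 -> 0 transition, and such messages can be batched as described above.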

Workflow and Implementation. Figure 5 shows the workflow and implementation of Shared Cache. Each user thread corresponds to a Shared Cache Client and gets its unique Client ID. Each partition has a polling-based message passing server that processes requests sent from clients and returns results to them. In addition, each partition maintains a concurrent block table for blocks cached in memory, and each entry in the block table stores the Block ID (BID) of the block, the Memory ID (MID) of its in-memory cache, and its metadata (Meta). The metadata include whether the block is available, whether it has been modified, and the reference count of the clients currently using it. We use a compact hashed block table similar to that of Yaniv and Tsafrir [51], in which each entry occupies only 8 bytes on average, and uncached blocks do not occupy memory.
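An 8-byte entry holding all three fields can be sketched with bitfields; the field widths below are illustrative only (the actual layout follows the compact hashed table of Yaniv and Tsafrir [51], where part of the BID is implied by the table position):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative packed block-table entry: BID, MID, and metadata in one
// 64-bit word. Field widths are assumptions, not TriCache's real layout.
struct PackedEntry {
    std::uint64_t bid    : 36;  // block ID tag bits
    std::uint64_t mid    : 20;  // memory ID: slot index in the memory pool
    std::uint64_t valid  : 1;   // usable by the client-side fast path?
    std::uint64_t dirty  : 1;   // must be written back before eviction?
    std::uint64_t refcnt : 6;   // clients currently using the block
};
static_assert(sizeof(PackedEntry) == 8, "entry fits in one 64-bit word");
```

Keeping the whole entry in one word is also what makes the client-side atomic operations on it practical.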

Fig. 5.

Fig. 5. Shared cache of TriCache.

The cached blocks are indexed by their Memory IDs and stored in a Memory Pool. Meanwhile, an Evict Policy tracks all cached blocks with a zero reference count, as they can be safely evicted. Every time the cache is full, the Evict Policy chooses and evicts one or more blocks based on its statistics and strategies. TriCache uses the CLOCK algorithm [39] by default, and the policy is replaceable, allowing users to customize it based on their application characteristics. In addition, policy implementations in TriCache are completely single-threaded, so users do not need to consider any concurrency issues. At the bottom, an asynchronous I/O backend (Storage Backend) manages pending I/O requests in an I/O Queue, continuously polling and processing I/O operations. The I/O backend is also customizable and defaults to SPDK, which is backed by user-space NVMe drivers. A kernel-space alternative based on Linux AIO is also supported.

When a user thread operates on a block, its client (N in the figure) first computes the Partition ID (M in the figure) by a predefined partition function. The client then searches for a valid block entry from the block table of Partition M, and if such an entry exists, the client tries increasing the reference count by using atomic operations. If the atomic operations succeed on cache hits (Cache-hit Fast Path in Figure 5), the client pins the block in memory and can directly query the memory address of the cached block. The client may further send an asynchronous request to the server if it is updating the reference count from 0 to 1, or the converse. The server then performs the corresponding actions according to the Evict Policy, such as enabling or disabling evictions of the block.

If the atomic operations fail on cache misses or when the block is being swapped in/out, the client requests a remote operation via synchronous message passing, immediately releases CPU resources, and waits for responses from the server (Remote Operation in Figure 5). After receiving the request, the server creates a block table entry and sets its valid bit to false. If the block table is full, the server evicts blocks according to the Evict Policy and sets their valid bit to false. The server then appends I/O operations for new blocks and the evicted blocks to the I/O queue. The I/O Backend processes the I/O requests by polling and controlling NVMe SSDs to perform DMA operations directly on the Memory Pool. Additionally, the server sets valid bits to true once the I/O requests have been processed, and it sends the memory addresses of the blocks to clients via message passing. After receiving the response, the client resumes and performs its memory accesses.

We use a micro-benchmark on a 128-core machine to test the effectiveness of TriCache Shared Cache. TriCache Shared Cache can scale linearly to 256 threads (1/8 of the threads are servers), reaching 96.8M ops/s, and the hybrid mechanism provides an improvement of 52% compared with the delegation-only approach.

The preceding discussion focuses on architectures that use only threads for parallel processing. The message passing design can perform even better in parallel architectures with stackful coroutines (also known as fibers). With several coroutines multiplexed on a common thread, the message passing module of TriCache can send or receive multiple operations together to amortize the overheads of inter-thread synchronization. Further, on a cache miss, the client must wait for the server to return the requested address before resuming execution. Under the pure threading architecture, TriCache has to release the CPU to the OS and switch between user threads to utilize these waiting CPU resources. When the thread manages multiple coroutines, TriCache can stay in the user space and perform lighter coroutine switches instead of thread switches through the OS, further reducing CPU consumption for the user's application.

3.3 Software Address Translation Cache

The Shared Cache of TriCache provides scalable I/O performance and an efficient set-associative cache. However, block table lookups and atomic operations are required for each access on cache hits, still limiting the performance of TriCache.

Guiding Ideas. Considering the manual use of the block cache (e.g., Figure 1), users call the pin interface to get the in-memory address for a block and then use the memory address to perform multiple operations; they finally call the unpin interface to release the block. Multiple read and write operations can be performed between a pair of manual pin and unpin operations to reduce the number of cache lookup operations. Users manually take advantage of data locality while investing extra effort in development.

In contrast, we design TriCache to automatically exploit locality and simulate such manual coding without requiring human effort. We propose to build a two-level SATC on top of Shared Cache. The higher-level cache stores hotter data, providing faster access but smaller capacity than the lower-level cache, similar to the multi-level cache of the CPU and hierarchical storage [30, 49, 56]. Based on this idea, we now show how to implement the multi-level cache in software and where to divide the levels.

SATC Design. In our design, only the last-level cache manages data, and higher-level translation caches manage only metadata, such as modifying the reference counts of the blocks and translating block IDs to memory addresses. Managing metadata instead of data can help avoid redundant memory consumption, additional memory copies, and memory consistency issues caused by the multi-level design. The multi-level cache of TriCache is designed to be an inclusive cache, which means that all blocks in the higher-level cache are also present in the lower-level cache. With this inclusive policy, higher-level caches need to only interact with their next level. Moreover, TriCache guarantees that the capacity of higher-level caches is no greater than the lower-level cache, thus eliminating out-of-space errors from the lower-level cache when the higher-level cache requests to swap in blocks.

On top of the Shared Cache, we build a thread-local set-associative cache called Private SATC. When the Private SATC hits, the user thread uses its thread-local block table and evict policy; only when the Private SATC swaps blocks in or out does the user thread need to operate on the Shared Cache. Private SATC is intended to reduce Shared Cache operations for its thread's hot data. Examples include thread-local hot data when each thread computes a segment of data independently, and hot elements shared by all threads when processing skewed data. Private SATC also helps reduce concurrent block table operations that involve cross-NUMA memory accesses and false sharing, which can incur roughly three to eight times the latency of local memory accesses in our evaluation.

We further build a direct mapping cache called Direct SATC on top of the Private SATC to alleviate overheads due to hash table lookups and evict policy maintenance. Direct SATC maintains a few recently accessed pages in a fixed-size array to speed up address translation to a few bitwise operations and avoid having to update the evict policy for each access. The goal of Direct SATC is to cover multiple consecutive operations on the hottest blocks, such as sequential reads and writes, and displace manually written pin/unpin operations.
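A Direct SATC lookup reduces to a shift, a mask, and one tag compare; a minimal sketch (sizes and names are illustrative, not TriCache's actual parameters):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative direct-mapped Direct SATC: a fixed-size array of recently
// accessed translations, indexed by the low bits of the block ID.
constexpr std::uint64_t BLOCK_BITS = 12;          // 4 KB blocks (assumed)
constexpr std::uint64_t NUM_SLOTS = 8;            // a few hottest blocks
struct Slot { std::uint64_t bid = ~0ull; char* base = nullptr; };
static Slot direct_satc[NUM_SLOTS];

// Returns the translated in-memory address, or nullptr on a Direct SATC
// miss (the real runtime would then fall through to Private SATC's pin).
inline char* direct_lookup(std::uint64_t storage_addr) {
    std::uint64_t bid = storage_addr >> BLOCK_BITS;
    Slot& s = direct_satc[bid & (NUM_SLOTS - 1)]; // direct-mapped index
    if (s.bid != bid) return nullptr;             // tag mismatch: miss
    return s.base + (storage_addr & ((1ull << BLOCK_BITS) - 1));
}
```

Because hits bypass both the hash table lookup and the evict-policy update, sequential runs over a hot block cost only these few bitwise operations per access.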

Meanwhile, Direct SATC also fits parallel architectures built on stackful coroutines (also called fibers). With TriCache, each coroutine has a fiber-local Direct SATC, coroutines scheduled within a thread share a thread-local Private SATC, and all threads share the Shared Cache. When multiple coroutines are scheduled on the same thread, Direct SATC automatically ensures that the hottest blocks of each coroutine are not swapped out by Private SATC, which reduces re-fetching blocks from the Shared Cache when resuming coroutine execution.

Implementation. As shown in Figure 6, we implement an inclusive multi-level cache with pin and unpin interfaces. Blocks in Direct SATC must exist in Private SATC, and blocks in Private SATC must also be present in Shared Cache. Private SATC calls the pin interface of Shared Cache to load blocks into the Private SATC, increasing their reference counts to ensure that Shared Cache does not swap them out. When blocks are evicted from Private SATC, it calls the unpin interface of Shared Cache to release the reference count. On top of Private SATC, Direct SATC uses a similar scheme but provides a single get_raw_ptr interface that implicitly combines a pin call with a subsequent unpin call: it caches a block and translates the block ID into the block's raw address in memory. The raw address is valid only until the next get_raw_ptr call because the subsequent access can evict any previous block from Direct SATC and possibly call the unpin interface of Private SATC or Shared Cache.
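The pin/unpin composition behind get_raw_ptr could be sketched as follows; `LowerCache` and the single-slot bookkeeping are simplifying assumptions meant only to illustrate why a returned raw pointer is invalidated by the next call:

```cpp
#include <cstdint>

// Stand-in for the next cache level's pin/unpin interfaces; pin returns the
// block's in-memory address and keeps it resident until unpin.
struct LowerCache {
    int pins = 0;
    char storage[4096] = {};
    char* pin(std::uint64_t /*block_id*/) { ++pins; return storage; }
    void unpin(std::uint64_t /*block_id*/) { --pins; }
};

struct DirectSATCSketch {
    LowerCache* lower;
    std::uint64_t cur_block = UINT64_MAX;  // the one block currently pinned

    // get_raw_ptr = unpin the previous block + pin the requested one, so the
    // previously returned raw pointer becomes invalid here.
    char* get_raw_ptr(std::uint64_t block_id) {
        if (cur_block != UINT64_MAX) lower->unpin(cur_block);
        cur_block = block_id;
        return lower->pin(block_id);
    }
};
```

At any moment exactly one block stays pinned per slot, so the lower levels are free to evict anything else.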


Fig. 6. Software Address Translation Cache.

The right-hand side of Figure 6 presents the state machine maintained in TriCache. Starting from the Disk Only state, a user thread loads blocks into Shared Cache, Private SATC, and Direct SATC by calling get_raw_ptr. When Direct SATC evicts blocks, Shared Cache and Private SATC still hold them. When the last thread using a block evicts it from its Private SATC, the block enters the Shared Cache Only state. Any get_raw_ptr call reloads the block into all three cache levels. If Shared Cache also evicts the block, it is removed from the in-memory cache, written back to storage if dirty, and ends in the Disk Only state.

In the default configuration, the aggregated capacity of Private SATC entries equals that of Shared Cache entries, the largest possible Private SATC under the inclusive policy. Maximizing the Private SATC size lets it hold as many Shared Cache entries as possible, which helps TriCache reduce the number of inter-thread synchronizations. Additionally, each Direct SATC is 1/4 the size of the Private SATC on the same thread. These SATC sizes are configurable to provide better performance under different workloads.

Our evaluation shows that SATC can improve performance by tens of times over the Shared Cache alone on real-world workloads. When SATC absorbs all accesses, TriCache reaches \(57\%\) and \(91\%\) of in-memory performance for purely random and nearly sequential access patterns, respectively, making it practical to operate the block cache at per-access granularity.

3.4 Compile-time Instrumentation

With the help of SATC, TriCache opens up the opportunity to provide a virtual memory interface and become fully transparent to users. To this end, we propose a purely user-space scheme based on compile-time instrumentation and library hooking techniques.

Memory Layout. We first modify the memory layout as shown in the upper part of Figure 7. In Linux, the current x86_64 virtual memory layout (with four-level page tables) consists of three main parts: user space takes the 47-bit region at the beginning, kernel space occupies the 47-bit region at the end, and most of the space in between is a hole of non-canonical virtual memory. We map TriCache-managed disks, at block granularity, into this unused hole, starting from 0x800...000, as TriCache-space virtual memory. Memory addresses in TriCache-space can be translated into actual user-space memory addresses by calling the get_raw_ptr interface of TriCache Direct SATC.


Fig. 7. Memory layout and LLVM instrumentation.

Instrumentation. To enable transparent read and write operations on top of the TriCache block cache, we perform a translation before each memory operation (e.g., load, store, and atomic operations) so that TriCache-space addresses can be used just like user-space memory. With compile-time instrumentation, TriCache requires no manual code modifications. The lower part of Figure 7 shows pseudo-code for instrumenting load and store instructions in LLVM IR. The instrumentation uses the highest bit of an address to determine whether it refers to user-space memory or TriCache-space virtual memory.
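Conceptually, the check inserted before a load could look like the following C++ sketch; `in_tricache_space`, `instrumented_load`, and the `translate` callback are illustrative stand-ins for the generated IR and the SATC lookup:

```cpp
#include <cstdint>

// TriCache space starts at 0x800...000, so the highest address bit is the tag.
constexpr std::uint64_t kTriCacheBit = 1ULL << 63;

inline bool in_tricache_space(std::uint64_t addr) {
    return (addr & kTriCacheBit) != 0;
}

// What an instrumented 8-byte load conceptually does; 'translate' is a
// placeholder for the SATC get_raw_ptr path mapping a TriCache-space address
// to a real user-space address.
template <typename Translate>
std::uint64_t instrumented_load(std::uint64_t addr, Translate translate) {
    const std::uint64_t* p = in_tricache_space(addr)
        ? static_cast<const std::uint64_t*>(translate(addr))
        : reinterpret_cast<const std::uint64_t*>(addr);
    return *p;
}
```

User-space addresses take the untranslated fast path, so ordinary in-memory accesses pay only a single branch.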

Although instrumentation provides the virtual memory interface for TriCache-space memory, we still need to determine which data should be placed in TriCache-space memory. First, data on the stack need not enter the block cache. We perform a dataflow analysis from LLVM alloca instructions to eliminate unnecessary instrumentation and overhead for the stack. Second, we set a runtime threshold so that only memory allocations larger than the threshold belong to TriCache-space memory; small chunks of data, usually short-lived temporaries, remain allocated in memory. Finally, TriCache supports limiting the total size of data allocated in memory with a predetermined memory quota: if in-memory data exceed the quota, TriCache takes over subsequent allocations. Users can adjust these runtime parameters to balance memory usage against performance.
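The placement decision a hooked allocator could make might be sketched as follows; `AllocPolicy` and its fields are hypothetical names, and the threshold/quota values used here are illustrative:

```cpp
#include <cstddef>

// Hypothetical placement policy for a hooked allocator: large allocations go
// to TriCache space; small ones stay in DRAM until the quota is exhausted.
struct AllocPolicy {
    std::size_t threshold;      // allocations >= threshold go to TriCache space
    std::size_t quota;          // cap on total in-memory allocation
    std::size_t in_memory = 0;  // bytes currently placed in DRAM

    bool use_tricache(std::size_t size) {
        if (size >= threshold) return true;          // large, likely long-lived
        if (in_memory + size > quota) return true;   // DRAM quota exhausted
        in_memory += size;                           // small and within quota
        return false;
    }
};
```

Raising the threshold or quota keeps more data in DRAM for performance; lowering them trades performance for a smaller memory footprint.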

Implementation. Figure 8 illustrates the compiling workflow of TriCache. The user application code is compiled by an LLVM-based compiler. The TriCache LLVM Plugin instruments the code and generates instrumented LLVM IR bitcode. The plugin performs instrumentation after all the optimization passes, so it does not interfere with compiler optimizations on applications, such as automatic vectorization. Additionally, TriCache supports vector instructions because it leaves the actual memory operations to the CPU.


Fig. 8. Compiling workflow of TriCache.

Then, the bitcode is linked with the precompiled TriCache runtime (including the get_raw_ptr implementations). TriCache forces inlining of the cache-hit paths of Direct SATC and Private SATC through link-time optimization (LTO) to avoid intensive function call overheads.

The TriCache runtime also contains APIs on top of the virtual memory interface for manual optimizations, including pin and unpin. Optionally, users can optimize bottlenecks of their applications through these APIs, for example with block-wise accesses and prefetching, while leaving the remaining parts to support out-of-core processing transparently via instrumentation. In the TriCache runtime, common utility functions such as memcpy and memset are already implemented manually with block-wise pin and unpin to avoid per-byte address translation overheads in these common components.
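A block-wise memcpy of this kind might look like the following sketch, which pins each block once instead of translating every byte; the `Cache` type and its in-memory backing are stand-ins for the TriCache runtime interfaces:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

constexpr std::uint64_t kBlockSize = 4096;  // TriCache's default block size

// Stand-in for the TriCache runtime: pin returns a block's in-memory address,
// unpin releases it. Here the "storage" is just an in-memory array.
struct Cache {
    char backing[4 * kBlockSize] = {};
    char* pin(std::uint64_t block_id) { return backing + block_id * kBlockSize; }
    void unpin(std::uint64_t /*block_id*/) {}
};

// Copy n bytes starting at block-space offset src_off into dst, pinning each
// covered block once instead of translating every byte.
void blockwise_copy(Cache& cache, char* dst, std::uint64_t src_off, std::size_t n) {
    while (n > 0) {
        std::uint64_t block_id = src_off / kBlockSize;
        std::uint64_t in_block = src_off % kBlockSize;
        std::size_t chunk =
            static_cast<std::size_t>(std::min<std::uint64_t>(n, kBlockSize - in_block));
        char* base = cache.pin(block_id);  // one translation per block
        std::memcpy(dst, base + in_block, chunk);
        cache.unpin(block_id);
        dst += chunk; src_off += chunk; n -= chunk;
    }
}
```

A copy spanning a block boundary thus costs two translations rather than one per byte.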


4 EVALUATION

We set up our experiments on a dual-socket server equipped with two AMD EPYC 7742 CPUs (64 physical cores and 128 hyper-threads per CPU) and 512GB DDR4-3200 main memory. The storage devices are eight PCIe-attached Intel P4618 DC SSDs that provide 51.2TB capacity, 9.6M 4KB read IOPS, and 3.9M 4KB write IOPS in total. The server runs Debian 11.1 with Linux kernel 5.10 and uses Clang 13.0.1 to compile TriCache and other systems.

In our evaluation, we limit the total available DRAM capacity with cgroups to evaluate out-of-core performance. For TriCache and other systems with block caches, we ensure that the overall memory usage stays below the intended memory limit by adjusting the cache sizes. The SSDs are configured in SPDK mode for TriCache and as raw block devices for swapping. If a system requires a single filesystem, we construct a software RAID-0 with mdadm and use the XFS filesystem. TriCache launches 16 background threads (bound to eight cores) for Shared Cache and uses 4KB blocks by default. Additionally, the total number of threads is chosen to maximize performance by searching over powers of 2 starting from the number of hardware threads (i.e., 256).

We first evaluate TriCache on four representative domains in terms of end-to-end performance: graph processing (Section 4.1), key-value store (Section 4.2), big-data analytics (Section 4.3), and transactional graph database (Section 4.4). In these experiments, we focus on the end-to-end performance of TriCache and examine different workloads running in memory as well as out of core by limiting the size of the available memory for TriCache.

We then conduct a micro-benchmark, by using a configurable number of threads that issue load/store instructions. We adjust the hit rates and access patterns to explore circumstances in which TriCache outperforms OS page cache and to assess whether the design of TriCache provides a reasonable tradeoff between in-memory (i.e., cache hit) and out-of-core (i.e., cache miss) performance (Section 4.5).

Finally, we use a series of breakdown experiments to evaluate performance-related factors of TriCache, including the hit rates and hit latency of SATC, the number of threads, I/O backends, evict policies, page sizes, SATC sizes, and threads versus fibers (Section 4.6).

4.1 Performance on Graph Processing

Experimental Setup. Graph processing is a demanding workload for cache systems due to the many small and random accesses on large datasets. We transparently apply TriCache to Ligra [35], an in-memory graph processing framework, and extend it to out-of-core execution by hooking malloc memory allocations, without manual modification. The baselines are Ligra with OS swapping and FlashGraph [54], an efficient semi-external-memory graph processing framework designed for SSDs.

Both Ligra and FlashGraph use 32-bit vertex IDs, and we force Ligra to use push mode to align with FlashGraph. For FlashGraph, we follow its recommended configuration of creating an XFS filesystem for each SSD block device and binding each device to the corresponding NUMA node. Since FlashGraph is a semi-external-memory graph engine that always stores vertex states in memory and edge lists on SSDs, we make TriCache manage at least the edge lists in its cache for a fair comparison.

We evaluate FlashGraph, Ligra on swapping, and Ligra on TriCache with three common graph algorithms: PageRank (PR), Weakly Connected Components (WCC), and Breadth-First Search (BFS). The dataset is a real-world graph, UK-2014 [5, 6], with 788M vertices and 47.6B edges; Ligra requires more than 400GB of memory to process it in memory.

In-memory Performance. Figure 9 shows the computation time of FlashGraph, Ligra on swapping, and Ligra on TriCache under different memory quotas. With 512GB of memory, Ligra can process all three algorithms in memory, and TriCache and FlashGraph can buffer all data in their cache. Under this setting, TriCache incurs overheads of only \(34.4\%\) for PR, \(64.0\%\) for WCC, and \(23.5\%\) for BFS. The in-memory performance shows that TriCache can provide efficient address translations and cache hits with its virtual memory interface, owing to the two-level SATC. Meanwhile, TriCache outperforms FlashGraph by \(6.08\times\), \(3.85\times\), and \(2.06\times\), respectively, when the working set can be cached in memory. It illustrates that FlashGraph yields much higher in-memory overheads than TriCache because the block cache of FlashGraph involves redundant memory copies on cache operations with its read/write interfaces.


Fig. 9. Computation time of FlashGraph, Ligra on swapping, and Ligra on TriCache (lower is better).

Out-of-core Performance. Under a 256GB memory limit, the caches start swapping blocks/pages in and out. Compared with its in-memory performance, Ligra on TriCache saves about half of the memory while retaining \(47.7\%\) of the performance on PR, \(12.5\%\) on WCC, and \(78.4\%\) on BFS. Additionally, TriCache's speedups over OS swapping and FlashGraph are \(6.30\times\) and \(5.31\times\) on PR, \(26.1\times\) and \(1.46\times\) on WCC, and \(0.85\times\) and \(2.05\times\) on BFS, respectively.

As the usable memory decreases further, I/O efficiency becomes the main factor affecting performance. For example, with 64GB of memory, TriCache performs \(19.3\times\), \(38.3\times\), and \(26.8\times\) better than swapping. In this case, the average I/O bandwidth of TriCache exceeds 12GB/s, and the peak I/O performance reaches 4.8M IOPS and 18GB/s of bandwidth. The gap between peak and average I/O performance shows that the Bulk Synchronous Parallel model used by Ligra limits I/O performance, and better performance could potentially be achieved with asynchronous execution models. Compared with FlashGraph, TriCache still provides improvements of \(54.8\%\) and \(58.3\%\) on PR and WCC, respectively, whereas Ligra with TriCache is \(34.3\%\) slower than FlashGraph on BFS. This is because FlashGraph adopts two-dimensional partitioning for out-of-core graph processing, resulting in a \(50.1\%\) cache hit rate that reduces I/O volume by \(2.68\times\) compared with TriCache. Still, TriCache delivers an average I/O bandwidth \(1.78\times\) higher than FlashGraph and thus narrows the performance gap.

It is noteworthy that the semi-external-memory FlashGraph cannot fit the vertex states of PR and WCC within 16GB of memory, leading to out-of-memory errors, whereas TriCache can process the same dataset fully out of core.

The preceding results indicate that TriCache can extend an in-memory graph framework to support out-of-core processing without manual modification and can deliver performance comparable to a well-designed external memory framework. Meanwhile, TriCache outperforms OS swapping by up to \(38.3\times\) while providing the same user transparency.

4.2 Performance on Key-Value Stores

Experimental Setup. Key-value stores manage large amounts of data and require cache systems to buffer hot data in memory. We use RocksDB [10], a persistent key-value store widely used in production systems, for the evaluation in this part. RocksDB organizes on-disk data in immutable sorted sequence tables. It provides a block-based table format on top of its user-space block cache, and a plain table format optimized for in-memory performance via mmap. We use TriCache to buffer RocksDB plain tables, without manual modification, by hooking RocksDB's mmap calls, and compare it with plain tables on OS memory-mapped files and block-based tables on RocksDB's own cache.

We use the mixgraph [8] (prefix-dist) workload proposed by Facebook, which models production use cases at Facebook and emulates real-world workloads of key-value stores with hotness distribution and temporal patterns. The keys and values are 48 and 43 bytes on average, respectively, and there are \(83\%\) reads, \(14\%\) writes, and \(3\%\) scans. We generate 2B key-value pairs (consuming 180GB of space) and execute 100M operations. Both plain and block-based tables use the hash index with a 4-byte prefix. We increase the sharding number of the RocksDB block cache to 1024 to avoid lock contentions on our 256-thread server and use the direct I/O mode for the RocksDB block cache. We also disable WAL to prevent log flushing from becoming a performance bottleneck.

In-memory Performance. Figure 10 illustrates the throughput of Plain Tables on TriCache, Plain Tables on mmap, and Block-based Tables on the RocksDB user-space cache with different memory quotas. In memory, RocksDB Plain Tables with mmap provides the best performance, which is 4.28M ops/s. TriCache reaches about \(53.5\%\) throughput of mmap and \(73.7\%\) throughput of the RocksDB block cache.


Fig. 10. RocksDB throughput with varying memory quotas.

Out-of-core Performance. When RocksDB runs out-of-core, TriCache brings performance improvements of two to three orders of magnitude over mmap. Plain Tables with TriCache outperforms Block-based Tables by \(6.69\times\) under 128GB of memory, \(10.8\times\) with 64GB, \(10.9\times\) with 32GB, and \(10.0\times\) with 16GB. Part of TriCache's benefit comes from the efficient I/O stack of SPDK, and the excellent scalability of Shared Cache is another key factor. For example, the RocksDB block cache can deliver a throughput of 122K ops/s with 256 threads. However, our eight NVMe SSDs require about 1024 in-flight I/O requests to maximize I/O performance. Unfortunately, when the number of threads is increased from 256 to 1024, the throughput gradually drops: with 1024 threads, RocksDB provides only \(71.3\%\) of its 256-thread throughput. In contrast, RocksDB throughput on TriCache improves by \(2.15\times\) from 256 to 1024 threads.

The in-memory performance indicates that user-transparent TriCache can provide similar performance as manually managed block cache in RocksDB. Meanwhile, TriCache has the potential to help existing systems with in-memory backends, such as RocksDB with Plain Tables, to achieve better out-of-core performance without any manual modifications.

4.3 Performance on Big-Data Analytics

Experimental Setup. TeraSort [38] is a representative application and an important performance indicator in the domain of big-data analytics [14]. Its typical distributed or out-of-core implementation consists of a shuffle phase followed by a sort phase. The shuffle phase produces parallel sequential reads and writes, which is I/O bound [14] and can stress sequential I/O throughput on cache systems. The sort phase requires the cache to buffer the working partition in memory and issues a vast number of string comparisons and copies that can examine the runtime overhead of cache systems.

We generate two TeraSort workloads, 1.5B records (about 150GB) and 4B records (about 400GB). For TriCache, we first use the parallel sort based on multi-way merge sort in GNU libstdc++ [36] (named GNU Sort) to implement an out-of-core TeraSort, which requires only a single function call. We also implement a shuffle-based parallel sort that partitions records by the first byte of their keys (named Shuffle Sort), which takes 15 additional lines of C++ code. Compared with the multi-way merge sort, the shuffle-based parallel sort mainly issues sequential read/write I/O operations, making it friendlier to out-of-core processing. We use TriCache to manage memory allocations during sorting and compare TriCache with OS swapping. For Shuffle Sort, we configure TriCache's page size to 128KB to maximize sequential I/O performance. We also use Spark [52], a widely used big-data framework that supports both scale-up and scale-out processing, as a baseline.
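As a rough illustration of the Shuffle Sort idea, the following single-threaded sketch partitions records by the first key byte and then sorts each partition; the real implementation parallelizes both phases and operates on TeraSort records rather than plain strings:

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Shuffle phase: bucket records by the first byte of the key, which turns the
// bulk of the I/O into sequential writes. Sort phase: sort each bucket
// independently; concatenating buckets in byte order yields a global order.
std::vector<std::string> shuffle_sort(const std::vector<std::string>& records) {
    std::array<std::vector<std::string>, 256> buckets;
    for (const auto& r : records)  // shuffle phase
        buckets[static_cast<unsigned char>(r.empty() ? 0 : r[0])].push_back(r);
    std::vector<std::string> out;
    out.reserve(records.size());
    for (auto& b : buckets) {      // sort phase, one bucket at a time
        std::sort(b.begin(), b.end());
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}
```

Because each bucket fits independently in the cache, only the working partition needs to be resident during the sort phase.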

In-memory Performance. Figure 11 shows the computation time of TeraSort. On the 150GB dataset, both GNU Sort and Shuffle Sort occupy about 300GB of memory and fit in 512GB. In this case, Shuffle Sort is \(2.01\times\) faster than GNU Sort. Meanwhile, the overheads of TriCache amount to only \(14\%\) for GNU Sort and nearly zero (less than \(1\%\)) for Shuffle Sort, because Shuffle Sort mainly generates sequential reads and writes on each thread, which are well handled by the thread-local Direct SATC and Private SATC. Compared with Spark, GNU Sort and Shuffle Sort on TriCache are faster by \(1.55\times\) and \(3.62\times\), respectively.


Fig. 11. Computation time for TeraSort workloads with different memory quotas (lower is better).

Out-of-core Performance. When the memory quota is less than 256GB for the 150GB workload, TriCache can provide tens of times speedups over swapping, up to \(39.3\times\) for GNU Sort at 128GB of memory and \(57.8\times\) for Shuffle Sort at 64GB of memory. Meanwhile, the performance of Shuffle Sort with TriCache is up to \(20.2\times\) better than Spark at 32GB of memory.

For the 400GB dataset, both algorithms execute out-of-core throughout. Shuffle Sort on TriCache is up to \(43.6\times\) faster than swapping at 32GB of memory and outperforms Spark by up to \(13.7\times\) with the same amount of memory. GNU Sort on swapping is \(41.2\times\) slower than TriCache Shuffle Sort with 512GB of memory, and about \(128\times\) slower when the memory quota is less than 256GB, because of its less out-of-core-friendly algorithm and the limited performance of the OS page cache.

Compared with in-memory processing of the 150GB dataset, Shuffle Sort with TriCache saves \(90\%\) of the memory at 32GB, whereas its processing time is only \(49.3\%\) longer than at 512GB. We also compare distributed Spark with TriCache-based scale-up solutions, using four servers with the same hardware configuration connected by 200Gb HDR Infiniband NICs. On the 400GB workload, TriCache under 32GB of memory outperforms in-memory distributed Spark by \(7.20\times\) with Shuffle Sort and \(1.33\times\) with GNU Sort. Thus, TriCache with NVMe arrays can use less memory while providing nearly in-memory performance for TeraSort. In addition, TriCache can nearly saturate the peak bandwidth of our eight NVMe SSDs, reaching 44GB/s for read-only operations and 31GB/s for mixed read/write operations.

In summary, developers can write in-memory programs (e.g., less than 20 lines of C++ code for Shuffle Sort), and TriCache then helps them to fully utilize the high-performance NVMe SSD array, especially when the algorithm is friendly to out-of-core processing.

4.4 Performance on Graph Database

For graph database workloads, we evaluate TriCache on LiveGraph [57], an efficient transactional graph database based on OS memory-mapped files. LiveGraph treats memory-mapped files as in-memory data and relies on atomic memory accesses and cache consistency to support transactional queries, which makes it a good test of whether a user-transparent block cache can provide the same semantics as in-memory operations. We replace the memory-mapped files with TriCache and compare it with the original LiveGraph. We evaluate their performance on the LDBC SNB interactive benchmark, which simulates user activities in a social network and consists of 14 complex-read queries, 7 short-read queries, and 8 update queries. As the SNB driver occupies part of the memory, we limit LiveGraph to at most 256GB of memory and generate two workloads, the SF30 and SF100 datasets, which take about 100GB and 320GB of memory with LiveGraph, respectively. SNB clients issue 1.28M operations for the SF30 workload and 256K operations for the SF100 workload during the benchmark run.

Figure 12 shows the SNB throughputs of LiveGraph on TriCache and mmap. When the dataset can fit into 256GB of memory, the instrumentation and user-space cache of TriCache incur only \(21\%\) runtime overheads on the SNB benchmark. As the memory quota gradually decreases, the advantage of TriCache becomes increasingly prominent—for example, TriCache outperforms mmap by \(12.4\times\) at 32GB of memory as its scalable Shared Cache and the efficient I/O backend supply much higher throughputs. For SF100, LiveGraph keeps running in out-of-core states. TriCache improves the throughput by \(5.48\times\) compared with mmap at 256GB of memory, and the speedup can grow up to \(10.5\times\) at 16GB of memory.


Fig. 12. Throughput of LiveGraph on TriCache and mmap.

We then take a closer look at the latency metrics when running SF100 with 256GB of memory. TriCache cuts the average latency (geometric means) by \(11.5\times\) on complex queries, \(1.79\times\) on short queries, and \(21.1\times\) on update queries. The P999 tail latency of TriCache remains \(10.9\times\) lower than mmap's on complex queries and \(1.35\times\) lower on short queries. Meanwhile, TriCache shortens the P999 latency of update queries by \(34.6\times\) compared with the original LiveGraph because TriCache is aware of thread locality while mmap is not. Although TriCache and mmap are both user-transparent, the Private SATC of TriCache automatically keeps recently updated data of writer threads in memory even while the writers are waiting for group commits, whereas mmap may evict these dirty pages under memory pressure. This design helps LiveGraph reduce tail latencies on update operations.

4.5 Micro-benchmarks

We conduct two custom multi-threaded micro-benchmarks that issue random memory-load instructions. The first generates random 8-byte accesses (named the 8B Random workload), stressing the systems with completely random memory accesses. We control the random pattern to generate operations with different block cache hit rates, and we also adjust the hit rate of Private SATC to examine its performance impact. The second randomly chooses 4KB pages and sequentially accesses each page in 8-byte words (named the 4KB Random workload) to evaluate performance when a page is accessed multiple times.
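The inner loop of the 8B Random workload might resemble this sketch; the region size, iteration count, and RNG are illustrative choices, not the benchmark's actual parameters:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Each benchmark thread issues 8-byte loads at uniformly random offsets and
// accumulates the values so the loads are not optimized away.
std::uint64_t random_8b_reads(const std::uint64_t* region, std::size_t words,
                              std::size_t iters, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < iters; ++i)
        sum += region[rng() % words];
    return sum;
}
```

Running this loop over a TriCache-space region exercises a fresh address translation on nearly every access, which is the worst case for the SATC fast path.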

We compare TriCache with Linux mmap and FastMap [27] (both given a hint of random accesses, MADV_RANDOM). FastMap optimizes the mmap path in the Linux kernel, including sharding locks (discussed in Section 3.2) and batching TLB invalidations, mainly aiming to mitigate the scalability limitations of Linux mmap. For FastMap, we downgrade the kernel to version 4.14, on which FastMap relies, configure a RAID-0 with mdadm as suggested by the FastMap authors, and let FastMap manage bare block devices.

Figure 13 shows TriCache’s speedup compared with mmap and FastMap with 8B Random and 4KB Random workloads.


Fig. 13. The speedup of TriCache over mmap and FastMap on 8B Random and 4KB Random workloads.

For 8B Random workloads, the performance of TriCache is about \(11\%\) of the in-memory (mmap) performance in the worst case, when the memory hit rate is \(100\%\) but Private SATC hardly hits. Under the same \(100\%\) memory hit rate, when the hit rate of Private SATC grows to \(100\%\), TriCache attains \(57\%\) of the in-memory performance, an improvement of up to \(6.07\times\). This demonstrates that the SATC design improves address translation performance, which in turn provides higher random access performance.

Once the memory hit rate drops to \(90\%\), TriCache provides improvements of \(18.6\times\) to \(31.5\times\) over mmap, whose performance is severely limited by lock contention in the kernel. At the same time, TriCache outperforms FastMap by \(1.22\times\) on average. As the memory hit rate decreases, the advantage of TriCache becomes increasingly significant: for instance, TriCache outperforms mmap by \(33.6\times\) and FastMap by \(3.34\times\) at an \(80\%\) hit rate. When the memory hit rate reaches \(10\%\), TriCache performs \(45.0\times\) to \(47.2\times\) better than mmap and \(5.38\times\) to \(5.60\times\) better than FastMap. In this case, TriCache delivers 12.4M random accesses per second and fully saturates our eight NVMe SSDs. By contrast, FastMap supports only 2.22M accesses per second even with all hardware cores, equivalent to roughly the I/O performance of only two NVMe SSDs. This indicates that FastMap cannot keep up with current high-performance NVMe SSD arrays because it still suffers from the heavy kernel I/O stack, page faults, and context switching overheads [25, 32, 55].

For 4KB Random workloads, Direct SATC absorbs most of TriCache's in-memory overheads because sequential accesses within the same page are handled with very lightweight address translation. The performance of TriCache reaches \(84\%\) to \(91\%\) of the in-memory (mmap) performance when the memory hit rate is \(100\%\), showing that SATC efficiently improves TriCache's in-memory performance by optimizing address translation. With a \(90\%\) memory hit rate, TriCache provides an average speedup of \(8.43\times\) over mmap and \(1.46\times\) over FastMap. Under a \(10\%\) memory hit rate, TriCache provides 12.3M random accesses per second, outperforming mmap by \(43.1\times\) and FastMap by \(5.08\times\) on average.

4.6 Performance Breakdown

In this section, we use a series of breakdown experiments to evaluate some performance-related impacts and configurations of TriCache, including hit rates and hit latency of SATC, number of threads, I/O backends, evict policies, page sizes, SATC sizes, and threads and fibers. For the breakdown experiments, we select five cases under 64GB of memory that are running out of core: PR, RocksDB, Shuffle, and GNU Sort for the 400GB TeraSort dataset, and LiveGraph for the SNB SF100 workload.

Breakdown Analysis of SATC. The three columns of Table 2 break down the performance impact of SATC by gradually removing SATC levels from TriCache: W/O Direct disables Direct SATC, W/O Private disables Private SATC, and Shared Only removes both SATC levels, leaving only the Shared Cache.

             W/O Direct      W/O Private     Shared Only
PageRank     2.75\(\times\)  1.03\(\times\)  40.1\(\times\)
RocksDB      1.27\(\times\)  1.02\(\times\)  22.0\(\times\)
Shuffle Sort 1.87\(\times\)  4.67\(\times\)  10.1\(\times\)
GNU Sort     2.51\(\times\)  4.25\(\times\)  57.9\(\times\)
LiveGraph    1.07\(\times\)  1.01\(\times\)  7.55\(\times\)

Table 2. Performance Slowdown by Removing SATC Levels

According to the performance degradation listed in Table 2, SATC is an essential component contributing to the good performance of TriCache. The slowdown that occurs by disabling SATC (Shared Only) is \(20.8\times\) on average for the five cases. Even when the memory quotas are less than 1/5 of the working set (i.e., running out-of-core), SATC still yields a speedup of \(40.1\times\) for PR, \(10.1\times\) for Shuffle Sort, and \(57.9\times\) for GNU Sort.

Meanwhile, both Direct SATC and Private SATC are indispensable to TriCache. Without Direct SATC, the performance of PR is degraded by \(2.75\times\) because accessing each edge incurs a heavy overhead due to hash table lookups and evict policy maintenance. However, PR is not sensitive to Private SATC because the size of the dataset is more than \(5\times\) larger than the available memory, and the edges are visited only once for each iteration. For the shuffle phase of Shuffle Sort, the performance drops by \(5.39\times\) without Private SATC but remains almost the same (only \(4.8\%\) slower) without Direct SATC. The reason is that string copies constitute the bottleneck in the shuffle phase and are optimized by the compiler to memcpy, which is implemented by manually calling pin/unpin in the TriCache runtime. For GNU Sort, removing Direct SATC and Private SATC degrades the performance by \(2.51\times\) and \(4.25\times\), respectively.

Multi-level Cache in TriCache. Next, we use PR, Shuffle Sort, and GNU Sort to further examine the design of the multi-level cache in TriCache. Table 3 lists the miss rates for each level of the cache, the average hit cycles for Direct SATC and Private SATC, and the average access cycles for Shared Cache.

             Direct SATC              Private SATC             Shared Cache
             Miss Rate  Hit Cycles    Miss Rate  Hit Cycles    Miss Rate  Access Cycles
PageRank     0.003      52.6          0.063      321           0.626      2.36M
Shuffle Sort 0.001      63.0          0.001      162           0.969      1.68M
GNU Sort     0.045      143           0.007      488           0.926      789K

Table 3. Miss Rate and Average Cycles on Each Cache Level

According to the miss rates listed in Table 3, Direct SATC and Private SATC handle most memory accesses. The miss rate of Direct SATC is less than \(5\%\) for all three workloads, and the miss rate of Private SATC is less than \(1\%\) for Shuffle Sort and GNU Sort. These results show that SATC covers most accesses, which explains the performance reported above.

Additionally, the hit cycles of Direct SATC and Private SATC in Table 3 show that the software address translation of TriCache is quite efficient. Direct SATC hits in PR and Shuffle Sort take approximately 50 cycles on average, Direct SATC hits in GNU Sort and Private SATC hits in Shuffle Sort take about 150 cycles, and Private SATC hits in GNU Sort take about 450 cycles. For comparison, we list some hardware latencies: 50 cycles is close to a NUMA-local L3 cache hit or L2 cache false sharing within a NUMA node, 150 cycles is about half of a NUMA-local memory access, and 450 cycles is less than a cross-NUMA memory access or cross-NUMA cache false sharing. Therefore, TriCache with SATC is efficient enough to provide a virtual memory interface while delivering memory-comparable performance.

Performance and Numbers of Threads. We also compare the performance of TriCache and baselines under different numbers of threads for PR and RocksDB with 64GB of memory. More precisely, “the performance under a given number of threads” means the maximum performance with less than or equal to this number of threads (only searched over powers of 2). Since TriCache uses 16 server threads as the default configuration in Section 4, the number of threads starts with 32 threads (including server threads).

As shown in Figure 14, TriCache achieves good scalability for both the PR and RocksDB workloads, which is one reason TriCache performs well. For example, from 32 threads to 256 threads (the number of hardware threads), Ligra with TriCache (Figure 14(a)) achieves a \(4.29\times\) speedup, and RocksDB on top of TriCache (Figure 14(b)) yields a \(13.5\times\) performance improvement.

Fig. 14. Performance of TriCache and baselines under different numbers of threads.

Meanwhile, with a small number of threads, TriCache performs worse than manually optimized prefetching and asynchronous I/O because of its synchronous scheme for triggering I/O and its lack of program-specific optimizations (similar to mmap). To mitigate these limitations, over-subscription helps utilize the queue depth of SSDs as much as possible. Through over-subscription, the performance of TriCache improves by \(2.58\times\) for the PR workload and \(2.15\times\) for the RocksDB workload, enabling good performance for TriCache even without manual optimizations.

Performance and I/O Backends. TriCache currently supports SPDK and Linux AIO as its storage backends and uses SPDK to handle I/O operations in the default configuration. With the NVMe-oF (NVMe over Fabrics) feature provided by SPDK and the Linux kernel, TriCache can also access remote NVMe SSDs over the network, making it a good option for disaggregated architectures. To this end, we evaluated the performance of TriCache with the SPDK and Linux AIO backends operating on local SSDs and on NVMe-oF remote SSDs. To simulate a disk pool on the network, another server with the same hardware is connected to our testbed via Infiniband and acts as a remote disk server. We also include two network configurations to evaluate the performance impact of network bandwidth: one with dual 100Gb EDR Infiniband NICs and one with dual 56Gb FDR Infiniband NICs. For the software configuration, both the SPDK and Linux AIO backends use the SPDK-provided NVMe-oF targets (i.e., NVMe-oF servers); the AIO backend uses the NVMe-oF client provided by the Linux kernel, whereas the SPDK backend uses the NVMe-oF client from the SPDK framework.

Figure 15 shows the relative performance when using the AIO and SPDK backends operating on local or remote SSDs. The baseline is SPDK on local SSDs.

Fig. 15. Performance of TriCache with different I/O backends.

When using local SSDs, SPDK performs 1.64\(\times\) better than Linux AIO on (geometric) average, demonstrating that the user-space NVMe driver enables better I/O performance. Nevertheless, SPDK has some drawbacks, such as high programming complexity, deployment difficulties, and limited support for multiple applications. Fortunately, TriCache hides SPDK programming details from users, allowing them to write in-memory programs and still achieve efficient out-of-core performance. Moreover, the design of TriCache is not coupled to SPDK and provides comparable performance with Linux AIO. If using or deploying SPDK is not feasible, AIO serves as a reasonable alternative backend for TriCache.

When using remote SSDs through NVMe-oF, the SPDK backend delivers 0.80\(\times\) and 0.75\(\times\) the performance (in geometric average) of local disks in the dual EDR and dual FDR Infiniband environments, respectively. This demonstrates that TriCache suits local high-performance NVMe SSD arrays and can also be applied to disaggregated architectures that expand local memory capacity by leveraging remote SSDs on other servers. In the experiments, the most significant performance degradation occurs in the Shuffle Sort workload, which suffers a 45% performance drop with remote SSDs. The gap is mainly attributed to the network bandwidth of our experimental environment, which cannot match the more than 40GB/s peak bandwidth of the local NVMe SSD arrays. The Shuffle Sort workload is bandwidth-bound, so the bandwidth gap leads to the observed slowdown. It also illustrates that local NVMe SSD arrays retain cost and bandwidth advantages over far memory and remote SSDs under the disaggregated architecture. Meanwhile, when using the same NVMe-oF SSDs, the SPDK backend provides about a 2.1\(\times\) performance advantage over the Linux AIO backend, a larger gap than over local disks (where it was about 1.6\(\times\)), indicating that SPDK has less software overhead than the Linux kernel when operating NVMe-oF remote disks.

Even though SPDK already provides the functionality to operate remote disks, there is still about a 15% performance degradation when using the NVMe-oF feature of SPDK. Additionally, the experimental results illustrate that, for four of the five workloads, these performance drops are not caused by network bandwidth limitations: the losses are approximately the same under the EDR and FDR Infiniband networks (12.5% for dual EDR NICs and 17.5% for FDR NICs). Therefore, how to design caching systems for disaggregated architectures and remote SSDs remains an open research question, and we intend to consider complete system designs in future work.

Performance and Eviction Policy. TriCache supports customizable eviction policies and uses the CLOCK algorithm by default. In this set of experiments, we replace the CLOCK policy of TriCache's Shared Cache and Private SATC to compare different algorithms under different workloads. We additionally implement LRU, the 2Q algorithm, and a Random algorithm with random page selection for TriCache. Because Shared Cache is built on message passing and Private SATC is thread-local, the eviction policy runs single-threaded and requires no concurrency control. Implementing eviction algorithms for Shared Cache and Private SATC is therefore easy: for example, about 25 lines of code suffice for an LRU algorithm.
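To illustrate why an eviction policy stays this compact when it runs single-threaded, here is a minimal LRU sketch of roughly the size the article mentions. The interface (touch/victim) is illustrative and is not TriCache's actual policy API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Minimal single-threaded LRU eviction policy: touch() moves a page to the
// front of a recency list; victim() pops the least recently used page.
// No locks are needed because the policy runs on one thread.
class LruPolicy {
public:
    void touch(uint64_t page) {
        auto it = pos_.find(page);
        if (it != pos_.end()) lru_.erase(it->second);  // re-reference: unlink
        lru_.push_front(page);                         // now most recent
        pos_[page] = lru_.begin();
    }
    uint64_t victim() {            // caller ensures the policy is non-empty
        uint64_t page = lru_.back();
        lru_.pop_back();
        pos_.erase(page);
        return page;
    }
    size_t size() const { return lru_.size(); }

private:
    std::list<uint64_t> lru_;  // front = most recently used
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos_;
};
```

Swapping in 2Q or Random amounts to replacing this one class, which matches the experiment's setup of exchanging policies without touching the rest of the cache.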

Figure 16 shows the relative performance with different eviction policies, where the baseline is the CLOCK algorithm. For all five workloads, the performance with different policies is approximately the same, differing by less than 3% in geometric average. CLOCK performs best (by only a slight margin) among the four algorithms, achieving the highest performance on three of the five workloads (PR, Shuffle Sort, and GNU Sort). The CLOCK algorithm achieves almost the same cache hit rate as LRU and 2Q for these applications but with lower maintenance overhead. For RocksDB, the Random algorithm has the smallest overhead since it maintains no eviction information at access time, which improves performance by about 1.2%. Additionally, the LRU and 2Q algorithms handle complex workloads better by maintaining more access information than CLOCK, so they outperform CLOCK on the LiveGraph workload by 2.1% and 1.5%, respectively.

Fig. 16. Performance of TriCache with different eviction policies.

Performance and Page Size. TriCache implements page tables in software in the user space, which enables customizing the page size instead of relying on CPU-defined pages. For example, the page size of TriCache is configured to 128KB in the end-to-end Shuffle Sort experiment in Section 4.3, where 128KB pages help TriCache maximize sequential I/O performance. For other workloads, TriCache uses a 4KB page size by default, aligning with the Linux page cache. In this section, we decrease the page size for the Shuffle Sort workload and increase it for the other workloads to evaluate how much performance improvement the configurable page size of TriCache can achieve.

Figure 17 shows the relative performance of different workloads at page sizes from 4KB to 128KB, where the Shuffle Sort workload has a baseline of 128KB pages and the other workloads have a baseline of 4KB pages. For PR, larger page sizes of 8KB to 32KB provide up to 13% performance improvement (with 16KB pages) over the default 4KB page size. The GNU Sort workload behaves similarly to PR, with \(1.36\times\) higher performance at a 16KB page size than with 4KB pages. For Shuffle Sort, a 128KB page size fully utilizes the bandwidth of the SSD array and provides \(1.85\times\) the performance of the 4KB page size. From the hardware perspective, a 4KB access granularity is not large enough for SSDs to reach their maximum bandwidth, showing that configurable page sizes can improve TriCache's performance especially when the bottleneck is sequential I/O bandwidth.

Fig. 17. Performance of TriCache with different page sizes.

However, workloads may have different performance curves, and misconfigured page sizes can also cause performance degradation. For example, the configurable page size of TriCache brings only about 2% throughput improvement to the LiveGraph workload at 8KB pages, whereas other page sizes instead degrade its performance. For the RocksDB workload, the optimal configuration is the default 4KB page size, and larger pages produce worse performance; for example, with a 16KB page size, the RocksDB throughput drops by about 38%. Therefore, TriCache keeps the traditional 4KB pages as the default configuration, leaving users to find the optimal page size for their workloads if necessary.

Performance and SATC Sizes. Besides the page size, the two levels of SATC are also tunable in TriCache. By default, TriCache sets the aggregated capacity of Private SATC equal to that of Shared Cache and makes Direct SATC 1/4 the size of Private SATC. In this set of experiments, we resize Direct SATC and Private SATC separately to examine the performance impact of different SATC sizes.

The experiments include out-of-core tests, which reuse the cases of the preceding breakdown experiments with 64GB of memory, and in-memory cases for each workload. In the experiments, we adjust the size of SATC from the same size as the next (larger) cache level down to a minimum of 1/16 of that level. The results show that the majority of cases (i.e., Figure 18(a)–(d), Figure 18(g), and Figure 18(h)) are not sensitive to the size of SATC. Like a TLB, SATC handles address translation and thus requires only a small amount of memory to cache the addresses of hot data.

Fig. 18. Performance of TriCache with different SATC sizes.

However, there are still cases where different SATC sizes lead to more than 50% performance differences, as shown in Figure 18. The Shuffle Sort workload expects all the scanning headers of the shuffle phase to be held in memory if possible, which requires a large Private SATC (about 1/4 the size of Shared Cache for the out-of-core case) to store them. When Private SATC is not large enough to store all the headers, performance can drop by about 55%. For GNU Sort running out-of-core (Figure 18(h)), performance is better when the size of Private SATC is 1/4 or 1/2 of Shared Cache, but the highest performance is achieved when Direct SATC, Private SATC, and Shared Cache are the same size. Another case is LiveGraph running in-memory (Figure 18(i)), whose performance curve shows sensitivity to the size of Private SATC but not Direct SATC. It benefits from a Private SATC as large as possible, performing best when the size of Private SATC is 1/2 of or equal to that of Shared Cache. In general, the default configuration of TriCache, especially maximizing Private SATC, is a good starting point and achieves relatively good performance across the workloads in our experiments. The results also show that TriCache could achieve better performance with parameter searching, so we believe a self-driving caching system is an interesting research problem and leave it as future work.

Threads or Fibers. Some designs of TriCache are optimized for stackful coroutines (also called fibers), as fibers are more lightweight than threads and have the potential to provide better out-of-core performance. This set of experiments compares TriCache working in fiber and thread modes through a set of micro-benchmarks. The micro-benchmark is similar to the 4KB Random workload in Section 4.5 with a 10% memory hit rate. It visits a page randomly and then accesses random data within the same page several times. By adjusting the number of random data accesses within a page, we can examine how much CPU time remains available for computation apart from I/O operations.
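The access pattern of this micro-benchmark can be sketched as follows, with a plain DRAM buffer standing in for TriCache-managed virtual memory; the function and parameter names are our own, not the benchmark's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sketch of the micro-benchmark's access pattern: pick a random page, then
// perform `padding` random byte reads within that page. In the real
// benchmark the buffer would be TriCache-backed, so the first touch of a
// page may trigger SSD I/O while the in-page reads stay in memory.
uint64_t randomPageWorkload(const std::vector<uint8_t>& buf, size_t pageSize,
                            size_t pageVisits, size_t padding, uint64_t seed) {
    std::mt19937_64 rng(seed);
    const size_t pages = buf.size() / pageSize;
    uint64_t checksum = 0;  // keeps the reads from being optimized away
    for (size_t i = 0; i < pageVisits; ++i) {
        size_t page = rng() % pages;              // random page visit
        const uint8_t* base = buf.data() + page * pageSize;
        for (size_t j = 0; j < padding; ++j)      // in-page random accesses
            checksum += base[rng() % pageSize];
    }
    return checksum;
}
```

Raising `padding` shifts the workload from I/O-bound toward CPU-bound, which is exactly the knob the experiment turns to compare threads against fibers.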

In the experiments, the numbers of threads and fibers are kept the same to fairly compare the two parallel schemes. In fiber mode, one thread is launched per core, and the fibers are evenly distributed and bound to each core. Figure 19 shows the performance of TriCache using threads and fibers. With only one random access per page access (when the throughput of page accesses is maximal), fibers perform \(1.12\times\) better than threads. When multiple fibers run in a single thread, the message passing module of TriCache can merge their messages and amortize the overhead of inter-thread synchronization, thus providing better I/O performance.

Fig. 19. Performance of TriCache with threads and fibers.

As the number of padded random accesses increases, the performance of both thread and fiber modes gradually degrades, dropping to about 90% of the maximum throughput at 512 random accesses with fibers and 1024 random accesses with threads. With more padded accesses, the advantage of fibers is more significant, reaching 1.76\(\times\) at 4096 random accesses per page. At this point, fibers still deliver more than 5M ops/s of page access, whereas threads can only support 2048 random accesses to achieve similar performance. Additionally, the geometric average speedup for fibers relative to threads is 1.31\(\times\) for different padding accesses. The preceding results show that fibers can leave more CPU resources for computation while achieving the same I/O performance, resulting in better performance in CPU-bound workloads. The reason is that the number of CPU cores currently available is insufficient to fill the queue depth of NVMe SSD arrays, and over-subscriptions are required to fully utilize the disks. The context switching of fibers happens purely in the user space and does not need to enter the kernel space as threads do. Hence, the advantage of fibers becomes apparent when one CPU core is processing multiple parallel tasks simultaneously (with over-subscriptions).


5 RELATED WORK

There is a series of work that tries to improve page caching performance with customized memory-mapped file I/O paths or swapping approaches [25, 26, 27, 28, 37, 41]. Kmmap [26] provides several improvements to reduce performance variation caused by the aggressive write-back policy of Linux. FastMap [27] addresses scalability issues by separating clean and dirty pages and using per-core data structures to avoid centralized contention, with the help of a custom Linux kernel. Still, our evaluation shows that FastMap cannot saturate current high-performance NVMe SSD arrays. Aquila [25] offers a library OS solution that eliminates the need for kernel modifications but relies on hardware support for virtualization, which makes it difficult to deploy in cloud environments. Umap [28] provides an mmap-like interface to user-space page fault handlers based on userfaultfd [40] in Linux but is faster than mmap only with large page sizes. LightSwap [55] redesigns the swapping system to reduce context switching and page fault overheads, but it requires both kernel and program modifications. TriCache exposes a memory interface like these kernel-involved solutions but runs completely in the user space to achieve maximal out-of-core performance.

Block caches (or buffer managers) are critical components in data-intensive applications for supporting out-of-core processing [4, 10, 13, 18, 52, 54]. Several attempts try to improve the performance of block caches. SAFS [53], the storage backend for FlashGraph [54], adopts a lightweight cache design based on NUMA-aware message passing; users need to program against its asynchronous I/O interface to exploit maximal I/O performance on SSD arrays. LeanStore [18] uses pointer swizzling so that pages residing in memory can be referenced directly without page lookups. However, it requires pages to form a tree-like structure and thus applies only to limited scenarios. TriCache shares similar goals but provides a memory interface that is user-transparent and more general.
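The core idea of pointer swizzling is that a single machine word either holds a raw in-memory pointer or a tagged on-disk page identifier. The encoding below is our own illustrative choice (bit 0 as the tag), not LeanStore's actual layout.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative pointer-swizzling encoding (not LeanStore's real layout):
// pointers to cached pages are at least 2-byte aligned, so bit 0 is free.
// Bit 0 == 0: swizzled (raw in-memory pointer, dereference directly).
// Bit 0 == 1: unswizzled (on-disk page ID, requires a cache lookup/load).
using Swip = uint64_t;

Swip swizzle(void* inMemory)    { return reinterpret_cast<uint64_t>(inMemory); }
Swip unswizzle(uint64_t pageId) { return (pageId << 1) | 1; }
bool isSwizzled(Swip s)         { return (s & 1) == 0; }
void* asPointer(Swip s)         { return reinterpret_cast<void*>(s); }
uint64_t asPageId(Swip s)       { return s >> 1; }
```

A hot-path access then costs one branch on the tag bit instead of a hash-table lookup, which is the speedup pointer swizzling trades against its tree-structure restriction.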

Remote cache systems [12, 22] have been developed upon ideas of the disaggregated architecture [1, 9, 11, 16, 17, 23, 29, 32, 33, 34], which utilizes high bandwidth and low latency of modern networks. In this article, we focus on scaling-up through NVMe SSD arrays. Additionally, we intend to consider support for disaggregated architectures in our future work.

Non-Volatile Memory (NVM) enables larger capacity than DRAM, and research has been devoted to memory management instead of paging strategies to render memory access efficient on hybrid NVM and DRAM architectures [15, 30, 42, 56]. Nevertheless, block caches such as TriCache remain better suited for NVMe SSDs, whose latencies are higher than those of NVM or DRAM.


6 DISCUSSION

In this section, we present some possibilities where hardware and software could be further co-designed to improve the performance of TriCache.

First, the message passing technique used in Shared Cache decouples clients and servers and eliminates locks on the server side. However, it still requires synchronization between clients and servers due to the nature of memory operations. To solve this problem, processors could add hardware queues among physical cores and corresponding instructions to control them. Such modification would bring native support for message passing semantics on processors and boost not only TriCache but also systems with a similar design (e.g., MPI).

Second, Section 3 and Section 4 both mention that stackful coroutines or fibers can further improve the performance of TriCache. However, to the best of our knowledge, no current framework can express and run parallel tasks with a hybrid thread-coroutine architecture in a general and transparent manner. We expect such work to enable applying coroutine techniques to existing systems, thus completing cache operations purely in the user space and achieving further performance improvements.

Finally, SATC in TriCache does not need to be notified by Shared Cache when a block is swapped out. In contrast, the hardware TLB in processors, which accelerates address translation just as SATC does, requires the OS page cache to explicitly invalidate evicted pages through TLB shootdown, incurring considerable overhead [2, 3] due to inter-processor interrupts. Comparing the two mechanisms, SATC uses reference counting to prevent evicting blocks currently in use by clients, whereas the OS is not directly aware of how many TLB entries still refer to the pages to be evicted. It is possible to extend the design of SATC to the TLB: processors could maintain reference counts for page table entries, for example, recording the number of TLB entries that currently hold a specific page table entry. The OS could then adapt its page swapping and eviction policies to avoid evicting pages currently present in TLBs, mitigating the performance issue brought by TLB shootdown.
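The reference-counting scheme described above can be sketched as a pin/unpin counter that makes a block ineligible for eviction while clients hold it; the interface is illustrative, not TriCache's actual API.

```cpp
#include <atomic>
#include <cassert>

// Illustrative pin/unpin reference counting: a cached block may be evicted
// only when its pin count is zero, so translation caches holding the block
// need no shootdown-style invalidation while clients still use it.
struct CachedBlock {
    std::atomic<int> pins{0};

    void pin()   { pins.fetch_add(1, std::memory_order_acquire); }
    void unpin() { pins.fetch_sub(1, std::memory_order_release); }

    // An evictor skips any block with outstanding pins.
    bool evictable() const { return pins.load(std::memory_order_acquire) == 0; }
};
```

This is the inverse of the TLB situation: the eviction side consults the count instead of broadcasting invalidations, which is what eliminates the shootdown cost.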


7 CONCLUSION

In this article, we explored a new user-space approach to achieving efficient out-of-core processing with in-memory programs by providing a virtual memory interface on top of a block cache. We implemented TriCache based on a novel multi-level design and applied it to various in-memory or mmap-based programs without manual code modification. TriCache achieves out-of-core performance that is orders of magnitude higher than that of the Linux OS page cache, and is often comparable to or even faster than specialized out-of-core solutions.

The open source implementation of TriCache and instructions to reproduce the main experimental results can be accessed at https://github.com/thu-pacman/TriCache.

Footnotes

1. In case of cache misses, the victim blocks resident in memory need to be replaced with the requested blocks on storage; in case of cache hits, the reference counts need to be updated with locks/latches or atomic operations.

2. https://github.com/jshun/ligra [commit 7755d95].

3. https://github.com/flashxio/FlashX [commit 2a649ff].

4. https://github.com/facebook/rocksdb [tag v6.26.1].

5. https://github.com/apache/spark [tag v3.2.0].

6. https://github.com/thu-pacman/LiveGraph [commit eea5a40].

REFERENCES

[1] Aguilera Marcos K., Amit Nadav, Calciu Irina, Deguillard Xavier, Gandhi Jayneel, Novakovic Stanko, Ramanathan Arun, Subrahmanyam Pratap, Suresh Lalith, Tati Kiran, et al. 2018. Remote regions: A simple abstraction for remote memory. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC'18). 775–787.
[2] Amit Nadav. 2017. Optimizing the TLB shootdown algorithm with page access tracking. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC'17). 27–39. https://www.usenix.org/conference/atc17/technical-sessions/presentation/amit.
[3] Amit Nadav, Tai Amy, and Wei Michael. 2020. Don't shoot down TLB shootdowns! In Proceedings of the 15th European Conference on Computer Systems (EuroSys'20). ACM, New York, NY, 1–14.
[4] Bingmann Timo, Axtmann Michael, Jöbstl Emanuel, Lamm Sebastian, Nguyen Huyen Chau, Noe Alexander, Schlag Sebastian, Stumpp Matthias, Sturm Tobias, and Sanders Peter. 2016. Thrill: High-performance algorithmic distributed batch data processing with C++. arXiv:cs.DC/1608.05634 (2016).
[5] Boldi Paolo, Rosa Marco, Santini Massimo, and Vigna Sebastiano. 2011. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In Proceedings of the 20th International Conference on World Wide Web (WWW'11). ACM, New York, NY, 587–596.
[6] Boldi P. and Vigna S. 2004. The webgraph framework I: Compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW'04). ACM, New York, NY, 595–602.
[7] Boncz Peter A., Kersten Martin L., and Manegold Stefan. 2008. Breaking the memory wall in MonetDB. Communications of the ACM 51, 12 (Dec. 2008), 77–85.
[8] Cao Zhichao, Dong Siying, Vemuri Sagar, and Du David H. C. 2020. Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST'20). 209–223. https://www.usenix.org/conference/fast20/presentation/cao-zhichao.
[9] Costa Paolo, Ballani Hitesh, Razavi Kaveh, and Kash Ian. 2015. R2C2: A network stack for rack-scale computers. ACM SIGCOMM Computer Communication Review 45, 4 (2015), 551–564.
[10] Dong Siying, Callaghan Mark, Galanis Leonidas, Borthakur Dhruba, Savor Tony, and Strum Michael. 2017. Optimizing space amplification in RocksDB. In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR'17). 1–9.
[11] Gao Peter X., Narayan Akshay, Karandikar Sagar, Carreira Joao, Han Sangjin, Agarwal Rachit, Ratnasamy Sylvia, and Shenker Scott. 2016. Network requirements for resource disaggregation. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16). 249–264.
[12] Gu Juncheng, Lee Youngmoon, Zhang Yiwen, Chowdhury Mosharaf, and Shin Kang G. 2017. Efficient memory disaggregation with Infiniswap. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI'17). 649–667.
[13] Hellerstein Joseph M., Stonebraker Michael, and Hamilton James. 2007. Architecture of a Database System. Now Publishers Inc.
[14] Huang Shengsheng, Huang Jie, Dai Jinquan, Xie Tao, and Huang Bo. 2010. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In Proceedings of the 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW'10). IEEE, Los Alamitos, CA, 41–51.
[15] Kimura Hideaki. 2015. FOEDUS: OLTP engine for a thousand cores and NVRAM. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 691–706.
[16] Klimovic Ana, Kozyrakis Christos, Thereska Eno, John Binu, and Kumar Sanjeev. 2016. Flash storage disaggregation. In Proceedings of the 11th European Conference on Computer Systems (EuroSys'16). ACM, New York, NY, 1–15.
[17] Klimovic Ana, Litz Heiner, and Kozyrakis Christos. 2017. ReFlex: Remote flash \(\approx\) local flash. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'17). ACM, New York, NY, 345–359.
[18] Leis Viktor, Haubenschild Michael, Kemper Alfons, and Neumann Thomas. 2018. LeanStore: In-memory data management beyond main memory. In Proceedings of the 2018 IEEE 34th International Conference on Data Engineering (ICDE'18). 185–196.
[19] Lepers Baptiste, Balmau Oana, Gupta Karan, and Zwaenepoel Willy. 2019. KVell: The design and implementation of a fast persistent key-value store. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP'19). ACM, New York, NY, 447–461.
[20] Lepers Baptiste, Balmau Oana, Gupta Karan, and Zwaenepoel Willy. 2020. KVell+: Snapshot isolation without snapshots. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI'20). 425–441. https://www.usenix.org/conference/osdi20/presentation/lepers.
[21] Lersch Lucas, Schreter Ivan, Oukid Ismail, and Lehner Wolfgang. 2020. Enabling low tail latency on multicore key-value stores. Proceedings of the VLDB Endowment 13, 7 (March 2020), 1091–1104.
[22] Liang Shuang, Noronha Ranjit, and Panda Dhabaleswar K. 2005. Swapping to remote memory over Infiniband: An approach using a high performance network block device. In Proceedings of the 2005 IEEE International Conference on Cluster Computing. IEEE, Los Alamitos, CA, 1–10.
[23] Lim Kevin, Turner Yoshio, Santos Jose Renato, AuYoung Alvin, Chang Jichuan, Ranganathan Parthasarathy, and Wenisch Thomas F. 2012. System-level implications of disaggregated memory. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture. IEEE, Los Alamitos, CA, 1–12.
[24] Lin Zhiyuan, Kahng Minsuk, Sabrin Kaeser Md., Chau Duen Horng Polo, Lee Ho, and Kang U. 2014. MMap: Fast billion-scale graph computation on a PC via memory mapping. In Proceedings of the 2014 IEEE International Conference on Big Data (Big Data'14). IEEE, Los Alamitos, CA, 159–164.
[25] Papagiannis Anastasios, Marazakis Manolis, and Bilas Angelos. 2021. Memory-mapped I/O on steroids. In Proceedings of the 16th European Conference on Computer Systems. ACM, New York, NY, 277–293.
[26] Papagiannis Anastasios, Saloustros Giorgos, González-Férez Pilar, and Bilas Angelos. 2018. An efficient memory-mapped key-value store for flash storage. In Proceedings of the ACM Symposium on Cloud Computing. 490–502.
[27] Papagiannis Anastasios, Xanthakis Giorgos, Saloustros Giorgos, Marazakis Manolis, and Bilas Angelos. 2020. Optimizing memory-mapped I/O for fast storage devices. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC'20). 813–827. https://www.usenix.org/conference/atc20/presentation/papagiannis.
[28] Peng Ivy, McFadden Marty, Green Eric, Iwabuchi Keita, Wu Kai, Li Dong, Pearce Roger, and Gokhale Maya. 2019. UMap: Enabling application-driven optimizations for page management. In Proceedings of the 2019 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC'19). IEEE, Los Alamitos, CA, 71–78.
[29] Protic Jelica, Tomasevic Milo, and Milutinovic Veljko. 1996. Distributed shared memory: Concepts and systems. IEEE Parallel & Distributed Technology: Systems & Applications 4, 2 (1996), 63–71.
[30] Raybuck Amanda, Stamler Tim, Zhang Wei, Erez Mattan, and Peter Simon. 2021. HeMem: Scalable tiered memory management for big data applications and real NVM. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 392–407.
[31] Roghanchi Sepideh, Eriksson Jakob, and Basu Nilanjana. 2017. ffwd: Delegation is (much) faster than you think. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP'17). ACM, New York, NY, 342–358.
[32] Ruan Zhenyuan, Schwarzkopf Malte, Aguilera Marcos K., and Belay Adam. 2020. AIFM: High-performance, application-integrated far memory. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI'20). 315–332.
[33] Shan Yizhou, Huang Yutong, Chen Yilun, and Zhang Yiying. 2018. LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI'18). 69–87. https://www.usenix.org/conference/osdi18/presentation/shan.
[34] Shan Yizhou, Tsai Shin-Yeh, and Zhang Yiying. 2017. Distributed shared persistent memory. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC'17). ACM, New York, NY, 323–337.
[35] Shun Julian and Blelloch Guy E. 2013. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 135–146.
[36] Singler Johannes and Konsik Benjamin. 2008. The GNU libstdc++ parallel mode: Software engineering considerations. In Proceedings of the 1st International Workshop on Multicore Software Engineering (IWMSE'08). 15–22.
[37] Song Nae Young, Son Yongseok, Han Hyuck, and Yeom Heon Young. 2016. Efficient memory-mapped I/O on fast storage device. ACM Transactions on Storage 12, 4 (2016), 1–27.
[38] Sort Benchmark Committee. n.d. Sort Benchmark Home Page. Retrieved May 31, 2022 from http://sortbenchmark.org.
[39] Tanenbaum Andrew S. and Bos Herbert. 2015. Modern Operating Systems. Pearson.
[40] The kernel development community. Userfaultfd—The Linux Kernel Documentation. Retrieved May 31, 2022 from https://www.kernel.org/doc/html/latest/admin-guide/mm/userfaultfd.html.
[41] Van Essen Brian, Hsieh Henry, Ames Sasha, Pearce Roger, and Gokhale Maya. 2015. DI-MMAP: A scalable memory-map runtime for out-of-core data-intensive applications. Cluster Computing 18, 1 (2015), 15–28.
[42] van Renen Alexander, Leis Viktor, Kemper Alfons, Neumann Thomas, Hashida Takushi, Oe Kazuichi, Doi Yoshiyasu, Harada Lilian, and Sato Mitsuru. 2018. Managing non-volatile memory in database systems. In Proceedings of the 2018 International Conference on Management of Data. 1541–1555.
  42. [42] Renen Alexander van, Leis Viktor, Kemper Alfons, Neumann Thomas, Hashida Takushi, Oe Kazuichi, Doi Yoshiyasu, Harada Lilian, and Sato Mitsuru. 2018. Managing non-volatile memory in database systems. In Proceedings of the 2018 International Conference on Management of Data. 15411555.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. [43] Contributors Wikipedia. 2022. Memory-mapped file. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=Memory-mapped%20file&oldid=1089594834.Google ScholarGoogle Scholar
  44. [44] Contributors Wikipedia. 2022. Memory paging. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=Memory%20paging&oldid=1068326108.Google ScholarGoogle Scholar
  45. [45] Contributors Wikipedia. 2022. NVM Express. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=NVM%20Express&oldid=1090339430.Google ScholarGoogle Scholar
  46. [46] Contributors Wikipedia. 2022. Page cache. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=Page%20cache&oldid=1068818367.Google ScholarGoogle Scholar
  47. [47] Contributors Wikipedia. 2022. PCI Express. Wikipedia. Retrieved May 31, 2022 from https://en.wikipedia.org/w/index.php?title=PCI_Express&oldid=1090153203.Google ScholarGoogle Scholar
  48. [48] Contributors Wikipedia. 2022. U.2. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=U.2&oldid=1066844795.Google ScholarGoogle Scholar
  49. [49] Yadgar Gala, Factor Michael, and Schuster Assaf. 2007. Karma: Know-it-all replacement for a multilevel cache. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07). https://www.usenix.org/conference/fast-07/karma-know-it-all-replacement-multilevel-cache.Google ScholarGoogle Scholar
  50. [50] Yang Ziye, Harris James R., Walker Benjamin, Verkamp Daniel, Liu Changpeng, Chang Cunyin, Cao Gang, Stern Jonathan, Verma Vishal, and Paul Luse E.. 2017. SPDK: A development kit to build high performance storage applications. In Proceedings of the 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom’17). 154–161. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  51. [51] Yaniv Idan and Tsafrir Dan. 2016. Hash, Don’t cache (the page table). ACM SIGMETRICS Performance Evaluation Review 44, 1 (June2016), 337350. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. [52] Zaharia Matei, Xin Reynold S., Wendell Patrick, Das Tathagata, Armbrust Michael, Dave Ankur, Meng Xiangrui, et al. 2016. Apache Spark: A unified engine for big data processing. Communications of the ACM 59, 11 (Oct. 2016), 5665. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. [53] Zheng Da, Burns Randal, and Szalay Alexander S.. 2013. Toward millions of file system IOPS on low-cost, commodity hardware. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC’13). ACM, New York, NY, Article 69, 12 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. [54] Zheng Da, Mhembere Disa, Burns Randal, Vogelstein Joshua, Priebe Carey E., and Szalay Alexander S.. 2015. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST’15). 4558. https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng.Google ScholarGoogle Scholar
  55. [55] Zhong Kan, Cui Wenlin, Lu Youyou, Liu Quanzhang, Yan Xiaodan, Yuan Qizhao, Luo Siwei, and Huang Keji. 2021. Revisiting swapping in user-space with lightweight threading. arxiv:cs.OS/2107.13848 (2021).Google ScholarGoogle Scholar
  56. [56] Zhou Xinjing, Arulraj Joy, Pavlo Andrew, and Cohen David. 2021. Spitfire: A three-tier buffer manager for volatile and non-volatile memory. In Proceedings of the 2021 International Conference on Management of Data. 21952207.Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. [57] Zhu Xiaowei, Feng Guanyu, Serafini Marco, Ma Xiaosong, Yu Jiping, Xie Lei, Aboulnaga Ashraf, and Chen Wenguang. 2020. LiveGraph: A transactional graph storage system with purely sequential adjacency list scans. Proceedings of the VLDB Endowment 13, 7 (March2020), 10201034. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library


Published in

    ACM Transactions on Storage, Volume 19, Issue 2
    May 2023
    269 pages
    ISSN: 1553-3077
    EISSN: 1553-3093
    DOI: 10.1145/3585541

    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

    Publisher: Association for Computing Machinery, New York, NY, United States

    Publication History

    • Published: 22 March 2023
    • Online AM: 13 February 2023
    • Accepted: 23 January 2023
    • Received: 27 December 2022
