Abstract
Out-of-core systems rely on high-performance cache sub-systems to reduce the number of I/O operations. Although the page cache in modern operating systems enables transparent access to memory and storage devices, it suffers from efficiency and scalability issues on cache misses, forcing out-of-core systems to design and implement their own cache components, which is a non-trivial task.
This study proposes TriCache, a cache mechanism that enables in-memory programs to efficiently process out-of-core datasets without requiring any code rewrite. It provides a virtual memory interface on top of the conventional block interface to simultaneously achieve user transparency and sufficient out-of-core performance. A multi-level block cache design is proposed to address the challenge of per-access address translations required by a memory interface. It can exploit spatial and temporal localities in memory or storage accesses to render storage-to-memory address translation and page-level concurrency control adequately efficient for the virtual memory interface.
Our evaluation shows that in-memory systems operating on top of TriCache can outperform Linux OS page cache by more than one order of magnitude, and can deliver performance comparable to or even better than that of corresponding counterparts designed specifically for out-of-core scenarios.
1 INTRODUCTION
NVMe [45] Solid-State Drives (SSDs) have drawn a wide range of interest because of their high I/O performance. The U.2 interface [48] and PCIe 4.0 standard [47] have also increased the storage density of NVMe SSDs in recent years. For instance, a dual-socket commodity server can mount an array of more than 16 NVMe SSDs to provide tens of terabytes of storage capacity, tens of millions of random IOPS, and dozens of gigabytes per second of bandwidth while being 20 to 40 times cheaper than Dynamic Random Access Memory (DRAM).
Although NVMe SSD arrays can improve the aggregated performance and capacity of the system, they still suffer from block-wise I/O accesses and have latencies at least 100\(\times\) longer than those of DRAM. To efficiently process datasets that are significantly larger than available memory, out-of-core systems rely on cache sub-systems to maintain frequently operated data in memory. I/O operations can be merged or skipped on cache hits, bridging the performance gap between DRAM and SSDs.
Page cache [46] is a cache sub-system in modern Operating Systems (OS) that manages data on the granularity of pages (typically 4KB) across DRAM and SSDs. It enables in-memory applications to support out-of-core processing on SSDs without requiring any rewrite through swapping [44] or memory mapping [43] based on virtual memory.
However, current implementations of page cache encounter issues related to scalability and performance on cache misses owing to global locking on internal data structures [27]. Recent literature [25, 32, 55] indicates that the heavy I/O stack, page faults, and context switching overheads also limit kernel swapping and I/O performance on fast storage devices such as NVMe SSD arrays.
Therefore, data-intensive applications such as databases and data processing systems [4, 10, 13, 18, 19, 20, 52, 54] usually design and implement their own user-space block caches (also known as buffer managers) that manage data by blocks (typically of a fixed size that is a multiple of the physical sector size). In contrast to OS page cache, block cache reduces context switching overhead by running mainly in the user space, and supports customization in terms of tuning block sizes and replacement policies to further improve performance.
Nevertheless, designing and implementing block caches and upper-level components imposes expensive development costs. Existing block caches in the user space usually ask users to explicitly acquire/release blocks [10, 18] or manipulate data through an asynchronous interface [54]. Developers often have to redesign and re-implement the entire system according to the API requirements of the block cache, which is non-trivial. To fill the gap between out-of-core performance and development costs, we investigate a new general cache mechanism that can transparently extend in-memory systems for efficient out-of-core processing on NVMe SSDs without requiring any manual modification.
Efficient kernel-bypass I/O stacks, such as SPDK [50], can achieve good out-of-core performance in the user space by avoiding expensive kernel I/O operations and taking advantage of the high IOPS of the NVMe SSD array. This inspires us to explore a user-space solution that can eliminate the overheads of page faults and context switching. A user-space solution is cross-platform, easy to deploy and customize, and avoids introducing potential security vulnerabilities caused by kernel modifications.
To ensure transparency for the user, a virtual memory interface is needed to bridge the semantic gap between the fine-grained memory accesses of existing in-memory programs and block-wise I/O operations on physical block devices, just as the OS's virtual memory does. Such an interface makes it possible for in-memory software to run on NVMe SSDs without modification, provided we can automatically redirect memory accesses to the block cache.
Several challenges need to be addressed to implement a user-space block cache with a virtual memory interface. First, the cache system requires good scalability to achieve high out-of-core performance so that it can fit in the hundreds of CPU cores and the tens of millions of SSD IOPS in use today. Second, it requires an efficient address translation mechanism that looks up in-memory addresses for cached blocks, to provide a virtual memory interface with fine-grained accesses. Such fine-grained accesses and per-access address translations pose a much more significant challenge than block lookups in current user-space block caches. Third, it requires a scheme to redirect the memory accesses of existing in-memory systems to the cache system without manual modification.
To address the preceding challenges, we propose the following contributions:
We build a scalable block cache based on a concurrency mechanism named Hybrid Lock-free Delegation that combines message passing based delegation with lock-free hash tables. It can utilize the NVMe SSD array with only a few server threads.
We design a two-level Software Address Translation Cache (SATC) to support fast address translation in the user space, replacing hand-written block-aware code with an automatic mechanism that exploits locality at runtime. SATC can accelerate software address translation by orders of magnitude.
We propose a pure software-based scheme to supervise memory accesses based on compile-time instrumentation and library hooking techniques. Existing in-memory applications can efficiently run on NVMe SSDs through the block cache without requiring code modification.
Based on these techniques, we design and implement TriCache, a user-transparent block cache providing a virtual memory interface. Our results across several application domains show that TriCache enables in-memory programs to efficiently process out-of-core datasets without manual code rewrites. TriCache can outperform OS page cache by more than an order of magnitude, and can often reach or even exceed the performance of specialized out-of-core systems.
2 BACKGROUND AND MOTIVATION
In this section, we briefly introduce the two types of general caches that can be used for out-of-core processing, OS page cache and user-space block cache, and use a motivating example to show the benefits as well as the challenges of a new approach that combines the advantages of both.
Page cache is a transparent cache for pages originating from storage devices [46]. Modern OSes keep the page cache in otherwise-unused portions of main memory, so that some accesses to storage devices can be served from memory to improve performance. The page cache is implemented in the kernel through virtual memory management and is mostly transparent to applications. Users can use a memory-mapping system call [43] to map a file into a segment of virtual memory, or rely on swapping [44] to move pages to/from disks on demand, thus accessing storage just like memory.
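For example, the following sketch maps a file read-only with the POSIX mmap system call and then scans it through plain memory loads, leaving all I/O to the OS page cache; the function name is illustrative:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a file read-only and count bytes equal to `c`, accessing the
 * file contents through ordinary memory loads. The OS page cache
 * pages data in on demand and may evict it under memory pressure. */
size_t count_bytes_mmap(const char *path, unsigned char c) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    off_t len = lseek(fd, 0, SEEK_END);
    unsigned char *data =
        mmap(NULL, (size_t)len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (data == MAP_FAILED) return 0;
    size_t n = 0;
    for (off_t i = 0; i < len; i++)   /* plain memory accesses */
        if (data[i] == c) n++;
    munmap(data, (size_t)len);
    return n;
}
```

The loop body contains no explicit I/O; on a cache miss, a page fault transfers control to the kernel, which is exactly the overhead discussed below.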
The memory interface of the page cache provides maximal user transparency for developing out-of-core applications [7, 24]; however, its use can lead to severe performance bottlenecks, especially on cache misses when the backing storage is an array of high-performance NVMe SSDs. These bottlenecks result from various factors, including but not limited to global locking in the kernel, the heavy I/O stack, page faults, and context switching overheads [25, 27, 32, 55]. Although some studies have attempted to modify the kernel to improve the performance of the page cache [25, 26, 27, 37, 41], it is challenging to apply the relevant modifications in the kernel space, which may introduce potential portability and security issues.
To this end, most out-of-core systems design and implement their own block caching components in the user space to mitigate and even eliminate the preceding issues. Like the OS page cache, a block cache manages a pool of pages in memory, and loads/evicts pages from/to disks upon user requests. The major difference is that the block cache runs mostly in user space and provides a block interface: users first acquire the blocks they need from the cache, operate on their in-memory copies, and finally release them back to the cache.
Although an efficient and scalable block cache can make full use of storage devices in terms of performance, its block interface requires a considerable amount of work to be put into use. Figure 1 illustrates this with a concrete example: calculating the length of a string. The left part presents the implementation by using a memory interface, and the right part shows an alternative version with a block interface. It is evident that the block version is far more complex than the memory version because system developers have to take care of more details, such as checking block boundaries and correctly pairing every block acquisition with a release.
Fig. 1. Out-of-core implementations of strlen with memory (left) and block (right) interfaces.
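To make the contrast concrete, the following sketch mirrors the structure of Figure 1 against a hypothetical block-cache API; cache_acquire and cache_release are illustrative names, not TriCache's actual interface, and the backing "disk" is a static array for demonstration:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* Stand-in backing store and hypothetical block-cache API for the
 * sketch: cache_acquire pins a block and returns its address;
 * cache_release unpins it so that it may be evicted. */
static char disk[2][BLOCK_SIZE];
static const char *cache_acquire(uint64_t block_id) { return disk[block_id]; }
static void cache_release(uint64_t block_id) { (void)block_id; }

/* Memory interface: the familiar in-memory version. */
size_t strlen_mem(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}

/* Block interface: the caller must track block boundaries and pair
 * every acquire with a release by hand. */
size_t strlen_block(uint64_t block_id, size_t offset) {
    size_t n = 0;
    for (;;) {
        const char *buf = cache_acquire(block_id);
        for (size_t i = offset; i < BLOCK_SIZE; i++) {
            if (buf[i] == '\0') { cache_release(block_id); return n; }
            n++;
        }
        cache_release(block_id);  /* string continues in the next block */
        block_id++;
        offset = 0;
    }
}
```

Even for a three-line algorithm, the block version must reason about boundary crossings and pin/unpin pairing, which is precisely the development cost TriCache aims to remove.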
This motivates us to explore a user-space block cache providing a virtual memory interface, which can combine the advantages of high out-of-core performance and high user transparency from both types of caches. The user-space approach drops some functional capabilities of the OS page cache, such as sharing memory across processes with consistency guarantees. However, it allows us to redesign the cache sub-system toward new high-performance storage. Although the virtual memory interface forces applications to manipulate the cache synchronously and manage data in fixed-size blocks (rather than objects or rows), such an interface enables user transparency and saves developers considerable effort.
However, a user-space block cache with a virtual memory interface is not as easy to build as it might appear. Since every memory access now needs to involve a pair of address translation and block pin/unpin operations, the per-access overhead must be driven down to near that of a native memory access; Section 3 describes how TriCache achieves this.
3 DESIGN AND IMPLEMENTATION OF TRICACHE
In this section, we first present an overview of the system design of TriCache, then describe its efficient multi-level block cache runtime in a bottom-up manner, including how to build a scalable block cache and reduce the cost of cache accesses in the user space to support transparent usage. Finally, we introduce how to automatically apply TriCache to in-memory applications via compiler techniques.
3.1 Overview of TriCache
Figure 2 shows the high-level architecture of TriCache. It consists of an LLVM compiler plugin and a runtime module.
Fig. 2. High-level architecture of TriCache.
TriCache LLVM Compiler Plugin first instruments each memory instruction, such as load and store, in the user application code, inserting a software address translation call before it so that addresses in TriCache-managed virtual memory are resolved to user-space memory addresses at runtime.
TriCache Runtime is the core of TriCache (the dashed box in Figure 2). It is a multi-level block cache that supports fast address translation and provides a virtual memory interface. It implements the translation calls inserted by the compiler plugin.
Each translation first consults the two-level Software Address Translation Cache (SATC), which caches recently translated blocks thread-locally so that most translations complete within a few instructions on the fast path.
Below SATC, TriCache Runtime manages data with Shared Cache (in the middle of Figure 2, gray background). Shared Cache is a full-featured block cache shared by multiple threads that maintains an in-memory cache pool for reading and writing the underlying storage. It manages a block table for all in-memory blocks and serves address translations when SATC misses. The block table exposes a lookup interface that SATC uses to fill its entries.
For Shared Cache, we propose a Hybrid Lock-free Delegation based concurrency control scheme. First, we distinguish between address translations (together with other metadata operations) and data accesses. Only the former are delegated to dedicated server threads through message passing, whereas clients perform data accesses directly; lock-free block tables additionally allow clients to complete translations locally on cache hits.
In Figure 3, we present an example of a user program, a follow-up of Figure 1, instrumented by the TriCache compiler plugin and then run with the TriCache runtime. The C program is first compiled to LLVM IR (Intermediate Representation) with Clang, with the memory read compiled to a load instruction. TriCache LLVM Compiler Plugin instruments the load instruction into two operations: one translates the TriCache-space address into a user-space memory address, and the other performs the original load on the translated address.
Fig. 3. An example of a user program running on TriCache.
3.2 Shared Cache
As the core module of TriCache, Shared Cache determines TriCache’s throughput, especially its I/O performance. Therefore, good scalability is the primary design goal of Shared Cache for the effective use of hundreds of CPU cores, tens of NVMe SSDs, and millions of IOPS.
Design Decisions. Figure 4(a) shows a straightforward design used by the current Linux kernel. It uses a global lock to protect the block table (or page table) and the cache. However, the single lock leads to heavy lock contention and is difficult to scale for high-performance storage devices [27]. The sharding technique can help mitigate the scalability issue, as shown in Figure 4(b). The block cache [10, 27] can use a predefined function (usually hashing) to partition the blocks into several shards and then assign a lock to each shard. In addition, recent work proposes that well-designed delegation based on message passing can provide better scalability and hotspot tolerance than locks [21, 31, 53] on NUMA (Non-Uniform Memory Access) architectures.
Fig. 4. Different designs and concurrent mechanisms for the shared block cache.
Therefore, we propose a Hybrid Lock-free Delegation for Shared Cache of TriCache, as shown in Figure 4(c). The Shared Cache adopts a client-server model based on message passing (solid lines in Figure 4(c)). Each client-server pair shares a lightweight message queue with a size of two cache lines, similar to ffwd [31]. Each user thread corresponds to a client, and several dedicated servers handle requests from clients. Each server is single-threaded, lock-free, and only responsible for managing a part of the blocks (e.g., partitioned by hashing block IDs). Multiple partitions and servers can achieve concurrency and scalability, and more servers can be added when a higher throughput is desired. In addition to message passing based delegation, clients can directly access per-partition block tables on cache hits to reduce server-side CPU consumption (dashed lines in Figure 4(c); more details are provided in the following Client-side Fast Paths on Cache Hits discussion).
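One client-server message slot in this spirit can be sketched as follows; the layout and memory orders are simplified for illustration, and the real queue packs requests more compactly into its two cache lines:

```c
#include <stdatomic.h>
#include <stdint.h>

/* A single-producer/single-consumer request slot in the spirit of
 * ffwd-style delegation. The client writes a request and bumps
 * req_seq; the single-threaded server polls, executes the request
 * lock-free, and bumps resp_seq to publish the answer. */
typedef struct {
    _Atomic uint64_t req_seq;    /* bumped by client to submit */
    uint64_t op, arg;            /* request payload */
    _Atomic uint64_t resp_seq;   /* bumped by server when done */
    uint64_t result;
} __attribute__((aligned(64))) msg_slot;

void client_send(msg_slot *s, uint64_t op, uint64_t arg) {
    s->op = op;
    s->arg = arg;
    atomic_fetch_add_explicit(&s->req_seq, 1, memory_order_release);
}

/* Returns 1 if a pending request was served, 0 if the slot was idle. */
int server_poll(msg_slot *s, uint64_t (*handle)(uint64_t, uint64_t)) {
    if (atomic_load_explicit(&s->req_seq, memory_order_acquire) ==
        atomic_load_explicit(&s->resp_seq, memory_order_relaxed))
        return 0;                /* nothing pending */
    s->result = handle(s->op, s->arg);
    atomic_fetch_add_explicit(&s->resp_seq, 1, memory_order_release);
    return 1;
}

/* Illustrative metadata handler standing in for a table operation. */
static uint64_t demo_handler(uint64_t op, uint64_t arg) { return op + arg; }
```

Because each server owns its partition exclusively, handlers need no locks; scalability comes from adding partitions and servers rather than from finer-grained locking.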
Metadata-only Delegation. When a user thread accesses block data, TriCache divides the block access into a metadata operation and a data operation. Metadata operations include address translations, reference count management, and evict policy enforcement. Data operations are memory accesses, such as load and store. In TriCache, only metadata operations are processed by servers through delegation, whereas clients issue data operations by themselves.
A block is accessed in three stages. In the first stage, the client asks the server to pin the block in memory and return the address of its cached copy. In the second stage, the client performs memory accesses on the cached copy directly. In the final stage, the client notifies the server to unpin the block so that it becomes evictable again.
This design eliminates redundant memory copies between servers and clients. Servers focus on metadata operations so that a few server threads can achieve good performance. Meanwhile, it helps TriCache provide the same consistency and atomicity guarantees as memory, which is necessary for user transparency and compatibility with in-memory applications. The CPU directly executes data operations on the client side via memory instructions, and cache coherence is ensured by hardware. TriCache only needs a memory fence to ensure that updates are visible before evicting modified blocks.
Client-side Fast Paths on Cache Hits. We propose using concurrent block tables to avoid server-side synchronizations on cache hits. A client first tries to directly find a block in the block table and update the block reference counts (number of clients in use) by using atomic operations. If it succeeds, the client can translate the address from the concurrent block table by itself, thus skipping synchronous message passing. The client then sends an asynchronous message if this direct operation changes the reference count from 0 to 1 or conversely, to notify the server to update the evict policy for the block. Multiple asynchronous messages can be batched and processed together to amortize message passing overheads.
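The cache-hit fast path can be sketched as follows; the entry layout and function names are illustrative, and the sketch omits the retry logic a production implementation needs when an entry is concurrently invalidated or replaced:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified block table entry: valid entries map a block ID to an
 * in-memory slot; refcount counts clients currently using the block. */
typedef struct {
    _Atomic uint64_t refcount;
    uint64_t block_id;
    uint64_t mem_id;
    bool valid;
} block_entry;

/* Try to pin a cached block without contacting the server. Returns
 * true on a cache hit; *first_ref tells the caller whether it must
 * send an asynchronous message (refcount went 0 -> 1) so the server
 * stops considering this block for eviction. */
bool try_pin_fast(block_entry *e, uint64_t block_id, bool *first_ref) {
    if (!e->valid || e->block_id != block_id)
        return false;            /* miss: fall back to delegation */
    uint64_t old = atomic_fetch_add(&e->refcount, 1);
    *first_ref = (old == 0);
    return true;
}

/* Unpin; *last_ref signals the 1 -> 0 transition, after which the
 * server may again evict the block. */
void unpin(block_entry *e, bool *last_ref) {
    uint64_t old = atomic_fetch_sub(&e->refcount, 1);
    *last_ref = (old == 1);
}
```

Only the 0-to-1 and 1-to-0 transitions generate (asynchronous) messages, so repeated accesses to a pinned block touch no server at all.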
Workflow and Implementation. Figure 5 shows the workflow and implementation of Shared Cache. Each user thread corresponds to a Shared Cache Client and gets its unique Client ID. Each partition has a polling-based message passing server that is used to process requests sent from clients and return results to them. In addition, each partition maintains a concurrent block table for blocks cached in the memory, and each entry in the block table stores the Block ID (BID) of the block, Memory ID (MID) of its in-memory cache, and its metadata (Meta). The metadata include information on whether the block is available and whether it has been modified, and the reference count pertaining to the clients in use. We use the compact hashed block table similar to Yaniv and Tsafrir [51], in which each entry occupies only an average of 8 bytes, and uncached blocks do not occupy memory.
Fig. 5. Shared cache of TriCache.
The cached blocks are indexed by their Memory IDs and stored in a Memory Pool. Meanwhile, an Evict Policy tracks all cached blocks with a zero reference count as they can be safely evicted. Every time the cache is full, the Evict Policy chooses and evicts one or more blocks based on its statistics and strategies. TriCache uses the CLOCK algorithm [39] by default, and it is replaceable, allowing users to customize the policy based on their application characteristics. In addition, policy implementations in TriCache are completely single-threaded, so users do not need to consider any concurrency issues. At the bottom, an asynchronous I/O backend (Storage Backend) manages pending I/O requests in an I/O Queue to continuously poll and process I/O operations. The I/O backend is also customizable and defaults to SPDK, which is backed by user-space NVMe drivers. A kernel-space alternative based on Linux AIO is also supported as another candidate.
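Because policy implementations are single-threaded, even the default CLOCK policy reduces to a few lines; the following minimal sketch (slot count and bookkeeping are simplified for illustration) shows the second-chance sweep, with pinned slots skipped just as blocks with a nonzero reference count are:

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal single-threaded CLOCK policy over NSLOTS cache slots. Each
 * slot has a reference bit set on access; the hand sweeps circularly,
 * granting each recently used slot a second chance before choosing a
 * victim. Pinned slots (nonzero refcount) are never evicted. */
#define NSLOTS 4
static bool refbit[NSLOTS];
static bool pinned[NSLOTS];
static size_t hand;

void clock_touch(size_t slot) { refbit[slot] = true; }

size_t clock_evict(void) {
    for (;;) {
        size_t s = hand;
        hand = (hand + 1) % NSLOTS;
        if (pinned[s]) continue;                         /* in use */
        if (refbit[s]) { refbit[s] = false; continue; }  /* second chance */
        return s;                                        /* victim */
    }
}
```

Since only the owning server thread runs the policy, no synchronization appears anywhere in it, which is what makes custom policies easy to drop in.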
When a user thread operates on a block, its client (N in the figure) first computes the Partition ID (M in the figure) by a predefined partition function. The client then searches for a valid block entry from the block table of Partition M, and if such an entry exists, the client tries increasing the reference count by using atomic operations. If the atomic operations succeed on cache hits (Cache-hit Fast Path in Figure 5), the client pins the block in memory and can directly query the memory address of the cached block. The client may further send an asynchronous request to the server if it is updating the reference count from 0 to 1, or the converse. The server then performs the corresponding actions according to the Evict Policy, such as enabling or disabling evictions of the block.
If the atomic operations fail on cache misses or when the block is being swapped in/out, the client requests a remote operation via synchronous message passing, immediately releases CPU resources, and waits for responses from the server (Remote Operation in Figure 5). After receiving the request, the server creates a block table entry, marks it invalid while the I/O is in flight, submits an asynchronous read to the Storage Backend, and continues serving other requests; once the I/O completes, the server validates the entry and responds to the waiting client.
We use a micro-benchmark on a 128-core machine to test the effectiveness of TriCache Shared Cache. TriCache Shared Cache can scale linearly to 256 threads (1/8 of the threads are servers), reaching 96.8M ops/s, and the hybrid mechanism provides an improvement of \(52\%\) compared with the delegation-only approach.
The preceding discussion focuses on architectures that use only threads for parallel processing. The message passing design can perform better in parallel architectures with stackful coroutines (also known as fibers). With several coroutines multiplexed on a common thread, the message passing module of TriCache can send or receive multiple operations together to amortize the overheads of inter-thread synchronization. Further, on a cache miss, the client must wait for the server to return the requested address before resuming execution. Under the pure threading architecture, TriCache has to release the CPU to the OS and switch between user threads to utilize these waiting CPU resources. When the thread manages multiple coroutines, TriCache can stay in user space and perform lighter coroutine switching instead of thread switching through the OS, further saving CPU for the user's application.
3.3 Software Address Translation Cache
The Shared Cache of TriCache provides scalable I/O performance and an efficient set-associative cache. However, block table lookups and atomic operations are required for each access on cache hits, still limiting the performance of TriCache.
Guiding Ideas. Considering the manual use of the block cache (e.g., Figure 1), users call the acquire/release interface only once per block and then operate on the returned pointer, manually exploiting locality so that the cost of a cache access is amortized over many memory operations.
In contrast, we design TriCache to automatically exploit locality to simulate manual coding without requiring human effort. We propose to build a two-level SATC on top of Shared Cache. The higher-level cache stores hotter data and provides faster access and smaller capacities than the lower-level cache, similar to the multi-level cache of the CPU and hierarchical storage [30, 49, 56]. Based on this idea, we now show how to implement the multi-level cache in software and where to divide the levels.
SATC Design. In our design, only the last-level cache manages data, and higher-level translation caches manage only metadata, such as modifying the reference counts of the blocks and translating block IDs to memory addresses. Managing metadata instead of data can help avoid redundant memory consumption, additional memory copies, and memory consistency issues caused by the multi-level design. The multi-level cache of TriCache is designed to be an inclusive cache, which means that all blocks in the higher-level cache are also present in the lower-level cache. With this inclusive policy, higher-level caches need to only interact with their next level. Moreover, TriCache guarantees that the capacity of higher-level caches is no greater than the lower-level cache, thus eliminating out-of-space errors from the lower-level cache when the higher-level cache requests to swap in blocks.
On top of the Shared Cache, we build a thread-local set-associative cache called Private SATC. When the Private SATC hits, the user thread uses its thread-local block table and evict policy, and only when the Private SATC swaps in/out blocks does the user thread need to operate on the Shared Cache. Private SATC is intended to reduce Shared Cache operations for the hot data of its thread. Examples include thread-local hot data when each thread computes a segment of data independently, and hot elements shared by all threads when processing skewed data. Private SATC also helps reduce concurrent block table operations with cross-NUMA memory accesses and false sharing, which can incur three to eight times the latency of local memory accesses in our evaluation.
We further build a direct-mapped cache called Direct SATC on top of the Private SATC to alleviate the overheads of hash table lookups and evict policy maintenance. Direct SATC maintains a few recently accessed pages in a fixed-size array, reducing address translation to a few bitwise operations and avoiding evict-policy updates on every access. The goal of Direct SATC is to cover multiple consecutive operations on the hottest blocks, such as sequential reads and writes, displacing the kind of manually written block-aware code shown in Figure 1.
Meanwhile, the addition of Direct SATC also fits parallel architectures with stackful coroutines. With TriCache, each coroutine has a fiber-local Direct SATC, coroutines scheduled within a thread share a thread-local Private SATC, and all threads share the Shared Cache. When multiple coroutines map to the same thread, Direct SATC can automatically ensure that the hottest blocks of each coroutine are not swapped out by Private SATC, reducing re-fetches from the Shared Cache when coroutine execution resumes.
Implementation. As shown in Figure 6, we implement an inclusive multi-level cache with three levels: a Direct SATC at the top, a thread-local Private SATC in the middle, and the Shared Cache at the bottom.
Fig. 6. Software Address Translation Cache.
The right-hand side of Figure 6 presents the state machine maintained in TriCache. Starting from the Disk Only state, the user thread loads blocks into Shared Cache, Private SATC, and Direct SATC by performing address translations on them.
In the default configuration, the aggregated capacity of Private SATC entries equals that of Shared Cache entries, the largest Private SATC permitted by the inclusive policy; retaining as many Shared Cache entries as possible locally helps TriCache reduce the number of inter-thread synchronizations. Additionally, each Direct SATC is 1/4 the size of the Private SATC on the same thread. These SATC sizes are configurable to provide better performance under different workloads.
Our evaluation shows that SATC can improve performance tens of times over Shared Cache on real-world workloads. When SATC can absorb all accesses, TriCache can reach \(57\%\) and \(91\%\) of the in-memory performance for purely random and nearly sequential access patterns, respectively, making it practical to operate the block cache at per-access granularity.
3.4 Compile-time Instrumentation
With the help of SATC, TriCache opens up opportunities to provide a virtual memory interface and make TriCache fully transparent to users. To this end, we propose a purely user-space scheme based on compile-time instrumentation and library hooking techniques.
Memory Layout. We first modify the memory layout as shown in the upper part of Figure 7. In Linux, the current x86_64 virtual memory layout (with four-level page tables) consists of three main parts: user space takes the 47-bit region at the beginning, kernel space occupies the 47-bit region at the end, and most of the space in the middle is a hole of non-canonical virtual memory. We map the TriCache-managed disks into this unused hole block by block, starting from 0x800...000, as TriCache-space virtual memory. Memory addresses in TriCache-space can be translated into actual user-space memory addresses by calling the runtime's translation function.
Fig. 7. Memory layout and LLVM instrumentation.
Instrumentation. To enable transparent read and write operations on top of the TriCache block cache, we perform a translation before each memory operation (e.g., load, store, and atomic operations) so that TriCache-space addresses can be used just like user-space memory. TriCache does not require any manual code modifications with the help of compile-time instrumentation. The lower part of Figure 7 shows pseudo-code for instrumenting load and store instructions in LLVM IR. TriCache instrumentation inspects the highest bit of an address to determine whether it refers to user-space memory or TriCache-space virtual memory.
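Conceptually, the instrumented load behaves like the following C sketch; cache_translate stands in for the runtime's translation routine and is stubbed over a static buffer here purely for illustration:

```c
#include <stdint.h>

/* Illustrative stub for the runtime translation routine: strips the
 * tag bit and indexes a backing buffer standing in for the cache. */
static char backing[4096];
static char *cache_translate(uint64_t a) {
    return &backing[a & 4095];
}

/* What an instrumented load conceptually does: addresses with the
 * highest bit set lie in TriCache-space and are translated first;
 * ordinary user-space addresses pass through untouched. */
static inline char load_byte(const char *p) {
    uint64_t a = (uint64_t)p;
    if (a >> 63)                  /* TriCache-space address */
        p = cache_translate(a);
    return *p;                    /* plain load either way */
}
```

Because x86_64 user-space pointers always have the highest bit clear, the check adds only a branch that predicts perfectly for pure in-memory data.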
Although instrumentation provides the virtual memory interface for TriCache-space memory, we still need to determine what data should be placed in TriCache-space memory. First, data on the stack does not need to enter the block cache. We perform a dataflow analysis on LLVM IR to identify pointers that provably derive from stack allocations, so that instrumentation on them can be safely elided.
Implementation. Figure 8 illustrates the compiling workflow of TriCache. The user application code is compiled by an LLVM-based compiler. The TriCache LLVM Plugin instruments the code and generates instrumented LLVM IR bitcode. The plugin performs instrumentation after all the optimization passes, so it does not affect compiler optimizations on applications, such as automatic vectorization. TriCache also naturally supports vector instructions, since it leaves the actual memory operations to the CPU.
Fig. 8. Compiling workflow of TriCache.
Then, the bitcode is linked with the precompiled TriCache runtime (including the translation routines and hooked library functions) to produce the final executable.
The TriCache runtime also contains APIs on top of the virtual memory interface for manual optimizations by expert users who want finer control over caching behavior.
4 EVALUATION
We set up our experiments on a dual-socket server equipped with two AMD EPYC 7742 CPUs (64 physical cores and 128 hyper-threads per CPU) and 512GB DDR4-3200 main memory. The storage devices are eight PCIe-attached Intel P4618 DC SSDs that provide 51.2TB capacity, 9.6M 4KB read IOPS, and 3.9M 4KB write IOPS in total. The server runs Debian 11.1 with Linux kernel 5.10 and uses Clang 13.0.1 to compile TriCache and other systems.
In our evaluation, we limit the total available capacity of DRAM for each system under test, so that we can examine workloads at different ratios of memory size to dataset size.
We first evaluate TriCache on four representative domains in terms of end-to-end performance: graph processing (Section 4.1), key-value store (Section 4.2), big-data analytics (Section 4.3), and transactional graph database (Section 4.4). In these experiments, we focus on the end-to-end performance of TriCache and examine different workloads running in memory as well as out of core by limiting the size of the available memory for TriCache.
We then conduct a micro-benchmark using a configurable number of threads that issue load/store instructions. We adjust the hit rates and access patterns to explore circumstances in which TriCache outperforms OS page cache and to assess whether the design of TriCache provides a reasonable tradeoff between in-memory (i.e., cache hit) and out-of-core (i.e., cache miss) performance (Section 4.5).
Finally, we use a series of breakdown experiments to evaluate the factors affecting TriCache's performance, including SATC hit rates and hit latency, the number of threads, I/O backends, evict policies, page sizes, SATC sizes, and threads versus fibers (Section 4.6).
4.1 Performance on Graph Processing
Experimental Setup. Graph processing is a demanding workload for cache systems due to its many small, random accesses to large datasets. We transparently apply TriCache to an in-memory graph processing framework, Ligra [35], and extend it to out-of-core execution by hooking its memory allocations so that graph data resides in TriCache-space virtual memory. We compare Ligra on TriCache against Ligra on OS swapping and against FlashGraph, a specialized out-of-core graph engine.
Both Ligra and FlashGraph use 32-bit vertex IDs, and we force Ligra to use push mode to align with FlashGraph. For FlashGraph, we follow its recommended configuration of creating an XFS filesystem for each SSD block device and binding the device to the corresponding NUMA nodes. Meanwhile, FlashGraph is a semi-external memory graph engine that always stores vertex states in memory and edge lists on SSDs. We thus make TriCache manage at least the edge lists in the cache for a fair comparison.
We evaluate FlashGraph, Ligra on swapping, and Ligra on TriCache by three common graph algorithms: PageRank (PR), Weakly Connected Components (WCC), and Breadth-First Search (BFS). The dataset is a real-world graph dataset, UK-2014 [5, 6], with 788M vertices and 47.6B edges. It requires more than 400GB for Ligra in-memory execution.
In-memory Performance. Figure 9 shows the computation time of FlashGraph, Ligra on swapping, and Ligra on TriCache under different memory quotas. With 512GB of memory, Ligra can process all three algorithms in memory, and TriCache and FlashGraph can buffer all data in their cache. Under this setting, TriCache incurs overheads of only \(34.4\%\) for PR, \(64.0\%\) for WCC, and \(23.5\%\) for BFS. The in-memory performance shows that TriCache can provide efficient address translations and cache hits with its virtual memory interface, owing to the two-level SATC. Meanwhile, TriCache outperforms FlashGraph by \(6.08\times\), \(3.85\times\), and \(2.06\times\), respectively, when the working set can be cached in memory. It illustrates that FlashGraph yields much higher in-memory overheads than TriCache because the block cache of FlashGraph involves redundant memory copies on cache operations with its read/write interfaces.
Fig. 9. Computation time of FlashGraph, Ligra on swapping, and Ligra on TriCache (lower is better).
Out-of-core Performance. Under a 256GB memory limit, the caches start swapping blocks/pages in and out. Compared with its in-memory execution, Ligra on TriCache uses about half the memory while retaining \(47.7\%\) of the performance on PR, \(12.5\%\) on WCC, and \(78.4\%\) on BFS. Additionally, TriCache’s speedups over OS swapping and FlashGraph are \(6.30\times\) and \(5.31\times\) on PR, \(26.1\times\) and \(1.46\times\) on WCC, and \(0.85\times\) and \(2.05\times\) on BFS, respectively.
As the usable memory further decreases, I/O efficiency becomes the main factor affecting performance. For example, with 64GB of memory, the performance of TriCache is \(19.3\times\), \(38.3\times\), and \(26.8\times\) better than that of swapping. In this case, the average I/O bandwidth of TriCache exceeds 12GB/s, and the peak I/O performance reaches 4.8M IOPS and 18GB/s of bandwidth. The difference between the peak and average I/O performance shows that the Bulk Synchronous Parallel model used by Ligra limits I/O performance, and better performance could potentially be achieved with asynchronous execution models. Compared with FlashGraph, TriCache still provides improvements of \(54.8\%\) and \(58.3\%\) on PR and WCC, respectively, whereas Ligra with TriCache is \(34.3\%\) slower than FlashGraph on BFS. This is because FlashGraph adopts a two-dimensional partitioning scheme for out-of-core graph processing, achieving a \(50.1\%\) cache hit rate that reduces I/O volume by \(2.68\times\) compared with TriCache. Still, TriCache provides an average I/O bandwidth \(1.78\times\) higher than FlashGraph and thus narrows the performance gap.
It is noteworthy that the semi-external memory FlashGraph cannot fit vertex states of PR and WCC with 16GB of memory. It leads to out-of-memory errors, whereas TriCache can operate the same dataset fully out-of-core.
The preceding results indicate that TriCache can extend an in-memory graph framework to support out-of-core processing without manual modification and can deliver performance comparable to a well-designed external memory framework. Meanwhile, TriCache outperforms OS swapping by up to \(38.3\times\) while providing the same user transparency.
4.2 Performance on Key-Value Stores
Experimental Setup. Key-value stores manage large amounts of data, requiring cache systems to buffer hot data in memory. We use RocksDB [10], a persistent key-value store widely used in production systems, for the evaluation in this part. RocksDB organizes on-disk data in immutable sorted sequence tables. It provides a block-based table format on top of its user-space block cache, and a plain table format optimized for in-memory performance via memory-mapped files.
We use the mixgraph [8] (prefix-dist) workload proposed by Facebook, which models production use cases at Facebook and emulates real-world workloads of key-value stores with hotness distribution and temporal patterns. The keys and values are 48 and 43 bytes on average, respectively, and there are \(83\%\) reads, \(14\%\) writes, and \(3\%\) scans. We generate 2B key-value pairs (consuming 180GB of space) and execute 100M operations. Both plain and block-based tables use the hash index with a 4-byte prefix. We increase the sharding number of the RocksDB block cache to 1024 to avoid lock contentions on our 256-thread server and use the direct I/O mode for the RocksDB block cache. We also disable WAL to prevent log flushing from becoming a performance bottleneck.
In-memory Performance. Figure 10 illustrates the throughput of Plain Tables on TriCache, Plain Tables on mmap, and Block-Based Tables on the RocksDB block cache under different memory quotas.
Fig. 10. RocksDB throughput with varying memory quotas.
Out-of-core Performance. When RocksDB runs out-of-core, TriCache brings performance improvements of two to three orders of magnitude compared with Plain Tables on mmap.
The in-memory performance indicates that user-transparent TriCache can provide similar performance as manually managed block cache in RocksDB. Meanwhile, TriCache has the potential to help existing systems with in-memory backends, such as RocksDB with Plain Tables, to achieve better out-of-core performance without any manual modifications.
4.3 Performance on Big-Data Analytics
Experimental Setup. TeraSort [38] is a representative application and an important performance indicator in the domain of big-data analytics [14]. Its typical distributed or out-of-core implementation consists of a shuffle phase followed by a sort phase. The shuffle phase produces parallel sequential reads and writes, which is I/O bound [14] and can stress sequential I/O throughput on cache systems. The sort phase requires the cache to buffer the working partition in memory and issues a vast number of string comparisons and copies that can examine the runtime overhead of cache systems.
We generate two TeraSort workloads, 1.5B records (about 150GB) and 4B records (about 400GB). For TriCache, we first use the parallel multi-way merge sort from GNU libstdc++ as an unmodified in-memory baseline (denoted GNU Sort), and we additionally implement a shuffle-then-sort algorithm (denoted Shuffle Sort) that is friendlier to out-of-core execution. We also compare against Spark as a widely used out-of-core baseline.
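The shuffle-then-sort structure can be illustrated as follows (a sketch, not the paper's exact implementation): records are scattered into buckets by their first key byte, and each bucket is then sorted independently; since the bucket ranges are disjoint and ordered, concatenating the sorted buckets yields a globally sorted output.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Phase 1 (shuffle): scatter records into 256 buckets by the first key
// byte, producing sequential writes per bucket. Phase 2 (sort): sort each
// bucket independently (parallelized across threads in practice). The
// record layout and bucket count here are illustrative.
std::vector<std::string> shuffle_sort(const std::vector<std::string>& records) {
    std::array<std::vector<std::string>, 256> buckets;
    for (const auto& r : records)  // shuffle phase
        buckets[static_cast<unsigned char>(r.empty() ? 0 : r[0])].push_back(r);
    std::vector<std::string> out;
    out.reserve(records.size());
    for (auto& b : buckets) {      // sort phase; buckets are already ordered
        std::sort(b.begin(), b.end());
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}
```

The shuffle phase generates per-thread sequential I/O, which is what stresses the cache's sequential throughput, while the sort phase repeatedly touches the working partition in memory.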
In-memory Performance. Figure 11 shows the computation time of TeraSort. On the 150GB dataset, both GNU Sort and Shuffle Sort occupy about 300GB of memory and fit in 512GB of memory. In this case, Shuffle Sort is \(2.01\times\) faster than GNU Sort. Meanwhile, the overheads of TriCache amount to only \(14\%\) for GNU Sort and nearly zero (less than \(1\%\)) for Shuffle Sort. The reason is that the Shuffle Sort algorithm mainly generates sequential reads and writes for each thread, which can be well handled by thread-local Direct SATC and Private SATC. Compared with Spark, GNU Sort and Shuffle Sort on TriCache is faster by \(1.55\times\) and \(3.62\times\), respectively.
Fig. 11. Computation time for TeraSort workloads with different memory quotas (lower is better).
Out-of-core Performance. When the memory quota is less than 256GB for the 150GB workload, TriCache can provide tens of times speedups over swapping, up to \(39.3\times\) for GNU Sort at 128GB of memory and \(57.8\times\) for Shuffle Sort at 64GB of memory. Meanwhile, the performance of Shuffle Sort with TriCache is up to \(20.2\times\) better than Spark at 32GB of memory.
For the 400GB dataset, both algorithms keep executing out-of-core. Shuffle Sort on TriCache is faster than swapping by up to \(43.6\times\) at 32GB of memory and outperforms Spark by up to \(13.7\times\) with the same amount of memory. GNU Sort based on swapping is \(41.2\times\) slower than TriCache Shuffle Sort with 512GB of memory and about \(128\times\) slower when the memory quota is less than 256GB, because of its suboptimal algorithm and the limited performance of the OS page cache.
Compared with the in-memory processing of the 150GB dataset, Shuffle Sort with TriCache saves \(90\%\) of memory at a 32GB quota, whereas its processing time is only \(49.3\%\) longer than at 512GB. We also compare distributed Spark against the TriCache-based scale-up solution. We use four servers with the same hardware configuration, connected with 200Gb HDR Infiniband NICs. TriCache under 32GB of memory outperforms in-memory distributed Spark by \(7.20\times\) using Shuffle Sort and \(1.33\times\) with GNU Sort on the 400GB workload. Thus, TriCache with NVMe SSD arrays can use less memory while providing nearly in-memory performance for TeraSort. In addition, TriCache can nearly saturate the peak bandwidth of our eight NVMe SSDs, reaching 44GB/s for read-only operations and 31GB/s for mixed read/write operations.
In summary, developers can write in-memory programs (e.g., less than 20 lines of C++ code for Shuffle Sort), and TriCache then helps them to fully utilize the high-performance NVMe SSD array, especially when the algorithm is friendly to out-of-core processing.
4.4 Performance on Graph Database
For workloads in graph databases, we evaluate TriCache on LiveGraph [57], an efficient transactional graph database based on OS memory-mapped files. LiveGraph treats memory-mapped files as in-memory data and relies on atomic memory accesses and cache consistency to support transactional queries. It can examine whether a user-transparent block cache is able to provide the same semantics as in-memory operations. We replace the memory-mapped files with TriCache and compare it with the original LiveGraph. We evaluate their performance on the LDBC SNB interactive benchmark, which simulates user activities in a social network and consists of 14 complex-read queries, 7 short-read queries, and 8 update queries. As the SNB driver occupies part of the memory, we limit LiveGraph to use up to 256GB of memory and generate two workloads: SF30 and SF100 datasets. With LiveGraph, these datasets take about 100GB and 320GB of memory, respectively. SNB clients request 1.28M operations for the SF30 workload, and 256K operations for the SF100 workload during the benchmark run.
Figure 12 shows the SNB throughputs of LiveGraph on TriCache and on mmap under different memory quotas.
Fig. 12. Throughput of LiveGraph on TriCache and mmap.
We then take a closer look at the latency metrics when running SF100 with 256GB of memory. TriCache cuts the average latency on complex queries by \(11.5\times\), on short queries by \(1.79\times\), and on update queries by \(21.1\times\) (geometric means). The P999 tail latency of TriCache also stays \(10.9\times\) lower than that of mmap.
4.5 Micro-benchmarks
We conduct two custom multi-threaded micro-benchmarks that issue random memory-load instructions. The first generates random 8-byte accesses (the 8B Random workload), which stresses the systems with completely random memory accesses. We control the random pattern to generate operations with different block cache hit rates, and we also adjust the hit rate of Private SATC to examine its performance impact. The second randomly chooses 4KB pages and sequentially accesses each page in 8-byte words (the 4KB Random workload) to evaluate the performance when a page is accessed multiple times.
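The two access patterns can be sketched as follows; the buffer sizes, operation counts, and function names are illustrative, and a real run would size `mem` far beyond DRAM to exercise the cache:

```cpp
#include <cstdint>
#include <cstddef>
#include <random>
#include <vector>

// 8B Random: each operation loads one random 8-byte word.
std::uint64_t run_8b_random(const std::vector<std::uint64_t>& mem,
                            std::size_t ops, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < ops; ++i)
        sum += mem[rng() % mem.size()];          // one random 8-byte load
    return sum;
}

// 4KB Random: pick a random 4KB page, then sweep all its 8-byte words,
// so each chosen page is accessed 512 times.
std::uint64_t run_4kb_random(const std::vector<std::uint64_t>& mem,
                             std::size_t pages, std::uint64_t seed) {
    const std::size_t words_per_page = 4096 / 8;
    const std::size_t num_pages = mem.size() / words_per_page;
    std::mt19937_64 rng(seed);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < pages; ++i) {
        std::size_t base = (rng() % num_pages) * words_per_page;
        for (std::size_t w = 0; w < words_per_page; ++w)
            sum += mem[base + w];                // sequential sweep of one page
    }
    return sum;
}
```

The returned checksum prevents the compiler from eliding the loads; in the multi-threaded benchmark, each thread runs its own loop over a shared region.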
We compare TriCache with Linux mmap (backed by the OS page cache) and FastMap [27].
Figure 13 shows TriCache’s speedup compared with mmap and FastMap.
Fig. 13. The speedup of TriCache over mmap and FastMap on 8B Random and 4KB Random workloads.
For 8B Random workloads, the performance of TriCache is about \(11\%\) of the corresponding in-memory performance when all accesses hit the cache.
Once the memory hit rate drops to \(90\%\), TriCache can provide improvements of \(18.6\times\) to \(31.5\times\) over mmap.
For 4KB Random workloads, Direct SATC can absorb most of the in-memory overheads of TriCache because sequential accesses within the same page are handled by Direct SATC with very lightweight address translation. As a result, the performance of TriCache reaches \(84\%\) to \(91\%\) of the corresponding in-memory performance.
4.6 Performance Breakdown
In this section, we use a series of breakdown experiments to evaluate some performance-related impacts and configurations of TriCache, including hit rates and hit latency of SATC, number of threads, I/O backends, evict policies, page sizes, SATC sizes, and threads and fibers. For the breakdown experiments, we select five cases under 64GB of memory that are running out of core: PR, RocksDB, Shuffle, and GNU Sort for the 400GB TeraSort dataset, and LiveGraph for the SNB SF100 workload.
Breakdown Analysis of SATC. The three columns of Table 2 break down the performance impact of SATC by gradually removing SATC levels from TriCache. W/O Direct disables Direct SATC, W/O Private disables Private SATC, and Shared Only uses only Shared Cache by removing both SATC levels.
Table 2. Performance Slowdown by Removing SATC Levels
According to the performance degradation listed in Table 2, SATC is an essential component contributing to the good performance of TriCache. The slowdown that occurs by disabling SATC (Shared Only) is \(20.8\times\) on average for the five cases. Even when the memory quotas are less than 1/5 of the working set (i.e., running out-of-core), SATC still yields a speedup of \(40.1\times\) for PR, \(10.1\times\) for Shuffle Sort, and \(57.9\times\) for GNU Sort.
Meanwhile, both Direct SATC and Private SATC are indispensable to TriCache. Without Direct SATC, the performance of PR degrades by \(2.75\times\) because accessing each edge incurs a heavy overhead due to hash table lookups and evict policy maintenance. However, PR is not sensitive to Private SATC because the dataset is more than \(5\times\) larger than the available memory, and the edges are visited only once per iteration. For the shuffle phase of Shuffle Sort, the performance drops by \(5.39\times\) without Private SATC but remains almost the same (only \(4.8\%\) slower) without Direct SATC. The reason is that string copies constitute the bottleneck in the shuffle phase and are optimized by the compiler into bulk memory copies; each thread interleaves these copies across many partition buffers, an access pattern that exceeds the reach of the small Direct SATC but is still covered by the larger Private SATC.
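To make the lookup order through the SATC levels concrete, the following simplified sketch tries the Direct SATC first, then the thread-local Private SATC, and finally falls back to the Shared Cache (the data structures, sizes, and promotion rules are illustrative, not TriCache's actual layout):

```cpp
#include <cstdint>
#include <cstddef>
#include <optional>
#include <unordered_map>

// One per-thread translation structure: Direct SATC caches the most
// recent translation; Private SATC is a thread-local map; anything else
// falls through to the Shared Cache (not modeled here).
struct Satc {
    std::uint64_t direct_page = UINT64_MAX;  // Direct SATC: last page seen
    std::uint64_t direct_frame = 0;
    std::unordered_map<std::uint64_t, std::uint64_t> priv;  // Private SATC
    std::size_t direct_hits = 0, private_hits = 0, misses = 0;

    // Translate a virtual page number; std::nullopt means "ask Shared Cache".
    std::optional<std::uint64_t> translate(std::uint64_t page) {
        if (page == direct_page) { ++direct_hits; return direct_frame; }
        auto it = priv.find(page);
        if (it != priv.end()) {
            ++private_hits;
            direct_page = page; direct_frame = it->second;  // promote to Direct
            return it->second;
        }
        ++misses;  // Shared Cache lookup needed
        return std::nullopt;
    }

    // After the Shared Cache resolves a miss, install the translation.
    void fill(std::uint64_t page, std::uint64_t frame) {
        priv[page] = frame; direct_page = page; direct_frame = frame;
    }
};
```

This ordering explains the measurements above: workloads with tight per-thread locality resolve almost everything in the first check, while scattered-but-reused pages land in the second.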
Multi-level Cache in TriCache. Next, we use PR, Shuffle Sort, and GNU Sort to further examine the design of the multi-level cache in TriCache. Table 3 lists the miss rates for each level of the cache, the average hit cycles for Direct SATC and Private SATC, and the average access cycles for Shared Cache.
According to the miss rates listed in Table 3, Direct SATC and Private SATC can handle most memory accesses. The miss rate of Direct SATC is less than \(5\%\) for all three workloads, and the miss rate of Private SATC is less than \(1\%\) for Shuffle Sort and GNU Sort. The results show that SATC covers most accesses, which explains the preceding performance.
Additionally, the hit cycles of Direct SATC and Private SATC in Table 3 show that the software address translation of TriCache is quite efficient. Direct SATC hits in PR and Shuffle Sort cost approximately 50 cycles on average, Direct SATC hits in GNU Sort and Private SATC hits in Shuffle Sort take about 150 cycles, and Private SATC hits in GNU Sort take about 450 cycles. To put these numbers in perspective, we list some hardware latencies: 50 cycles is close to a NUMA-local L3 cache hit or L2 cache false sharing within a NUMA node, 150 cycles is about half of a NUMA-local memory access, and 450 cycles is less than a cross-NUMA memory access or cross-NUMA cache false sharing. Therefore, TriCache with SATC is efficient enough to provide a virtual memory interface while delivering memory-comparable performance.
Performance and Numbers of Threads. We also compare the performance of TriCache and baselines under different numbers of threads for PR and RocksDB with 64GB of memory. More precisely, “the performance under a given number of threads” means the maximum performance with less than or equal to this number of threads (only searched over powers of 2). Since TriCache uses 16 server threads as the default configuration in Section 4, the number of threads starts with 32 threads (including server threads).
As shown in Figure 14, TriCache achieves good scalability for both the PR and RocksDB workloads, which is one of the reasons TriCache performs well. For example, from 32 threads to 256 threads (the number of hardware threads), Ligra with TriCache (in Figure 14(a)) achieves a \(4.29\times\) speedup, and RocksDB on top of TriCache (in Figure 14(b)) yields a \(13.5\times\) performance improvement.
Fig. 14. Performance of TriCache and baselines under different numbers of threads.
Meanwhile, with a small number of threads, TriCache’s performance is worse than that of manually optimized prefetching and asynchronous I/O because of TriCache’s synchronous scheme for triggering I/O and its lack of program-specific optimizations (similar to FlashGraph and the RocksDB block cache). As the number of threads grows, however, TriCache scales better and eventually outperforms these baselines.
Performance and I/O Backends. TriCache currently supports SPDK and Linux AIO as its storage backends and uses SPDK to handle I/O operations in the default configuration. With the NVMe-oF (NVMe over Fabrics) feature provided by SPDK and the Linux kernel, TriCache can also access remote NVMe SSDs through the network, making it a good option for disaggregated architectures. To this end, we evaluate the performance of TriCache with SPDK and Linux AIO backends operating on local SSDs and NVMe-oF remote SSDs. To simulate a disk pool on the network, another server with the same hardware is connected to our testbed via Infiniband and acts as the remote disk server. We also include two different network configurations to evaluate the performance impact of network bandwidth, one with dual 100Gb EDR Infiniband NICs and one with dual 56Gb FDR Infiniband NICs. For the software configuration, both SPDK and Linux AIO backends use the SPDK-provided NVMe-oF targets (i.e., NVMe-oF servers); the AIO backend uses the NVMe-oF client provided by the Linux kernel, whereas the SPDK backend uses the NVMe-oF client from the SPDK framework.
Figure 15 shows the relative performance when using the AIO and SPDK backends operating on local or remote SSDs. The baseline is SPDK and local SSDs.
Fig. 15. Performance of TriCache with different I/O backends.
When using local SSDs, SPDK performs 1.64\(\times\) better than Linux AIO in terms of the (geometric) average, demonstrating that the user-space NVMe driver enables better I/O performance. Nevertheless, SPDK has some drawbacks, such as high programming complexity, deployment difficulties, and difficulty in sharing devices among multiple applications. Fortunately, TriCache hides SPDK programming details from users, allowing them to write in-memory programs and still achieve efficient out-of-core performance. Moreover, the design of TriCache is not coupled to SPDK and can provide comparable performance with Linux AIO. If using or deploying SPDK is not feasible, AIO can serve as a reasonable alternative backend for TriCache.
When using remote SSDs through NVMe-oF, the SPDK backend delivers 0.80\(\times\) and 0.75\(\times\) the performance (in geometric average) of local disks in the dual EDR and dual FDR Infiniband network environments, respectively. This demonstrates that TriCache can serve local high-performance NVMe SSD arrays and can also be applied to disaggregated architectures that expand local memory capacity by leveraging remote SSDs on other servers. In the experiments, the most significant performance degradation is seen in the Shuffle Sort workload, which suffers a 45% performance drop with remote SSDs. The gap is mainly attributed to the network bandwidth of our experimental environment, which cannot match the more than 40GB/s peak bandwidth of the local NVMe SSD arrays. The Shuffle Sort workload is bandwidth-bound, so this bandwidth gap leads to the preceding slowdown. It also illustrates that local NVMe SSD arrays retain some cost and bandwidth advantages over far memory and remote SSDs under the disaggregated architecture. Meanwhile, when using the same NVMe-oF SSDs, the SPDK backend provides about a 2.1\(\times\) performance advantage over the Linux AIO backend, a more significant gap than on local disks (versus about 1.6\(\times\) locally), indicating that SPDK has less software overhead than the Linux kernel when operating NVMe-oF remote disks.
Even though SPDK already provides the functionality to operate remote disks, there is still about 15% performance degradation when using the NVMe-oF feature of SPDK. Additionally, the experimental results illustrate that these performance drops are not caused by network bandwidth limitations for four of the five workloads: in these cases, the performance losses are approximately the same under the EDR and FDR Infiniband networks (a 12.5% drop for dual EDR NICs and 17.5% for dual FDR NICs). Therefore, how to design caching systems for disaggregated architectures and remote SSDs remains an open research question, and we intend to consider complete system designs as future work.
Performance and Evict Policy. TriCache supports customizable evict policies and uses the CLOCK algorithm as its default. In this set of experiments, we replace the CLOCK policy of TriCache’s Shared Cache and Private SATC to compare different algorithms under different workloads. We additionally implement LRU, the 2Q algorithm, and a Random algorithm with random page selection for TriCache. Because TriCache designs its Shared Cache around message passing and the Private SATC is thread-local, the evict policy runs single-threaded and requires no concurrency control. It is thus easy to implement evict algorithms for the Shared Cache and Private SATC of TriCache—for example, only about 25 lines of code are enough to implement an LRU algorithm.
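As a concrete illustration of how compact such a policy can be, the following single-threaded LRU fits in roughly the 25 lines mentioned above; the `access` interface is hypothetical and may differ from TriCache's actual evict-policy API:

```cpp
#include <cstdint>
#include <cstddef>
#include <list>
#include <unordered_map>

// Single-threaded LRU sketch. No locking is needed because each Private
// SATC and Shared Cache server in TriCache is single-threaded.
class Lru {
    std::size_t capacity_;
    std::list<std::uint64_t> order_;  // front = most recently used
    std::unordered_map<std::uint64_t,
                       std::list<std::uint64_t>::iterator> pos_;
public:
    explicit Lru(std::size_t capacity) : capacity_(capacity) {}

    // Record an access; returns the evicted page, or UINT64_MAX if none.
    std::uint64_t access(std::uint64_t page) {
        auto it = pos_.find(page);
        if (it != pos_.end()) {  // hit: move to front, nothing evicted
            order_.splice(order_.begin(), order_, it->second);
            return UINT64_MAX;
        }
        order_.push_front(page);  // miss: insert at front
        pos_[page] = order_.begin();
        if (pos_.size() <= capacity_) return UINT64_MAX;
        std::uint64_t victim = order_.back();  // evict least-recent page
        pos_.erase(victim);
        order_.pop_back();
        return victim;
    }
};
```

The `splice` call moves the hit entry to the front in O(1) without invalidating the stored iterators, which is what keeps the whole policy this short.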
Figure 16 shows the relative performance with different evict policies, where the baseline is the CLOCK algorithm. For all five workloads, the performance with different policies is approximately the same, and the difference is within 3% of the geometric average. CLOCK performs the best (only with a slight advantage) among the four algorithms on the tested workloads, achieving the best performance on three out of five workloads (including PR, Shuffle Sort, and GNU Sort). The CLOCK algorithm can achieve almost the same cache hit rate as LRU and 2Q for these applications, but with lower maintenance overhead. For RocksDB, the Random algorithm has the smallest overhead since it does not require maintaining any information for eviction at the time of data access, which can improve the performance by about 1.2%. Additionally, the LRU and 2Q algorithms handle complex workloads better by maintaining more access information than CLOCK, so they bring better performance on the LiveGraph workload compared to CLOCK by 2.1% and 1.5%, respectively.
Fig. 16. Performance of TriCache with different evict policies.
Performance and Page Size. TriCache uses a software implementation of page tables in user space, which enables customizing the page size instead of relying on CPU-defined pages. For example, the page size of TriCache is configured to 128KB in the end-to-end Shuffle Sort experiment in Section 4.3, where 128KB pages help TriCache maximize sequential I/O performance. For other workloads, TriCache uses a 4KB page size by default, aligned with the Linux page cache. In this set of experiments, we decrease the page size for the Shuffle Sort workload and increase it for the other workloads to evaluate how much performance improvement can be achieved with TriCache’s configurable page size.
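This flexibility comes from the fact that a software page table only needs page sizes that are powers of two, so translating an address into a (page, in-page offset) pair reduces to a shift and a mask. A minimal sketch (the struct and field names are illustrative):

```cpp
#include <cstdint>

// Illustrative arithmetic behind a configurable software page size:
// shift = log2(page size), e.g., 12 for 4KB pages, 17 for 128KB pages.
struct PageGeometry {
    unsigned shift;

    std::uint64_t page_of(std::uint64_t addr) const {
        return addr >> shift;  // which software page the address falls in
    }
    std::uint64_t offset_in_page(std::uint64_t addr) const {
        return addr & ((std::uint64_t{1} << shift) - 1);  // byte offset
    }
};
```

Because the hardware MMU is not involved, changing the page size is a pure configuration change in such a design, unlike huge pages in the OS page cache.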
Figure 17 shows the relative performance of different workloads at page sizes from 4KB to 128KB, where the Shuffle Sort workload has a baseline of 128KB pages and the other workloads have a baseline of 4KB pages. For PR, larger page sizes of 8KB to 32KB can provide up to 13% performance improvement (with 16KB pages) over the default 4KB page size. The GNU Sort workload behaves similarly to PR, with \(1.36\times\) higher performance at a 16KB page size than with 4KB pages. For Shuffle Sort, a 128KB page size can fully utilize the bandwidth of the SSD array and provides \(1.85\times\) the performance of the 4KB page size. From the hardware perspective, a 4KB access granularity is not large enough for SSDs to reach their maximum bandwidth, showing that configurable page sizes can improve TriCache’s performance, especially when the bottleneck is sequential I/O bandwidth.
Fig. 17. Performance of TriCache with different page sizes.
However, workloads may have different performance curves, and misconfigured page sizes can also cause performance degradation. For example, the configurable page size of TriCache brings only about 2% throughput improvement to the LiveGraph workload at 8KB pages, whereas other page sizes instead degrade its performance. For the RocksDB workload, the optimal configuration is the default 4KB page size, and larger pages produce worse performance; if the page size of TriCache is set to 16KB, the RocksDB throughput drops by about 38%. Therefore, TriCache still uses the traditional 4KB pages as the default configuration, leaving users to find the optimal page size for their workloads if necessary.
Performance and SATC Sizes. Besides the page size, the sizes of the two SATC levels are also configurable in TriCache. By default, TriCache sets the aggregated capacity of Private SATC to be equal to that of the Shared Cache and makes Direct SATC 1/4 the size of Private SATC. In this set of experiments, we resize the Direct SATC and Private SATC separately to examine the performance impact of different SATC sizes.
The experiments include tests running out-of-core, which are cases of the preceding breakdown experiments with 64GB of memory, and the in-memory cases for each workload. In the experiments, we adjust the size of SATC from the same size as the next (larger) level cache to a minimum of 1/16 of the next level cache. The performance results show that the majority of cases (i.e., Figure 18(a)–(d), Figure 18(g), and Figure 18(h)) are not sensitive to the size of SATC. Similar to TLB, SATC is used to handle address translation, which requires only a small amount of memory to be able to cache the addresses of hot data.
Fig. 18. Performance of TriCache with different SATC sizes.
However, there are still some cases where different SATC sizes lead to more than 50% performance differences, as shown in Figure 18. The Shuffle Sort workload expects that all the scanning headers of the shuffle phase are held in memory if possible, which requires a large Private SATC (about 1/4 the size of the Shared Cache for the out-of-core case) to store them. When Private SATC is not large enough to store all the headers, the performance can drop by about 55%. For GNU Sort running out-of-core (Figure 18(h)), the performance is better when the size of Private SATC is 1/4 or 1/2 of the Shared Cache, and the highest performance is achieved when Direct SATC, Private SATC, and Shared Cache are all the same size. Another case is LiveGraph running in-memory (Figure 18(i)), whose performance curve shows that it is sensitive to the size of Private SATC but not Direct SATC. It benefits from a Private SATC as large as possible, with the best performance when the size of Private SATC is 1/2 of, or equal to, that of the Shared Cache. In general, the default configuration of TriCache, especially maximizing Private SATC, is a good starting point and achieves relatively good performance across the workloads in our experiments. The results also show that TriCache can achieve better performance with parameter searching, so we believe a self-driving caching system is an interesting research problem and leave it as future work.
Threads or Fibers. Some designs of TriCache are optimized for stackful coroutines (also called fibers), as fibers are more lightweight than threads and have the potential to provide better out-of-core performance. This set of experiments compares TriCache working in fiber and thread modes through a micro-benchmark similar to the 4KB Random workload in Section 4.5 with a 10% memory hit rate. It visits a page randomly and then accesses random data within the same page several times. By adjusting the number of random data accesses within a page, we can examine how many CPU resources remain available for computation apart from I/O operations.
In the experiments, the number of threads and fibers is kept the same to fairly compare the two parallel schemes. For the fiber mode, one thread is launched per core, and the fibers are evenly distributed and bound to each core. Figure 19 shows the performance of TriCache using threads and fibers. With only one random access per page access (when the throughput of page accesses is at its maximum), the performance of fibers is 1.12\(\times\) higher than that of threads. When multiple fibers run in a single thread, the message passing module of TriCache can merge their messages and amortize the overheads of inter-thread synchronization, thus providing better I/O performance.
Fig. 19. Performance of TriCache with threads and fibers.
As the number of padded random accesses increases, the performance of both thread and fiber modes gradually degrades, dropping to about 90% of the maximum throughput at 512 random accesses with fibers and 1024 random accesses with threads. With more padded accesses, the advantage of fibers is more significant, reaching 1.76\(\times\) at 4096 random accesses per page. At this point, fibers still deliver more than 5M ops/s of page access, whereas threads can only support 2048 random accesses to achieve similar performance. Additionally, the geometric average speedup for fibers relative to threads is 1.31\(\times\) for different padding accesses. The preceding results show that fibers can leave more CPU resources for computation while achieving the same I/O performance, resulting in better performance in CPU-bound workloads. The reason is that the number of CPU cores currently available is insufficient to fill the queue depth of NVMe SSD arrays, and over-subscriptions are required to fully utilize the disks. The context switching of fibers happens purely in the user space and does not need to enter the kernel space as threads do. Hence, the advantage of fibers becomes apparent when one CPU core is processing multiple parallel tasks simultaneously (with over-subscriptions).
5 RELATED WORK
There is a series of work that tries to improve page caching performance with customized memory-mapped file I/O paths or swapping approaches [25, 26, 27, 28, 37, 41]. Kmmap [26] provides several improvements to reduce the variation in performance owing to the aggressive write-back policy of Linux. FastMap [27] addresses scalability issues by separating clean and dirty pages and using per-core data structures to avoid centralized contention, with the help of a custom Linux kernel. Still, our evaluation shows that FastMap cannot saturate current high-performance NVMe SSD arrays. Aquila [25] offers a library OS solution that eliminates the need for kernel modifications, relying on hardware support for virtualization, which makes it hard to deploy in cloud environments. Umap [28] provides a user-space page management library that services faults on memory-mapped regions through the Linux userfaultfd mechanism.
Block caches (or buffer managers) are critical components in data-intensive applications for supporting out-of-core processing [4, 10, 13, 18, 52, 54]. Some attempts try to improve the performance of block caches. SAFS [53], the storage backend for FlashGraph [54], adopts a lightweight cache design based on NUMA-aware message passing. Users need to program with its asynchronous I/O interface to exploit maximal I/O performance on SSD arrays. LeanStore [18] proposes to use pointer swizzling so that pages residing in memory can be directly referenced without page lookups. However, it requires pages to form a tree-like structure and thus is applicable to limited scenarios. TriCache shares similar goals but provides a memory interface that is user-transparent and more general.
Remote cache systems [12, 22] have been developed upon ideas of the disaggregated architecture [1, 9, 11, 16, 17, 23, 29, 32, 33, 34], which utilizes high bandwidth and low latency of modern networks. In this article, we focus on scaling-up through NVMe SSD arrays. Additionally, we intend to consider support for disaggregated architectures in our future work.
Non-Volatile Memory (NVM) enables larger capacity than DRAM, and research has been devoted to memory management instead of paging strategies to render memory access efficient on hybrid NVM and DRAM architectures [15, 30, 42, 56]. Nevertheless, block caches such as TriCache are still better suited for NVMe SSDs, whose latencies are much higher than those of NVM or DRAM.
6 DISCUSSION
In this section, we present some possibilities where hardware and software could be further codesigned to improve the performance of TriCache.
First, the message passing technique used in Shared Cache decouples clients and servers and eliminates locks on the server side. However, it still requires synchronization between clients and servers due to the nature of memory operations. To solve this problem, processors could add hardware queues among physical cores and corresponding instructions to control them. Such modification would bring native support for message passing semantics on processors and boost not only TriCache but also systems with a similar design (e.g., MPI).
Second, Sections 3 and 4 both mention that stackful coroutines or fibers could further improve the performance of TriCache. However, to the best of our knowledge, no existing framework can express and run parallel tasks with a hybrid thread-coroutine architecture in a general and transparent manner. We expect such a framework would let TriCache apply coroutine techniques to existing systems, completing cache operations purely in user space and achieving further performance improvements.
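The latency-hiding behavior that coroutines enable can be illustrated with a small scheduler sketch. This is not TriCache's implementation: it uses Python generators (all names, e.g., `access` and `run`, are hypothetical) to show how a task that misses in the cache yields instead of blocking, so the scheduler can run other tasks while the simulated I/O completes.

```python
from collections import deque

cache = {0: "blk0"}   # block 0 is resident; all other blocks miss
io_log = []           # order in which missed blocks are loaded

def load_block(bid):
    # Simulated storage read that completes an outstanding miss.
    io_log.append(bid)
    cache[bid] = f"blk{bid}"

def access(bid):
    # Coroutine-style access: yield the block id on a miss so the
    # scheduler can switch to another task instead of blocking.
    while bid not in cache:
        yield bid
    return cache[bid]

def task(name, bid, results):
    data = yield from access(bid)
    results[name] = data

def run(tasks):
    ready = deque(tasks)
    pending = []                  # (coroutine, missed block id)
    while ready or pending:
        while ready:
            coro = ready.popleft()
            try:
                missed = next(coro)
                pending.append((coro, missed))  # parked on a miss
            except StopIteration:
                pass                            # task finished
        if pending:
            # "I/O completion": load one missed block, resume its task.
            coro, bid = pending.pop(0)
            load_block(bid)
            ready.append(coro)

results = {}
run([task("a", 0, results), task("b", 1, results), task("c", 2, results)])
print(results, io_log)
```

A production design would use stackful coroutines so that existing code could be suspended at any call depth without the explicit `yield from` plumbing shown here.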
Finally, SATC in TriCache does not need to be notified by its Shared Cache when a block is swapped out. In contrast, the hardware TLB in processors, which accelerates address translation much like SATC, requires the OS page cache to explicitly invalidate evicted pages through TLB shootdowns, incurring considerable overhead [2, 3] due to inter-processor interrupts. Comparing the two mechanisms, SATC uses reference counting to prevent evicting blocks currently in use by clients, whereas the OS is not directly aware of how many TLB entries still refer to the pages being evicted. It is possible to extend the design of SATC to the TLB: processors could maintain reference counts for page table entries, for example, recording how many TLB entries currently hold a specific page table entry. The OS could then adapt its page swapping and eviction policies to avoid evicting pages currently present in TLBs, mitigating the performance issues caused by TLB shootdowns.
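The reference-counting idea behind SATC's eviction behavior can be sketched as follows. This hypothetical Python example (`RefCountedCache` and its methods are illustrative, not TriCache's API) pins blocks with a reference count; the eviction routine skips any block that is still pinned, so clients never need a shootdown-style invalidation.

```python
import threading

class RefCountedCache:
    """Hypothetical sketch of SATC-style eviction: pinned blocks
    (refcount > 0) are never chosen as victims."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}            # block id -> [data, refcount]
        self.lock = threading.Lock()

    def pin(self, bid, data=None):
        # Insert the block if absent (evicting if full) and bump its count.
        with self.lock:
            if bid not in self.blocks:
                if len(self.blocks) >= self.capacity:
                    self._evict_one()
                self.blocks[bid] = [data, 0]
            self.blocks[bid][1] += 1
            return self.blocks[bid][0]

    def unpin(self, bid):
        with self.lock:
            self.blocks[bid][1] -= 1

    def _evict_one(self):
        # Evict any block with refcount 0; pinned blocks are skipped.
        for bid, (_, refs) in list(self.blocks.items()):
            if refs == 0:
                del self.blocks[bid]
                return
        raise RuntimeError("all blocks pinned; cannot evict")

cache = RefCountedCache(capacity=2)
cache.pin(1, "a")            # block 1 stays pinned by a client
cache.pin(2, "b")
cache.unpin(2)               # block 2 becomes evictable
cache.pin(3, "c")            # evicts block 2, never block 1
print(sorted(cache.blocks))  # [1, 3]
```

The proposed hardware analogue would keep such counts for page table entries held in TLBs, letting the OS steer its eviction policy away from pages that are still mapped.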
7 CONCLUSION
In this article, we explored a new user-space approach to achieving efficient out-of-core processing with in-memory programs by providing a virtual memory interface on top of a block cache. We implemented TriCache based on a novel multi-level design and applied it to various in-memory or out-of-core systems.
The open source implementation of TriCache and instructions to reproduce the main experimental results can be accessed at https://github.com/thu-pacman/TriCache.
Footnotes
1. In case of cache misses, the victim blocks resident in memory need to be replaced with the requested blocks on storage; in case of cache hits, the reference counts need to be updated with locks/latches or atomic operations.
2. https://github.com/jshun/ligra [commit 7755d95].
3. https://github.com/flashxio/FlashX [commit 2a649ff].
4. https://github.com/facebook/rocksdb [tag v6.26.1].
5. https://github.com/apache/spark [tag v3.2.0].
6. https://github.com/thu-pacman/LiveGraph [commit eea5a40].
- [1] 2018. Remote regions: A simple abstraction for remote memory. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC’18). 775–787.
- [2] 2017. Optimizing the TLB shootdown algorithm with page access tracking. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC’17). 27–39. https://www.usenix.org/conference/atc17/technical-sessions/presentation/amit.
- [3] 2020. Don’t shoot down TLB shootdowns! In Proceedings of the 15th European Conference on Computer Systems (EuroSys’20). ACM, New York, NY, 1–14.
- [4] 2016. Thrill: High-performance algorithmic distributed batch data processing with C++. arXiv:cs.DC/1608.05634 (2016).
- [5] 2011. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In Proceedings of the 20th International Conference on World Wide Web (WWW’11). ACM, New York, NY, 587–596.
- [6] 2004. The webgraph framework I: Compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW’04). ACM, New York, NY, 595–602.
- [7] 2008. Breaking the memory wall in MonetDB. Communications of the ACM 51, 12 (Dec. 2008), 77–85.
- [8] 2020. Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST’20). 209–223. https://www.usenix.org/conference/fast20/presentation/cao-zhichao.
- [9] 2015. R2C2: A network stack for rack-scale computers. ACM SIGCOMM Computer Communication Review 45, 4 (2015), 551–564.
- [10] 2017. Optimizing space amplification in RocksDB. In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR’17). 1–9.
- [11] 2016. Network requirements for resource disaggregation. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 249–264.
- [12] 2017. Efficient memory disaggregation with Infiniswap. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI’17). 649–667.
- [13] 2007. Architecture of a Database System. Now Publishers Inc.
- [14] 2010. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In Proceedings of the 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW’10). IEEE, Los Alamitos, CA, 41–51.
- [15] 2015. FOEDUS: OLTP engine for a thousand cores and NVRAM. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 691–706.
- [16] 2016. Flash storage disaggregation. In Proceedings of the 11th European Conference on Computer Systems (EuroSys’16). ACM, New York, NY, 1–15.
- [17] 2017. ReFlex: Remote flash \(\approx\) local flash. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’17). ACM, New York, NY, 345–359.
- [18] 2018. LeanStore: In-memory data management beyond main memory. In Proceedings of the 2018 IEEE 34th International Conference on Data Engineering (ICDE’18). 185–196.
- [19] 2019. KVell: The design and implementation of a fast persistent key-value store. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP’19). ACM, New York, NY, 447–461.
- [20] 2020. KVell+: Snapshot isolation without snapshots. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI’20). 425–441. https://www.usenix.org/conference/osdi20/presentation/lepers.
- [21] 2020. Enabling low tail latency on multicore key-value stores. Proceedings of the VLDB Endowment 13, 7 (March 2020), 1091–1104.
- [22] 2005. Swapping to remote memory over InfiniBand: An approach using a high performance network block device. In Proceedings of the 2005 IEEE International Conference on Cluster Computing. IEEE, Los Alamitos, CA, 1–10.
- [23] 2012. System-level implications of disaggregated memory. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA’12). IEEE, Los Alamitos, CA, 1–12.
- [24] 2014. MMap: Fast billion-scale graph computation on a PC via memory mapping. In Proceedings of the 2014 IEEE International Conference on Big Data (Big Data’14). IEEE, Los Alamitos, CA, 159–164.
- [25] 2021. Memory-mapped I/O on steroids. In Proceedings of the 16th European Conference on Computer Systems (EuroSys’21). ACM, New York, NY, 277–293.
- [26] 2018. An efficient memory-mapped key-value store for flash storage. In Proceedings of the ACM Symposium on Cloud Computing (SoCC’18). 490–502.
- [27] 2020. Optimizing memory-mapped I/O for fast storage devices. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC’20). 813–827. https://www.usenix.org/conference/atc20/presentation/papagiannis.
- [28] 2019. UMap: Enabling application-driven optimizations for page management. In Proceedings of the 2019 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC’19). IEEE, Los Alamitos, CA, 71–78.
- [29] 1996. Distributed shared memory: Concepts and systems. IEEE Parallel & Distributed Technology: Systems & Applications 4, 2 (1996), 63–71.
- [30] 2021. HeMem: Scalable tiered memory management for big data applications and real NVM. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP’21). 392–407.
- [31] 2017. ffwd: Delegation is (much) faster than you think. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP’17). ACM, New York, NY, 342–358.
- [32] 2020. AIFM: High-performance, application-integrated far memory. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI’20). 315–332.
- [33] 2018. LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). 69–87. https://www.usenix.org/conference/osdi18/presentation/shan.
- [34] 2017. Distributed shared persistent memory. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC’17). ACM, New York, NY, 323–337.
- [35] 2013. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’13). 135–146.
- [36] 2008. The GNU libstdc++ parallel mode: Software engineering considerations. In Proceedings of the 1st International Workshop on Multicore Software Engineering (IWMSE’08). 15–22.
- [37] 2016. Efficient memory-mapped I/O on fast storage device. ACM Transactions on Storage 12, 4 (2016), 1–27.
- [38] Sort Benchmark Home Page. Retrieved May 31, 2022 from http://sortbenchmark.org.
- [39] 2015. Modern Operating Systems. Pearson.
- [40] Userfaultfd—The Linux Kernel Documentation. Retrieved May 31, 2022 from https://www.kernel.org/doc/html/latest/admin-guide/mm/userfaultfd.html.
- [41] 2015. DI-MMAP—A scalable memory-map runtime for out-of-core data-intensive applications. Cluster Computing 18, 1 (2015), 15–28.
- [42] 2018. Managing non-volatile memory in database systems. In Proceedings of the 2018 International Conference on Management of Data. 1541–1555.
- [43] 2022. Memory-mapped file. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=Memory-mapped%20file&oldid=1089594834.
- [44] 2022. Memory paging. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=Memory%20paging&oldid=1068326108.
- [45] 2022. NVM Express. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=NVM%20Express&oldid=1090339430.
- [46] 2022. Page cache. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=Page%20cache&oldid=1068818367.
- [47] 2022. PCI Express. Wikipedia. Retrieved May 31, 2022 from https://en.wikipedia.org/w/index.php?title=PCI_Express&oldid=1090153203.
- [48] 2022. U.2. Wikipedia. Retrieved May 31, 2022 from http://en.wikipedia.org/w/index.php?title=U.2&oldid=1066844795.
- [49] 2007. Karma: Know-it-all replacement for a multilevel cache. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07). https://www.usenix.org/conference/fast-07/karma-know-it-all-replacement-multilevel-cache.
- [50] 2017. SPDK: A development kit to build high performance storage applications. In Proceedings of the 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom’17). 154–161.
- [51] 2016. Hash, don’t cache (the page table). ACM SIGMETRICS Performance Evaluation Review 44, 1 (June 2016), 337–350.
- [52] 2016. Apache Spark: A unified engine for big data processing. Communications of the ACM 59, 11 (Oct. 2016), 56–65.
- [53] 2013. Toward millions of file system IOPS on low-cost, commodity hardware. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC’13). ACM, New York, NY, Article 69, 12 pages.
- [54] 2015. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST’15). 45–58. https://www.usenix.org/conference/fast15/technical-sessions/presentation/zheng.
- [55] 2021. Revisiting swapping in user-space with lightweight threading. arXiv:cs.OS/2107.13848 (2021).
- [56] 2021. Spitfire: A three-tier buffer manager for volatile and non-volatile memory. In Proceedings of the 2021 International Conference on Management of Data. 2195–2207.
- [57] 2020. LiveGraph: A transactional graph storage system with purely sequential adjacency list scans. Proceedings of the VLDB Endowment 13, 7 (March 2020), 1020–1034.