Abstract
Many social-media and IoT services have very large working sets consisting of billions of tiny (≈100 B) objects. Large, flash-based caches are important to serving these working sets at acceptable monetary cost. However, caching tiny objects on flash is challenging for two reasons: (i) SSDs can read/write data only in multi-KB “pages” that are much larger than a single object, stressing the limited number of times flash can be written; and (ii) very few bits per cached object can be kept in DRAM without losing flash’s cost advantage. Unfortunately, existing flash-cache designs fall short of addressing these challenges: write-optimized designs require too much DRAM, and DRAM-optimized designs require too many flash writes.
We present Kangaroo, a new flash-cache design that optimizes both DRAM usage and flash writes to maximize cache performance while minimizing cost. Kangaroo combines a large, set-associative cache with a small, log-structured cache. The set-associative cache requires minimal DRAM, while the log-structured cache minimizes Kangaroo’s flash writes. Experiments using traces from Meta and Twitter show that Kangaroo achieves DRAM usage close to the best prior DRAM-optimized design, flash writes close to the best prior write-optimized design, and miss ratios better than both. Kangaroo’s design is Pareto-optimal across a range of allowed write rates, DRAM sizes, and flash sizes, reducing misses by 29% over the state of the art. These results are corroborated by analytical models presented herein and with a test deployment of Kangaroo in a production flash cache at Meta.
1 INTRODUCTION
Many web services require fast, cheap access to billions of tiny objects, each a few hundred bytes or less. Examples include social networks like Facebook or LinkedIn [16, 24, 69], microblogging services like Twitter [72, 73], ecommerce [18], and emerging sensing applications in the Internet of Things [37, 45, 46]. Given the societal importance of such applications, there is a strong need to cache tiny objects at high performance and low cost (i.e., capital and operational expense).
Among existing memory and storage technologies with acceptable performance, flash is by far the most cost-effective. DRAM and non-volatile memories (NVMs) have excellent performance, but both are an order-of-magnitude more expensive than flash. Thus, cost argues for using large amounts of flash with minimal DRAM.
Flash’s main challenge is its limited write endurance; i.e., flash can only be written so many times before wearing out. Wearout is especially problematic for tiny objects because flash can be read and written only at multi-KB granularity. For example, writing a 100 B object may require writing a 4 KB flash page, amplifying bytes written by 40\( \times \) and rapidly wearing out the flash device. Thus, cost also argues for minimizing excess bytes written to flash.
The problem. Prior flash-cache designs either use too much DRAM or write flash too much. Log-structured caches write objects to flash sequentially and keep an index (typically in DRAM) that tracks where objects are located on flash [19, 34, 47, 61, 62, 65]. By writing objects sequentially and batching many insertions into each flash write, log-structured caches greatly reduce the excess bytes written to flash. However, tracking billions of tiny objects requires a large index, and even a heavily optimized index needs large amounts of DRAM [34]. Set-associative caches operate by hashing objects’ keys into distinct “sets,” much like CPU caches [16, 24, 53]. These designs do not require a DRAM index because an object’s possible locations are implied by its key. However, set-associative caches write many excess bytes to flash. Admitting a single small object to the cache requires re-writing an entire set, significantly amplifying the number of bytes written to the flash device.
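The contrast between the two designs can be made concrete with a short sketch (illustrative only; names, sizes, and the SHA-1 choice are our own, not taken from any of the cited systems). A set-associative cache derives an object's possible location from its key, so it needs no per-object index, whereas a log-structured cache must remember where in the log each object landed:

```python
import hashlib

NUM_SETS = 4096      # illustrative; production caches use hundreds of millions of sets

def set_for_key(key: str) -> int:
    """Set-associative placement: the candidate location is derived from the
    key itself, so no per-object DRAM index is needed."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SETS

# A log-structured cache instead appends objects wherever the log head is,
# so it must remember every object's flash location in a DRAM index:
log_index = {}                   # key -> byte offset in the on-flash log

def log_append(key: str, size: int, log_head: int) -> int:
    """Record the object's location and advance the log head."""
    log_index[key] = log_head    # one DRAM entry per object -> large index
    return log_head + size
```

The DRAM cost of `log_index` grows with the number of cached objects, which is exactly the problem for billions of tiny objects, while `set_for_key` costs no DRAM but forces a whole-page rewrite per admission.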
Our solution: Kangaroo. We introduce Kangaroo, a new flash-cache design optimized for billions of tiny objects. The key insight is that existing cache designs each address half of the problem, and they can be combined to overcome each other’s weaknesses while amplifying their strengths.
Kangaroo adopts a hierarchical design to achieve the best of both log-structured and set-associative caches (Figure 1(a)). To avoid a large DRAM index, Kangaroo organizes the bulk of cache capacity as a set-associative cache, called KSet. To reduce flash writes, Kangaroo places a small (e.g., 5% of flash) log-structured cache, called KLog, in front of KSet. KLog buffers many objects, looking for objects that map to the same set in KSet (i.e., hash collisions), so that each flash write to KSet can insert multiple objects. Our insight is that even a small log will yield many hash collisions, so only a small amount of extra DRAM (for KLog’s index) is needed to significantly reduce flash writes (in KSet).
Fig. 1. (a) High-level illustration of Kangaroo’s design. (b) Miss ratio achieved on a production trace from Meta by different flash-cache designs on a 1.9 TB drive with a budget of 16 GB DRAM and three device-writes per day. Prior designs are constrained by either DRAM or flash writes, whereas Kangaroo’s design balances these constraints to reduce misses by 29%.
The layers in Kangaroo’s design complement one another to maximize hit ratio while minimizing system cost across flash and DRAM. Kangaroo introduces three techniques to efficiently realize its hierarchical design and increase its effectiveness. First, Kangaroo’s partitioned index lets it efficiently find all objects in KLog that map to the same set in KSet while using a minimal amount of DRAM. Second, since Kangaroo is a cache, not a key-value store, it is free to drop objects instead of admitting them to KSet. Kangaroo’s threshold admission policy exploits this freedom to admit objects from KLog to KSet only when there are enough hash collisions — i.e., only when the flash write is sufficiently amortized. Third, Kangaroo’s “RRIParoo” eviction policy improves hit ratio by supporting intelligent eviction in KSet, even though KSet lacks a conventional DRAM index to track eviction metadata.
Summary of results. We implement Kangaroo as a module in CacheLib [16] ( cachelib.org). We evaluate Kangaroo by replaying production traces on real systems and in simulation for sensitivity studies. Prior designs are limited by DRAM usage or flash write rate, whereas Kangaroo optimizes for both constraints. For example, under typical DRAM and flash-write budgets, Kangaroo reduces misses by 29% on a production trace from Meta (Figure 1(b)), lowering miss ratio from 0.29 to 0.20. Moreover, in simulation, we show that Kangaroo scales well with flash capacity, performs well with different DRAM and flash-write budgets, and handles different access patterns well. We break down Kangaroo’s techniques to see how much each contributes. Finally, we show that Kangaroo’s benefits hold up in the real world through a test deployment at Meta.
Contributions. This paper contributes the following:
Problem: We show that, for tiny objects, prior cache designs require either too much DRAM (log-structured caches) or too many flash writes (set-associative caches).
Key idea: We show how to combine log-structured and set-associative designs to cache tiny objects on flash at low cost.
Theoretical foundations: We develop a rigorous Markov model that shows Kangaroo reduces flash writes over a set-associative flash design without any increase in miss ratio.
Kangaroo design & implementation: Kangaroo introduces three techniques to realize and improve the basic design: its partitioned index, threshold admission, and RRIParoo eviction. These techniques improve hit ratio while keeping DRAM usage, flash writes, and runtime overhead low.
Results: We show that, unlike prior caches, Kangaroo’s design can handle different DRAM and flash-write budgets. As a result, Kangaroo is Pareto-optimal across a wide range of constraints and for different workloads.
2 BACKGROUND AND RELATED WORK
This section discusses the important class of applications relying on billions of tiny objects, why flash is needed to cache them and the challenges flash brings, and the shortcomings of existing flash-cache designs.
2.1 Tiny Objects are Important and Numerous
Tiny objects are prevalent in many large-scale systems:
At Meta, small objects are prevalent in the Facebook social graph. For example, the average social-graph edge size is under 100 B. Across edges, nodes, and other objects, the average object size is less than 700 B [16, 24]. This has led to the development of a dedicated flash caching system for small objects [16].
At Twitter, tweets are limited to 280 characters, and the average tweet is fewer than 33 characters [55]. Due to the massive and growing number of tweets, Twitter seeks a cost-effective caching solution [74].
At Microsoft Azure, sensor updates from IoT devices in Azure Streaming Analytics are a growing use case. Before an update can be processed (e.g., to trigger a real-time action), the server must fetch metadata (the sensor’s unit of measurement, geolocation, owner, etc.) with an average size of 300 B. For efficiency and availability, it caches the most popular metadata [37]. Another use case arises in search advertising, where Azure caches predictions and other results [45, 46].
Each of these systems accesses billions of objects that are each significantly smaller than the 4 KB minimum write granularity of block-storage devices. For example, Facebook serves 1.5 billion daily users [9], and friendship connections alone account for hundreds to thousands of edges per user [24, 66]. Twitter logs over 500 million new tweets per day and serves over 190 million daily users [10]. While IoT update frequencies and ad impressions are not publicly available, the number of connected devices is estimated to have surpassed 50 billion in 2020 [32], and the average person was estimated to see 5,000 ads every day as early as 2007 [63].
2.2 Caching Tiny Objects in Flash is Hard
While individual objects in the above applications are tiny, application working sets on individual servers still add up to TBs of data. To reduce throughput demands on back-end data-management systems, applications rely on large-scale, cost-efficient caches, as a single caching server can replace tens of backend servers [16]. Unfortunately, as described below, current caching systems are inefficient for tiny objects. There is therefore a need for caching systems optimized specifically for large numbers of tiny objects.
Why not just use memory? DRAM is expensive, both in terms of acquisition and power cost per bit. This makes traditional DRAM caches hard to scale, particularly with data sizes increasing exponentially [4]. DRAM capacity is also often limited due to operational concerns. Data-center operators generally provision a limited number of server configurations to reduce management complexity. It is not possible in some real deployments to fine-tune server configuration for caching [64]. Moreover, DRAM is often in high demand, so all applications are encouraged to minimize DRAM usage. For example, the trend in recent years at Meta is towards less DRAM and more flash per server [16, 64].
Why flash? Flash currently provides the best combination of performance and cost among memory and storage technologies, and is thus the technology of choice for most large-scale caches [16, 21, 22, 34, 62]. It is persistent, cheaper and more power-efficient than DRAM, and much faster than mechanical disks. While flash-based caches do use DRAM for metadata and “hot” objects, the bulk of the cache capacity is flash to reduce end-to-end system cost.
Challenges in flash caching. Flash presents many problems not present in DRAM caches. A big one is that flash has limited write endurance, which means there is a limit on the number of writes before the flash device wears out and must be replaced [23, 40, 44]. Without care, caches can quickly wear out flash devices as they rapidly admit and evict objects [16, 34]. Hence, many existing flash caches over-provision capacity, suffering more misses in order to slow wearout [16, 22]. New flash technologies, such as multi-layer QLC (four bits per cell) and PLC (five bits per cell) [27], increase capacity and decrease cost but significantly reduce write endurance.
Exacerbating the endurance issue, flash drives suffer from write amplification. Write amplification occurs when the number of bytes written to the underlying flash exceeds the number of bytes of data originally written. Write amplification is expressed as a multiplier of the number of bytes written, and thus a value of 1\( \times \) is minimal, indicating no extra writes. Flash devices suffer from both device-level write amplification and application-level write amplification [40].
Device-level write amplification (dlwa) [34, 47, 65] occurs when the flash translation layer (FTL) writes more flash pages than asked for by the storage application (e.g., file system, database, or cache). Current flash drives implement the age-old block-storage interface, wherein hosts read and write logical blocks in a numerical logical-block address (LBA) namespace. Generally, the internal flash-page size and the external logical-block size are the same, with 4 KB being common, even though the flash device can only erase in much larger (e.g., 256 MB) “erase blocks”. Most dlwa is caused by cleaning activity that copies live pages elsewhere before an erase block is cleared.
Generally speaking, dlwa worsens as more of the raw flash capacity is utilized and as access patterns consist more of small, random writes. A common approach to reduce dlwa is over-provisioning, i.e., only exposing a fraction of the raw flash capacity in the LBA namespace, so that cleaning tends to find fewer live pages in victim erase blocks [16, 22]. Figure 2 shows dlwa vs. utilized capacity for random 4 KB writes to a 1.9 TB flash drive. As expected, dlwa significantly increases as over-provisioning decreases, from \( \approx 1\times \) at 50% utilization to \( \approx 10\times \) at 100% utilization.
Fig. 2. The effect of flash over-provisioning on device-level write amplification (dlwa) of random writes of various sizes. dlwa increases as over-provisioning decreases.
Application-level write amplification (alwa) occurs when the storage application re-writes some of its own data as part of its storage management. One form of this is akin to FTL cleaning, such as cleaning in log-structured file systems [44, 57] or compaction in log-structured merge trees [6, 8]. Another form is caused by having to write an entire logical block. To write a smaller amount of data, the application must read the block, install the new data, and then write the entire block [52]. For example, installing 1 KB of new data into a 4 KB logical block involves rewriting the other 3 KB, giving an alwa of 4\( \times \). Ideally, the unmodified data in the block would not have been rewritten.
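The read-modify-write arithmetic above reduces to a one-line formula, sketched here for concreteness (the helper name is ours):

```python
import math

def alwa_in_place(object_bytes: int, block_bytes: int = 4096) -> float:
    """alwa of an in-place update that must rewrite whole logical blocks
    (read-modify-write): bytes written to flash / bytes of new data."""
    blocks_touched = math.ceil(object_bytes / block_bytes)
    return blocks_touched * block_bytes / object_bytes

print(alwa_in_place(1024))   # -> 4.0: 1 KB into a 4 KB block, as above
print(alwa_in_place(100))    # -> 40.96: a 100 B object, the "40x" of Section 1
```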
Why caching tiny objects is hard. The size of tiny objects makes caching them on flash challenging. Tracking billions of tiny objects individually in large storage devices can require huge metadata structures [34], which either require a huge amount of DRAM, additional flash writes (if the index lives on flash), or both. To amortize tracking metadata, one could group many tiny objects into a larger, long-lived “meta-object”. This can be inefficient, however, if individual objects in the meta-object are accessed in disparate patterns.
Tiny objects are also a major challenge for write amplification. Traditional cache designs (i.e., for DRAM caches) freely re-write objects in place, leading to small, random writes, i.e., the worst case for dlwa. Since tiny objects are much smaller than a logical block, re-writing them in place would additionally involve substantial alwa — 40\( \times \) for a 100 B object in a 4 KB logical block — which is multiplicative with dlwa. Grouping tiny objects into larger meta-objects, as mentioned above, shifts alwa from logical blocks to meta-objects but does not address the problem.
2.3 Current Approaches and Related Work
This section discusses existing solutions for flash caching and their shortcomings for caching tiny objects.
Key-value stores: Flash-efficient key-value stores have been developed and demonstrated [8, 33, 48, 56, 70], and it is tempting to consider them when a cache is needed. But key-value stores generally assume that deletion is rare and that stored values must be kept until told otherwise. In contrast, caches delete items frequently and at their own discretion (i.e., every time an item is evicted). With frequent deletions, key-value stores experience severe write amplification, much lower effective capacity, or both [15, 22, 34, 65, 70].
As a concrete example, consider SILT [48], the key-value store that comes closest to Kangaroo in its high-level design. Like Kangaroo, SILT uses a multi-tier flash design to balance memory index size vs. write amplification. Unfortunately, SILT’s design is poorly suited to caching. For example, SILT’s two main layers, which hold \( \gt \)99% of entries, are immutable. Because those layers are immutable, delete operations are logged and do not immediately reclaim space. Thus, cache evictions result in holes (i.e., reduced cache capacity) until the next compaction (merge and re-sort) occurs. One can reduce the lost cache capacity with more frequent compactions, but at a large penalty to performance and alwa.
Similar issues with deletes affect most key-value stores, often with this same trade-off between compaction frequency and holes in immutable data structures. One may be able to reduce these overheads somewhat by coordinating eviction with compaction operations, but this is not trivial and not how these systems were designed. For instance, Netflix used RocksDB [8] as a flash cache and had to over-provision by 67% due to this issue [22]. Some key-value stores reduce alwa by making reads less efficient [49, 56, 70], but do not sidestep the fundamental challenge of deletes wasting capacity. In contrast, flash caches have the freedom to evict objects when convenient. This lets flash caches co-design data structures and policies so that deletes are efficient and minimal space is wasted.
Log-structured caches: To reduce write amplification, many flash caches employ a log structure on flash with an index in DRAM to track objects’ locations [19, 34, 47, 61, 62, 65]. While this solution often works well for larger objects, it requires prohibitively large amounts of DRAM for tiny objects, as the index must keep one entry per object. The index can spill onto flash [70], but spilling adds flash reads for lookups and flash writes to update the index as objects are admitted and evicted.
Even Flashield [34], a recent log-structured cache design for small objects, faces DRAM problems for larger flash devices. After optimizing its DRAM usage, Flashield needs 20 bits per object for indexing plus approximately 10 bits per object for Bloom filters. Thus, Flashield would need 75 GB of DRAM to track 2 TB of 100 B objects. In fact, Flashield’s DRAM usage is much higher than this because it relies on an in-memory cache to decide which objects to write to flash. The DRAM cache must grow with flash capacity or else prediction accuracy will suffer, leading to more misses.
Thus, the total DRAM required for a log-structured cache can quickly exceed the amount available and significantly increase system cost and power. Technology trends will make these problems worse over time, since cost per bit continues to decrease faster for flash than for DRAM [27, 71].
Set-associative flash caches: Metadata to locate objects on flash can be reduced by restricting their possible locations [53]. Meta’s CacheLib [16] implements such a design for small objects (\( \lt \)2 KB), e.g., items in the social graph [24]. CacheLib’s “small-object cache” (SOC) is a set-associative cache with variable-size objects, using a hash function to map each object to a specific 4 KB set (i.e., a flash page). With this scheme, SOC requires no index and only \( \approx \)3 bits of DRAM per object for per-set Bloom filters.
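A per-set Bloom filter can be sketched as follows (illustrative parameters and hashing scheme of our own choosing, not CacheLib's actual sizing): a negative answer proves the object is absent and avoids the flash read entirely, and there are no false negatives.

```python
import hashlib

class SetBloomFilter:
    """Tiny per-set Bloom filter sketch. Real deployments size this at only
    a few bits per cached object."""
    def __init__(self, bits=128, num_hashes=4):
        self.bits, self.num_hashes, self.bitmap = bits, num_hashes, 0

    def _positions(self, key):
        # Derive num_hashes bit positions from one digest of the key.
        d = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(d[4 * i:4 * i + 4], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.bitmap |= 1 << pos

    def may_contain(self, key):
        """False means definitely absent; True means probably present."""
        return all(self.bitmap >> pos & 1 for pos in self._positions(key))
```

Because the filter admits a small false-positive rate, a positive answer still requires reading the set's flash page and scanning it for the key.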
Although more DRAM-efficient, set-associative designs suffer from excessive write rates. Inserting a new object into a set means rewriting an entire flash page, most of which is unchanged, incurring 40\( \times \) alwa for a 100 B object and 4 KB page as discussed above. In addition, flash writes are a worst case for dlwa: small and random (Figure 2). The multiplicative nature of alwa and dlwa compounds the harmful effect on device lifetimes.
Set-associative flash caches limit their flash write-rate through two main techniques. To reduce dlwa, set-associative flash caches are often massively over-provisioned. For example, CacheLib’s SOC is run in production with over half of the flash device empty [16]. That is, the cache requires more than twice the physical flash to provide a given cache capacity. Additionally, to limit alwa, CacheLib’s SOC employs a pre-flash admission policy [16, 34] that rejects a fraction of objects before they are written to flash. Unfortunately, both techniques reduce the cache’s achievable hit ratio.
Summary: Prior work does not adequately address how to cache tiny objects in flash at low cost. Log-structured caches require too much DRAM, and set-associative caches add too much write amplification.
3 KANGAROO OVERVIEW AND MOTIVATION
Kangaroo is a new flash-cache design optimized for billions of tiny objects. Kangaroo aims to maximize hit ratio while minimizing DRAM usage and flash writes. Like some key-value stores [26, 48, 50], Kangaroo adopts a hierarchical design, split across memory and flash. Figure 3 depicts the two layers in Kangaroo’s design: (i) KLog, a log-structured flash cache and (ii) KSet, a set-associative flash cache; as well as a DRAM cache that sits in front of Kangaroo.
Fig. 3. Lookups in Kangaroo first check a tiny DRAM cache; then KLog, a small on-flash log-structured cache with an in-DRAM index; and finally KSet, a large on-flash set-associative cache. Kangaroo uses little DRAM because KLog is small and KSet has no DRAM index.
Basic operation. Kangaroo is split across DRAM and flash. As shown in Figure 3, ① lookups first check the DRAM cache, which is very small (\( \lt \)1% of capacity). ② If the requested key is not found, requests next check KLog (\( \approx \)5% of capacity). KLog maintains a DRAM index to track objects stored in a circular log on flash. ③ If the key is not found in KLog’s index, requests check KSet (\( \approx \)95% of capacity). KSet has no DRAM index; instead, Kangaroo hashes the requested key to find the set (i.e., the LBA(s) on flash) that might hold the object.
If the requested key is not in the small, per-set Bloom filter, the request is a miss. Otherwise, the object is probably on flash, so the request reads the LBA(s) for the given set and scans for the requested key.
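The three-step lookup path can be sketched as a small in-memory model (a sketch only: Python dicts and sets stand in for flash pages and Bloom filters, and the real KSet keeps no per-object state beyond a few Bloom-filter bits):

```python
import zlib

class MiniKangaroo:
    """In-memory sketch of Kangaroo's lookup path: DRAM -> KLog -> KSet."""
    def __init__(self, num_sets=1024):
        self.num_sets = num_sets
        self.dram = {}                                   # tiny DRAM cache
        self.klog_index = {}                             # DRAM: key -> log offset
        self.klog_flash = {}                             # "flash": offset -> value
        self.kset_pages = [{} for _ in range(num_sets)]  # "flash": one page per set
        self.bloom = [set() for _ in range(num_sets)]    # per-set filter stand-in

    def set_for_key(self, key):
        return zlib.crc32(key.encode()) % self.num_sets  # location implied by key

    def lookup(self, key):
        if key in self.dram:                     # 1. check the DRAM cache
            return self.dram[key]
        off = self.klog_index.get(key)           # 2. check KLog's DRAM index
        if off is not None:
            return self.klog_flash[off]          #    one flash read into the log
        sid = self.set_for_key(key)              # 3. check KSet (no DRAM index)
        if key not in self.bloom[sid]:
            return None                          #    Bloom miss: no flash read
        return self.kset_pages[sid].get(key)     #    read the page(s) and scan
```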
Insertions follow a similar procedure to reads, as shown in Figure 4. ① Newly inserted items are first written to the DRAM cache. This likely pushes some objects out of the DRAM cache, where they are either ② dropped by KLog’s pre-flash admission policy or ③ added to KLog’s DRAM index and ④ appended to KLog’s flash log (after buffering in DRAM to batch many insertions into a single flash write). Likewise, inserting objects to KLog will push other objects out of KLog, which are either ⑤ dropped by another admission policy or ⑥ inserted into KSet. Insertions to KSet operate somewhat differently than in a conventional cache. For any object moved from KLog to KSet, Kangaroo moves all objects in KLog that map to the same set to KSet, no matter where they are in the log. Doing this amortizes flash writes in KSet, significantly reducing Kangaroo’s alwa.
Fig. 4. Objects are first inserted into the tiny DRAM cache, then appended to KLog, and finally moved — along with all other objects mapping to the same set — to KSet. Kangaroo significantly reduces write amplification because KLog is written sequentially and each write to KSet inserts multiple objects.
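The batched move from KLog to KSet can be sketched as follows. This is a deliberate simplification: the helper scans the whole index, whereas Kangaroo's partitioned index (Section 5) locates the batch efficiently, and the `threshold` parameter anticipates the admission policy described later.

```python
def flush_to_kset(victim_key, klog_index, set_for_key, threshold=1):
    """Gather every KLog-resident object that maps to the victim's KSet set,
    so a single page write to that set admits the whole batch."""
    target = set_for_key(victim_key)
    batch = [k for k in klog_index if set_for_key(k) == target]
    if len(batch) < threshold:
        return []                 # too few collisions: drop rather than pay a write
    for k in batch:
        del klog_index[k]         # remove the whole batch from KLog's DRAM index
    return batch                  # all admitted with ONE flash write to the set

# Toy example: keys hash to sets by length, so "bb" and "cc" collide.
index = {"a": 0, "bb": 1, "cc": 2, "ddd": 3}
print(flush_to_kset("bb", index, set_for_key=lambda k: len(k) % 4))  # -> ['bb', 'cc']
```

Here one page write inserts two objects, halving the per-object write cost relative to admitting them one at a time.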
Design rationale. Kangaroo relies on its complementary layers for its efficiency and performance. At a high level, KSet minimizes DRAM usage and KLog minimizes flash writes. Like prior set-associative caches, KSet eliminates the DRAM index by hashing objects’ keys to restrict their possible locations on flash. But KSet alone suffers too much write amplification, as every tiny object writes a full 4 KB page when admitted. KLog comes to the rescue, serving as a write-efficient staging area in front of KSet, which Kangaroo uses to amortize KSet’s writes.
On top of this basic design, Kangaroo introduces three techniques to minimize DRAM usage, minimize flash writes, and reduce cache misses. (i) Kangaroo’s partitioned index for KLog can efficiently find all objects in KLog mapping to the same set in KSet, and is split into many independent partitions to minimize DRAM usage. (ii) Kangaroo’s threshold admission policy between KLog and KSet only admits objects to KSet when at least \( n \) objects in KLog map to the same set, reducing alwa by \( \ge n\times \). (iii) Kangaroo’s “RRIParoo” eviction improves hit ratio in KSet by approximating RRIP [41], a state-of-the-art eviction policy, while only using a single bit of DRAM per object.
4 THEORETICAL FOUNDATIONS OF KANGAROO
We develop a Markov model of Kangaroo’s basic design, including threshold admission, to analyze Kangaroo’s miss ratio and flash write rate. This model rigorously demonstrates that Kangaroo can greatly reduce alwa compared to a set-only design, without any increase in miss ratio.
Assumptions. For tractability, this analysis makes several simplifying assumptions that do not hold in our design (Section 5), implementation (Section 6), or evaluation (Section 7). We assume the independent reference model (IRM), in which each object has a fixed reference popularity, drawn independently from a known object probability distribution. However, we make no assumptions about the object popularity distribution; our model holds across any popularity distribution (uniform, Zipfian, etc.). We also assume that all objects are fixed-size and that KSet uses FIFO eviction. Similar assumptions are common in prior cache models [11, 17, 19, 30, 31, 38, 42, 54, 58].
We model a cache consisting of two layers: a log-structured cache and a set-associative cache, called KLog and KSet as in Kangaroo — but note that the model simplifies Kangaroo’s operation significantly. We assume that an object is first admitted to KLog. Once KLog fills up, it flushes all objects to KSet. KLog and KSet may employ an admission policy that drops objects instead of admitting them, as described below. Our goals are (i) to compute the miss probability and flash writes per cache access and (ii) to show that Kangaroo improves miss ratio for a given write rate vs. the baseline set-associative cache.
Modeling approach. We model how a single object moves through KLog and KSet. Figure 5 shows our simple continuous-time Markov model, which has three states: an object can be out-of-cache (\( O \)), in KLog (\( Q \)), or in KSet (\( W \)). To compute the miss probability \( m \), we need to know each object’s probability of being requested, which is fixed according to the IRM, and the probability that it is out-of-cache (in state \( O \)). To find the latter, we need to know each state’s stationary probabilities, \( \pi \), i.e., the likelihood of an object being in a given state once the cache reaches its steady-state behavior. To compute these probabilities and to find flash write rate, we require the transition rates between states, e.g., how often an object transitions \( O \rightarrow Q \) when an object is admitted to KLog. Table 1 summarizes the key variables in the model.
Fig. 5. The continuous Markov model for Kangaroo’s basic cache design (a) with no log, (b) with no admission policies, (c) with Kangaroo’s threshold admission before KSet, and (d) with probabilistic admission before KLog.
Table 1. Key Variables in the Markov Model
Summary of model results. Through building our model from a baseline set-associative cache, we will show that Kangaroo’s alwa follows Theorem 1 and that Kangaroo’s miss ratio is the same as a set-associative flash cache.
Theorem 1. Suppose KLog contains \( q \) objects; KSet contains \( s \) sets with \( w \) objects each; objects are admitted to flash with probability \( p \); and objects are admitted to KSet only if at least \( n \) objects are being inserted into their set. Kangaroo’s application-level write amplification is (1) \( \begin{equation} {{alwa}}_{{Kangaroo}} = p \left(1 + \frac{w \, \overline{F}_X(n)}{\overline{F}_X(1)\mathbb {E}\left[X | X \ge n\right]}\right), \end{equation} \) where \( X \sim \text{Binomial}(q,1/s) \) and \( \overline{F}_X(n) = \sum _{i=n}^{\infty } \mathbb {P}\left[X=i\right] \) is the probability that a set receives at least \( n \) objects and is thus re-written. Furthermore, the probability of admitting an object to KSet is \( \mathbb {P}\left[X \ge n \mid X \ge 1\right] \).
For a reasonable parameterization of Kangaroo on a 2 TB drive with 5% of flash dedicated to KLog (\( q=5 \cdot 10^8 \), \( s=4.6 \cdot 10^8 \), \( w=40 \), \( p=1 \), and \( n=2 \)), Kangaroo’s alwa will be \( \approx 5.8 \). In contrast, a set-associative cache of the same size and admission probability, \( \mathbb {P}\left[X \ge n \mid X \ge 1\right] \approx 0.45 \), has \( {{alwa}}_{{Sets}} = w \cdot 0.45 = 17.9\times \). That is, Kangaroo improves alwa by \( \approx 3.08\times \), a large decrease achieved with only a small fraction of flash dedicated to KLog.
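Theorem 1 can be evaluated numerically. Since \( q \) and \( s \) are huge and \( q/s \approx 1.09 \), \( \text{Binomial}(q, 1/s) \) is essentially \( \text{Poisson}(q/s) \); the sketch below uses that approximation, so its constants are estimates rather than the exact figures quoted above.

```python
import math

def kangaroo_model(q, s, w, n, p=1.0, kmax=60):
    """Evaluate Theorem 1's alwa and KSet admission probability under a
    Poisson(q/s) approximation to X ~ Binomial(q, 1/s)."""
    lam = q / s
    pmf = [math.exp(-lam)]
    for k in range(1, kmax):                   # pmf[k] = pmf[k-1] * lam / k
        pmf.append(pmf[-1] * lam / k)
    sf = lambda j: sum(pmf[j:])                # survival function P[X >= j]
    e_tail = sum(k * pmf[k] for k in range(n, kmax)) / sf(n)  # E[X | X >= n]
    alwa = p * (1 + w * sf(n) / (sf(1) * e_tail))
    admit_prob = sf(n) / sf(1)                 # P[X >= n | X >= 1]
    return alwa, admit_prob

alwa_kangaroo, admit = kangaroo_model(q=5e8, s=4.6e8, w=40, n=2)
alwa_sets = 40 * admit                         # baseline: w * admission probability
print(round(admit, 2))                         # -> 0.45, matching the text
```

Under these parameters the admission probability matches the \( \approx 0.45 \) quoted above, and the formula yields an alwa well below the set-associative baseline's \( \approx 17.9\times \).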
4.1 Baseline Set-associative Cache
We first analyze a baseline set-associative cache (i.e., without KLog) and build up to Kangaroo. This design has all objects admitted directly to KSet.
Each set in KSet holds \( w \) objects. Since we are modeling FIFO eviction of fixed-size objects, KSet evicts an object after \( w \) newer objects are inserted into the same set. With \( s \) sets, each set receives only a \( 1/s \) fraction of misses, so the rate of writing a new object to a given set is \( \frac{m}{s} \). Since there need to be \( w \) newer objects in the set to evict an object, the transition rate from \( W \rightarrow O \) is \( \frac{m/s}{w} \).
Without knowing the popularity distribution \( \lbrace r_i \rbrace \), we cannot go further than this; yet we will see it is sufficient to show that Kangaroo’s design does not compromise miss ratio under our model.
The application-level write amplification (alwa) is the flash write rate divided by the miss rate, since each miss should ideally write exactly one object. Hence, for the baseline: (8) \( \begin{equation} \text{ALWA}_{\text{baseline}} = \frac{f_{\text{baseline}}}{m_{\text{baseline}}} = w, \end{equation} \) which matches our expectations for set-associative designs, since \( w \) is just the size of each set (Section 2.3).
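A tiny simulation makes Equation (8) concrete: with FIFO sets and whole-page admission, every miss writes \( w \) object-slots, so object-writes divided by misses is exactly \( w \) regardless of workload (a sketch with made-up parameters):

```python
import random
from collections import deque

def simulate_set_assoc_alwa(num_sets=64, w=8, accesses=20_000, num_keys=2_000):
    """FIFO set-associative cache where each admission rewrites its whole
    set (w object-slots); returns object-writes per miss, i.e., alwa."""
    random.seed(1)
    sets = [deque(maxlen=w) for _ in range(num_sets)]
    misses = objects_written = 0
    for _ in range(accesses):
        key = random.randrange(num_keys)
        cached = sets[key % num_sets]
        if key not in cached:
            misses += 1
            objects_written += w    # one full-page write admits one object
            cached.append(key)      # deque(maxlen=w) gives FIFO eviction
    return objects_written / misses

print(simulate_set_assoc_alwa())    # -> 8.0, i.e., alwa = w
```

The simulation is an accounting check rather than a derivation, but it highlights the mechanism: the set size \( w \) is paid on every single admission.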
4.2 Add KLog, No Admission Policies
Next we add KLog, a log-structured cache, in front of KSet, as shown in Figure 5(b). KLog’s operation in our simple model is to buffer objects until full, and then flush the log’s contents to KSet.
In our simplified Markov model, KLog flushes all of its objects to KSet when it is completely full, i.e., when it holds \( q \) objects. KLog starts empty and each miss inserts one object, so on average KLog is half full when a given object is admitted. Therefore, on average, \( q/2 \) more objects must be inserted before the next flush, and the transition rate from \( Q \rightarrow W \) is \( \frac{m}{q/2} = \frac{2m}{q} \). Finally, the transition rate from \( W \rightarrow O \) is the same as in the baseline.
The approximation for \( \pi _{i,O} \) holds when \( q \ll s w \) (i.e., when KLog is much smaller than KSet). We find that Equation (9) is the same as Equation (4), demonstrating that adding KLog does not significantly affect the probability an object is out-of-cache, so long as KLog is small.
Writes to KSet are also cheaper because, even though \( w \) objects are still written to flash at a time, these writes are amortized across all objects in KLog that map to the same set. The number of objects admitted to each set is a balls-and-bins problem. Specifically, it follows a binomial distribution \( X \sim B(q,1/s) \). Each transition is amortized across \( {\mathbb {E}\left[X|X\ge 1\right]} \) objects, as KSet only writes the set if at least one object is admitted. The total flash write rate is: (16) \( \begin{equation} f_{\text{KLog-only}} = \sum _i \left(r_i \cdot \pi _{i,O} \cdot 1 + \frac{2m}{q} \cdot \pi _{i,Q} \cdot \frac{w}{\mathbb {E}\left[X|X\ge 1\right]}\right) \end{equation} \)
Hence, every object suffers a write amplification of: (17) \( \begin{equation} \text{ALWA}_{\text{KLog-only}} = 1 + \frac{w}{\mathbb {E}\left[X|X\ge 1\right]} \end{equation} \)
Deriving Equation (17) in detail: \( \begin{align*} f_\text{KLog-only} &= \sum _i \frac{r_i \cdot 2m \cdot 1 + \frac{2m}{q} \cdot q r_i \cdot \frac{w}{\mathbb {E}\left[X|X\ge 1\right]}}{q r_i + 2m + 2s w r_i} \\ &= \left(1 + \frac{w}{\mathbb {E}\left[X|X\ge 1\right]} \right) \times \left(\sum _i \frac{2 m r_i}{q r_i + 2m + 2s w r_i} \right) \\ &= \left(1 + \frac{w}{\mathbb {E}\left[X|X\ge 1\right]} \right) \times m_\text{KLog-only} \end{align*} \) This means that KLog is responsible for 1 object write, and the rest of the writes come from KSet.
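Equation (17) is easy to evaluate numerically. The sketch below computes \( \mathbb {E}\left[X|X\ge 1\right] \) in closed form and checks it against a Monte-Carlo estimate; the parameters are illustrative, not from the paper:

```python
import numpy as np

def e_x_given_ge1(q, s):
    # X ~ Binomial(q, 1/s). Since E[X] = E[X * 1{X>=1}],
    # E[X | X >= 1] = E[X] / P(X >= 1).
    return (q / s) / (1 - (1 - 1 / s) ** q)

def alwa_klog_only(q, s, w):
    # Equation (17): one KLog write plus KSet writes amortized over
    # all objects flushed into the same set.
    return 1 + w / e_x_given_ge1(q, s)

# Monte-Carlo check of E[X | X >= 1] with illustrative q=1000, s=500:
rng = np.random.default_rng(0)
x = rng.binomial(1000, 1 / 500, size=200_000)
mc = x[x >= 1].mean()
```

With \( q=1 \), a flush can carry at most one object per set, so \( \mathbb {E}\left[X|X\ge 1\right] = 1 \) and the alwa degenerates to \( 1 + w \), essentially the set-associative baseline plus one log write; growing KLog (larger \( q \)) amortizes more of KSet's writes.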
4.3 Add Threshold Admission Before KSet
Next, we consider the impact of adding Kangaroo’s threshold admission policy (Section 5.3), which only admits objects to a set in KSet if at least \( n \) objects map to that set. For instance, a threshold of 2 means that if KLog has only one object mapping to a set, then that object is discarded (evicted) instead of being inserted into KSet. To represent threshold admission, we add an edge in the Markov model (Figure 5(c)) back from \( Q \rightarrow O \), denoting the discarded objects.
The stream of objects flushed from KLog is split between the transitions \( Q \rightarrow O \) and \( Q \rightarrow W \). The added edge \( Q \rightarrow O \) represents the \( 1 - \tau \) fraction of objects that are discarded; the remaining \( \tau \) fraction transitions \( Q \rightarrow W \) as before. In the Markov model, the transition rates along these edges are scaled by the corresponding probabilities.
The admission policy also reduces the admission rate to state \( W \), which in turn causes an object to spend more time in state \( W \). This is reflected in the transition rate \( W \rightarrow O \), which is scaled by \( \tau \).
Surprisingly, the threshold admission policy does not change the stationary probabilities in the Markov model. Hence, the miss ratio is also unchanged:
(22) \( \begin{equation} m_{\text{threshold}} = m_{\text{KLog-only}} \approx m_{\text{baseline}} \end{equation} \)
Kangaroo’s threshold admission policy thus greatly decreases alwa in KSet’s set-associative design by guaranteeing a minimum level of amortization on all flash writes.
4.4 Add Probabilistic Admission Before KLog
The above techniques — KLog and threshold admission — are Kangaroo’s main tricks to reduce flash writes. However, the design thus far always has write amplification of at least 1\( \times \) because all objects are admitted to KLog. It is possible to achieve write amplification below 1\( \times \) by adding an admission policy in front of KLog. We now consider the effect of adding a probabilistic admission policy that admits objects to KLog with probability \( p \), dropping the rest, as shown in Figure 5(d).
This insensitivity to admission probability reflects a limitation of the model: we assume static reference probabilities, so all popular objects will eventually make it into the cache. In practice, object popularity changes over time, so miss ratio decreases at very low admission probabilities because the cache does not admit newly popular objects quickly enough.
4.5 Modeling Results
With a full model of Kangaroo, we can not only show the alwa benefits of Kangaroo over the set-associative baseline, but we can also predict the effect of different parameters on Kangaroo’s alwa.
Figure 6 shows the effect of KLog’s size on alwa without an admission policy. As Kangaroo devotes more space to KLog, its alwa greatly decreases. For instance, doubling the percent of the cache dedicated to KLog from 2.5% to 5% decreases alwa by up to 38%. Since the overall amount of flash dedicated to KLog is small, there is only a small increase in DRAM overhead for this configuration of Kangaroo. Thus, even a small KLog greatly decreases alwa.
Fig. 6. Modeled alwa for Kangaroo with different percentages of the flash cache devoted to KLog, assuming 4 KB sets, 100% probabilistic admission, and no admission threshold.
Figure 7 shows the effect of thresholding on alwa and KSet’s admission probability for different object sizes using Theorem 1, keeping KLog at 5% of cache size. With no thresholding (\( n=1 \)), no objects are rejected; as the threshold increases, more objects are rejected (Figure 7(a)). Also, since more objects fit in KLog when objects are smaller, smaller objects are more likely to be admitted. Thresholding significantly reduces alwa (Figure 7(b)). Importantly, the alwa savings are larger than the fraction of objects rejected, unlike purely probabilistic admission. For instance, with 100 B objects, threshold \( n=2 \) admits 44.4% of objects, but its write rate is only 22.8% of the write rate with threshold \( n=1 \).
Fig. 7. Modeled (a) admission percentage and (b) alwa for Kangaroo with different threshold values and object sizes, assuming 4 KB sets and KLog w/ 5% of capacity.
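The balls-and-bins view also gives a quick way to estimate the admission fraction for a given threshold: a flushed object is admitted iff at least \( n-1 \) of the other \( q-1 \) objects in KLog hash to its set. The sketch below is an illustrative approximation, not the paper's Theorem 1:

```python
from math import comb

def admit_fraction(q, s, n):
    # P(at least n-1 of the other q-1 logged objects map to this
    # object's set), with each object landing in a given set w.p. 1/s.
    p = 1 / s
    p_below = sum(comb(q - 1, k) * p**k * (1 - p) ** (q - 1 - k)
                  for k in range(n - 1))
    return 1 - p_below
```

With threshold \( n=1 \) every object is admitted; raising \( n \) rejects singleton sets first, which is why the write savings outpace the rejection rate.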
The alwa results are consistent with results in our experiments (Section 7), giving us confidence when using the model to explore the alwa design space.
5 KANGAROO DESIGN
This section describes the techniques introduced in KLog and KSet to reduce DRAM, flash writes, and miss ratio.
5.1 Pre-flash Admission to KLog
Like previous flash caches, Kangaroo need not admit all objects evicted from the DRAM cache [16, 28, 29, 34, 35, 36]. It has a pre-flash admission policy that can be configured to randomly admit objects to KLog with probability \( p \), decreasing Kangaroo’s write rate proportionally without additional DRAM overhead. Due to its low alwa, Kangaroo can afford to admit a larger fraction of objects to flash than prior designs; in fact, except at very low write budgets, Kangaroo admits almost all objects to KLog.
5.2 KLog
KLog’s role is to minimize the flash cache’s alwa without requiring much DRAM. To accomplish this, it must support three main operations: Lookup, Insert, and Enumerate-Set. Enumerate-Set allows KLog to find all objects mapping to the same set in KSet. Lookup and Insert operate similarly to a conventional log-structured cache with an approximate index. However, the underlying data structure is designed so that Enumerate-Set is efficient and has few false positives.
Operation. Like other log-structured caches, KLog writes objects to a circular buffer on flash in large batches and tracks objects via an index kept in DRAM. To support Enumerate-Set efficiently, KLog’s index is implemented as a hash table using separate chaining. Each index entry contains a compact tag of the object’s key, the object’s location in the on-flash log, and a pointer to the next entry in its chain.
Fig. 8. Overview of KLog operations.
Internal KLog structure. As depicted in Figure 8, KLog is structured internally as multiple partitions. Each partition is an independent log-structured cache with its own flash log and DRAM index. Moreover, each partition’s index is split into multiple tables, each an independent hash table.
This partitioned structure reduces DRAM usage, as described next, but otherwise changes the operation of KLog little. The table and partition are inferred from an object’s set in KSet. Hence, all objects in the same set will belong to the same partition, table, and bucket; and operations work as described above within each table.
Reducing DRAM usage in KLog. Table 2 breaks down Kangaroo’s DRAM usage per object vs. a naïve log-structured design as a standalone cache (“Naïve Log-Only”) and as a drop-in replacement for KLog (“Naïve Kangaroo”).
Table 2. Breakdown of DRAM Per Object for a 2 TB Cache, Comparing Kangaroo to a Naïve Log-structured Cache and Kangaroo with a Naïve Log Index
Bucket and LRU overhead assume 200 B objects.
KLog’s structure also reduces the per-object cost of the eviction policy.
In a naïve cache using LRU eviction, each entry keeps a pointer to adjacent entries in the LRU list. This requires \( 2 \cdot \log _2(LogSize/ObjectSize) \) bits. In contrast, Kangaroo’s RRIParoo policy (Section 5.4) is based on RRIP [41] and only needs 3 b per object in KLog (and even less in KSet).
Finally, each bucket in KLog’s index requires one pointer for the head of the chain. In naïve logs, this is a 64 b pointer. In KLog, it is a 16 b offset into the table’s memory. KLog allocates roughly one bucket per set in KSet. With 4 KB sets and 200 B objects, this bucket overhead is 3.1 b (naïve) or 0.8 b (KLog) per object.
All told, KLog’s partitioned structure reduces the per-object metadata from 190 b to 48 b, a 3.96\( \times \) savings vs. the naïve design. Compared to prior index designs, KLog’s index uses slightly more DRAM per object than the state of the art (30 b per object in Flashield [34]), but it supports Enumerate-Set and has fewer false positives. Most importantly, KLog only tracks \( \approx \)5% of objects in Kangaroo, so indexing overheads are just 3.2 b per object. Adding KSet’s DRAM overhead gives a total of 7.0 b per object, a 4.3\( \times \) improvement over the state of the art.
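The per-object arithmetic above can be reproduced directly (4 KB sets, 200 B objects, one index bucket per set, as stated in the text):

```python
objects_per_set = 4096 / 200            # ~20.5 objects share one 4 KB set
naive_bucket = 64 / objects_per_set     # 64 b head pointer per bucket -> ~3.1 b/object
klog_bucket = 16 / objects_per_set      # 16 b table offset per bucket -> ~0.8 b/object
savings = 190 / 48                      # total per-object metadata, naive vs. KLog -> ~3.96x
```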
5.3 KLog → KSet: Minimizing Flash Writes
Write amplification in KLog is not a significant concern because it has an alwa close to 1\( \times \) and writes data in large segments, minimizing dlwa. However, KSet’s write amplification is potentially problematic due to its set-associative design. Kangaroo solves this by using KLog to greatly reduce alwa in KSet: namely, by amortizing each flash write in KSet across multiple admitted objects.
Moving objects from KLog to KSet. A background thread keeps one segment free in each log partition. This thread flushes segments from the on-flash log in FIFO order, moving objects from KLog to KSet as shown in Figure 8(c). For each victim object in the flushed segment, this thread ① calls Enumerate-Set to find all other objects in KLog that should be moved with it; ② if there are not enough objects to move (see below), drops the victim object or, if it is popular, readmits it to KLog; ③ otherwise, moves the victim object and all other objects returned by Enumerate-Set from KLog to KSet in a single flash write.
Instead of flushing one segment at a time, one could fill the entire log and then flush it completely. But this leaves the log half-empty, on average. Flushing one segment at a time keeps KLog’s capacity utilization high, empirically 80–95%. Incremental flushing also increases the likelihood of amortizing writes in KSet, since each object spends roughly twice as long in KLog and is hence more likely to find another object in the same set when flushed.
Threshold admission to KSet. Kangaroo amortizes writes in KSet by flushing all objects in the same set together, but inevitably some objects will be the only ones in their set when they are flushed. Moving these objects to KSet would result in the same excessive alwa as a naïve set-associative cache. Thus, Kangaroo adds an admission policy between KLog and KSet that sets a threshold, \( n \), of objects required to write a set in KSet. If Enumerate-Set(\( x \)) returns fewer than \( n \) objects, then \( x \) is not admitted to KSet.
To avoid unnecessary misses to popular objects that do not meet the threshold when moving from KLog to KSet, Kangaroo readmits any object that received a hit during its stay in KLog back to the head of the log. This lets Kangaroo retain popular objects while only slightly increasing overall write amplification (due to KLog’s minimal alwa).
5.4 KSet
KSet’s role is to minimize the overall DRAM overhead of the cache. KSet employs a set-associative cache design similar to CacheLib’s Small Object Cache [16]. This design splits the cache into many sets, each holding multiple objects; by default, each set is 4 KB, matching flash’s read and write granularity. KSet maps an object to a set by hashing its key. Since each object is restricted to a small number of locations (i.e., one set), an index is not required. Instead, to look up a key, KSet simply reads the entire set off flash and scans it for the requested key.
To reduce unnecessary flash reads, KSet keeps a small Bloom filter in DRAM built from all the keys in the set. These Bloom filters are sized to achieve a false positive rate of about 10%. Whenever a set is written, the Bloom filter is reconstructed to reflect the set’s contents.
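The lookup path can be sketched as follows. This toy model (hypothetical class and helper names; flash pages emulated with in-memory dicts) shows the two KSet invariants described above: a key hashes to exactly one set, and each set's Bloom filter is rebuilt whenever that set is rewritten:

```python
import hashlib

class KSetSketch:
    """Toy model of KSet: per-set Bloom filters screen lookups in DRAM."""

    def __init__(self, num_sets, bloom_bits=64, bloom_hashes=3):
        self.num_sets = num_sets
        self.bloom_bits = bloom_bits
        self.bloom_hashes = bloom_hashes
        self.pages = [dict() for _ in range(num_sets)]  # stand-ins for 4 KB flash sets
        self.filters = [0] * num_sets                   # one small bitmask per set

    def _set_idx(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[-8:], "little") % self.num_sets

    def _bloom_positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.bloom_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.bloom_bits

    def write_set(self, idx, objects):
        # One flash write rewrites the whole set; the Bloom filter is
        # reconstructed from the set's new contents, as in the text.
        self.pages[idx] = dict(objects)
        mask = 0
        for key in objects:
            for pos in self._bloom_positions(key):
                mask |= 1 << pos
        self.filters[idx] = mask

    def lookup(self, key):
        idx = self._set_idx(key)
        if not all((self.filters[idx] >> pos) & 1 for pos in self._bloom_positions(key)):
            return None                   # filtered in DRAM: no flash read needed
        return self.pages[idx].get(key)   # one flash read: scan the whole set
```

A negative lookup usually terminates at the Bloom-filter check; a false positive merely costs one wasted flash read, never a wrong answer.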
RRIParoo: Usage-based eviction without a DRAM index. Usage-based eviction policies can significantly improve miss ratio, effectively doubling the cache size (or more) without actually adding any resources [12, 19, 20, 41, 62]. Unfortunately, implementing these policies on set-associative flash caches is hard, as such policies involve per-object metadata that must be updated whenever an object is accessed. Since KSet has no DRAM index to store metadata and cannot update on-flash metadata without worsening alwa, it is not obvious how to implement a usage-based eviction policy. For these reasons, most flash caches use FIFO eviction [2, 5, 7, 8, 16, 25, 37, 39, 60, 67], which keeps no per-object state. Unfortunately, FIFO significantly increases miss ratio because popular objects continually cycle out of the cache.
Kangaroo introduces RRIParoo, a new technique to efficiently support usage-based eviction policies in flash caches. Specifically, RRIParoo implements RRIP [41], a state-of-the-art eviction policy originally proposed for processor caches, while using only \( \approx \)1 bit of DRAM per object and without any additional flash writes.
Kangaroo also uses RRIP to merge objects from KLog. Tracking hits in KLog is trivial because it already has a DRAM index. Objects are inserted into KLog at the head of the log, and any hits they receive there inform their initial RRIP priority when they move to KSet.
Fig. 9. RRIParoo implements RRIP eviction with only \( \approx \) 1 b in DRAM per object and no additional flash writes.
DRAM usage. As shown in Table 2, KSet needs up to 4 bits in DRAM per object: one for RRIParoo and three for the Bloom filters. Combined with the DRAM usage of KLog that contains about 5% of objects, Kangaroo needs \( \approx \)7.0 b per object, 4.3\( \times \) less than Flashield [34]. Moreover, the 1 b per object DRAM overhead for RRIParoo can be lowered by tracking fewer objects in each set. Taken to the extreme, this would cause the eviction policy to decay to FIFO, but it allows Kangaroo to adapt to use less DRAM if desired.
6 EXPERIMENTAL METHODOLOGY
This section describes the experimental methodology that we use to evaluate Kangaroo in Section 7.
6.1 Kangaroo Implementation and Parameterization
We implement Kangaroo in C++ as a module within the CacheLib caching library [16]. Table 3 describes Kangaroo’s default parameters; we evaluate sensitivity to these parameters in Section 7.3.
6.2 Comparisons
We compare Kangaroo to (i) CacheLib’s small object cache (SA), a DRAM-optimized, set-associative design, and (ii) a log-structured cache (LS), a write-optimized design.
We run experiments on two 16-core Intel Xeon CPU E5-2698 servers running Ubuntu 18.04, one with 64 GB of DRAM and one with 128 GB of DRAM. We use Western Digital SN840 drives with 1.92 TB rated at three device-writes per day. This flash drive gives a sustained write budget of 62.5 MB/s. We chose these configurations to be similar to those deployed in the large-scale production clusters that contributed traces to this work, but with extra DRAM to let us explore large log-structured caches.
Except where noted, all experiments are configured to stay within 16 GB of DRAM (all-inclusive — DRAM cache, index, etc.); 62.5 MB/s flash writes, as measured directly from the device (i.e., including dlwa); and 100 K requests/s, similar to what is achieved by flash caches in production [16, 22]. To mimic a memory-constrained system, we limit the DRAM available to each system to this budget.
6.3 Simulation
To explore a wide range of parameters and constraints, we implemented a trace-driven cache simulator for Kangaroo. The simulator measures miss ratio and application-level write rate. We estimate device-level write amplification based on our results in Section 2, using a best-fit exponential curve to the dlwa of random, 4 KB writes at different flash utilizations.
6.4 Workloads
Our experiments use sampled 7-day traces from Meta [16] and Twitter [73]. These traces have average object sizes of 291 B and 271 B, respectively. For systems experiments, we scale the Meta trace to achieve 100 K reqs/s by running it \( 3\times \) concurrently in different key spaces.
The simulator results use sampled-down traces, and we scale up the measurements to a full-server equivalent based on the server’s flash capacity and desired load, as described in Section 6.6. We use 1% of the keys for the Meta trace and 10% of the keys for the Twitter trace. Unless otherwise noted, we report numbers for the last day of requests, allowing the cache to warm up and display steady-state behavior.
6.5 Metrics
Kangaroo is designed to balance several competing constraints that limit cache effectiveness. As such, our evaluation focuses on cache miss ratio, i.e., the fraction of requests that must be served from backend systems, under different constraints. We further report on Kangaroo’s raw performance, showing it is competitive with prior designs.
6.6 Scaling Traces
Our scaling methodology allows us to explore a wide range of system parameters in simulation. This methodology builds on prior analysis of scaling caches [13, 14, 43, 51, 59, 68], and Table 4 summarizes its key parameters.
Table 4. Key Parameters in Scaling Methodology
The model involves three free parameters that let us: (i) choose the load on each server; (ii) choose the flash cache size on each server; and (iii) down-sample requests to accelerate simulations. Moreover, the methodology incorporates three constraints to exclude infeasible configurations: request throughput, flash write rate, and DRAM usage.
Goals for scaling traces. The starting point for our methodology is a trace collected from a real production system. For simplicity and without loss of generality, we assume that the trace is gathered from a single caching server. This trace’s requests arrive at a rate \( R_\text{o} \) (measured in, e.g., requests/s).
The goal of our methodology is to use this trace to explore other system configurations. In particular, we want to explore caching systems with fewer or more caching servers and with different amounts of resources at each individual server. We do this by modeling the performance of a single server in the desired system configuration. Last but not least, we want to be able to do this efficiently, i.e., without needing to actually duplicate the original production system, by running scaled-down simulations.
Load factor and request rate per server. The first choice in the methodology is the load factor on each server, which changes the number of servers in the cluster. In the original system, each server serves requests at a rate \( R_\text{o} \) — by increasing or decreasing this rate, we effectively scale the number of caching servers that are needed to serve all user requests.
The parameter \( \ell \) sets the relative load factor at each server. That is, the request rate at each server in the modeled system is (27) \( \begin{equation} R_\text{m} = \ell \cdot R_\text{o}, \end{equation} \) and the total number of caching servers in the modeled system scales \( \propto 1/\ell \).
The load factor is clearly an important parameter. In general, a higher load factor is desirable, as higher load reduces the number of servers needed to serve all user requests. However, load factor is constrained by the maximum request throughput at a single server \( R_\text{max} \). Specifically, the maximum load factor is (28) \( \begin{equation} \ell _\text{max} = R_\text{max} / R_\text{o}. \end{equation} \) Higher load factors may also not be desirable because higher load increases flash write rate and, at a fixed cache size per server, increases miss ratio. Hence, the best load factor will depend on a number of factors, including properties of the trace like object size and locality (i.e., miss ratio curve).
Flash cache size. The next choice is the per-server flash cache size, \( S_\text{m} \). This is a free parameter constrained only by flash write rate and the size of the flash device. (A log-structured cache size is also constrained by DRAM, as discussed below.)
This parameter is significant because it determines the miss ratio at each server. A bigger cache is usually better, until write amplification or DRAM usage exceed the server’s constraints. For a given cache design, at cache size \( S_\text{m} \) it will achieve miss ratio of \( m_\text{m}(S_\text{m}) \) and a flash write rate (excluding dlwa) of
(29) \( \begin{equation} W_\text{m} \propto m_\text{m}(S_\text{m}) \cdot R_\text{m}. \end{equation} \) The miss ratio \( m(S) \) depends on the system because load factor varies between systems. Application-level write rate \( W \) is scaled by a design-specific factor corresponding to the cache design’s alwa — this factor is large for set-associative designs like SA and close to 1 for log-structured designs.
The maximum cache size \( S_\text{max} \) is determined from the flash-write constraint, \( D_\text{max} \). Specifically, we multiply the application-level flash write rate \( W_\text{m} \) by the estimated dlwa to get the device-level write rate \( D_\text{m} \). We then sweep the flash cache size \( S_\text{m} \) to find the sizes that stay within the constraint. Increasing cache size has two competing effects on write rate: larger caches generally have fewer misses, leading to fewer insertions, but they also suffer higher dlwa, increasing the cost of each insertion. As a result, the maximum size usually lies on the “knee” of the dlwa curve (see Figure 2), though which size hits the knee depends on the cache design (via alwa), the load factor (via \( R_\text{m} \)), and the trace itself (via \( m_\text{m} \)).
We are now ready, in principle, to run experiments to model the desired system. By replaying the original trace, which has a request rate of \( R_\text{o} \), we are simulating a system at \( 1/\ell \)-scale of the desired system (since \( R_\text{o} = R_\text{m} / \ell \)). We therefore need to scale the cache size in our experiments by the same factor, simulating a cache of size \( S_\text{s} = S_\text{m} / \ell \). (This is why increasing load factor can hurt miss ratio: all else equal, a larger load factor reduces effective cache capacity.) We can then interpret results by scaling them up by a factor \( \ell \), e.g., rescaling the measured write rate \( W_\text{s} \) to report a modeled write rate of \( W_\text{m} = \ell \cdot W_\text{s} \). We can accelerate experiments further by employing the same trick more aggressively.
Accelerating simulations by sampling down. To speed up simulation, we downsample the original trace by pseudorandomly selecting keys to produce a new, sampled trace that we use in the actual simulation experiments. This trace has a request rate of \( R_\text{s} \), yielding an empirically measured sampling rate of (30) \( \begin{equation} k = R_\text{s} / R_\text{o}. \end{equation} \) Downsampling by \( k \ll 1 \) means simulations process many fewer requests and also lets the simulated flash capacity fit in DRAM, significantly accelerating each experiment.
We must scale down the other resources in the system to match the downsampled trace. The simulated cache size is (31) \( \begin{equation} S_\text{s} = k \cdot S_\text{m}. \end{equation} \) With this scaling, simulated write rate needs to be scaled up by \( 1 / k \) to compute the modeled system’s write rate (32) \( \begin{equation} W_\text{m} = W_\text{s} / k. \end{equation} \) However, simulated miss ratio does not change (33) \( \begin{equation} m_\text{m}(S_\text{m}) = m_\text{m}(S_\text{s}/k) = m_\text{s}(S_\text{s}). \end{equation} \) (Miss ratio is invariant under sampling because it is the ratio of rates, so the scaling factors cancel.)
DRAM constraints per server. In addition to other constraints, systems are constrained in their DRAM usage. This is particularly important for log-structured caches like LS, whose DRAM index grows in proportion to the flash cache size.
For each simulation, we compute the DRAM overhead for that cache design (e.g., for its DRAM index and Bloom filters) and use the remaining DRAM capacity as a DRAM cache.
The methodology in practice. The above describes the methodology from a top-down perspective, starting from the decisions that have the largest impact on performance and cost. In practice, we use this methodology to understand the parameter limitations for the simulator. Then, the scaling methodology is applied in the opposite direction, starting from the parameters of a specific simulation experiment and backing out the modeled system configuration for any given simulation.
Specifically, we run each simulation with a DRAM capacity \( Q_\text{s} \), flash size \( S_\text{s} \), and trace sampled at rate \( k \). These experiments produce a miss ratio \( m_\text{s}(S_\text{s}) \) and application-level flash write rate \( W_\text{s} \).
Then, given a fixed DRAM budget in the modeled system \( Q_\text{m} \), we compute the full properties of the modeled caching system. We compute the size of the modeled flash cache as (35) \( \begin{equation} S_\text{m} = \frac{Q_\text{m} S_\text{s}}{Q_\text{s}}. \end{equation} \) This is the flash cache size that respects the modeled DRAM constraint and keeps DRAM : flash ratio constant. Moreover, to maintain miss ratio, the ratio of cache sizes \( S_\text{m}/S_\text{s} \) must equal the ratio of request rates \( R_\text{m}/R_\text{s} \). We want to model a system receiving \( R_\text{m} = \ell \cdot R_\text{o} \) requests, but actually run a simulation with \( R_\text{s} = k \cdot R_\text{o} \) requests. Hence, the load factor is (36) \( \begin{equation} \ell = \frac{R_\text{m}}{R_\text{s}} k = \frac{S_\text{m}}{S_\text{s}} k, \end{equation} \) yielding a modeled request rate of (37) \( \begin{equation} R_\text{m} = \frac{S_\text{m}}{S_\text{s}} R_\text{s} \end{equation} \) Finally, we scale the write rate and estimate dlwa for size \( S_\text{m} \) (38) \( \begin{equation} D_\text{m} = \text{DLWA}(S_\text{m}) \cdot \frac{W_\text{s}}{k}. \end{equation} \)
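Worked end to end, Equations (35) to (38) back out a full-scale system from a single simulation. The sketch below plugs in illustrative numbers consistent with this article's defaults (16 GB DRAM, a 2 TB flash cache, 100 K requests/s, 1% sampling); the dlwa curve is a caller-supplied assumption standing in for the fitted exponential:

```python
def back_out_model(Q_s, S_s, k, R_s, W_s, Q_m, dlwa):
    """Map one simulation (DRAM Q_s, flash S_s, sampling rate k, request
    rate R_s, app-level write rate W_s) to the modeled server with DRAM
    budget Q_m. `dlwa` estimates device-level write amplification as a
    function of flash cache size."""
    S_m = Q_m * S_s / Q_s          # Eq. (35): keep the DRAM:flash ratio fixed
    ell = (S_m / S_s) * k          # Eq. (36): implied load factor
    R_m = (S_m / S_s) * R_s        # Eq. (37): modeled request rate
    D_m = dlwa(S_m) * W_s / k      # Eq. (38): modeled device-level write rate
    return {"S_m": S_m, "load_factor": ell, "R_m": R_m, "D_m": D_m}

# A 1% sample: 0.16 GB DRAM and 20 GB flash in simulation model a server
# with 16 GB DRAM and a 2 TB flash cache at 100 K requests/s.
modeled = back_out_model(Q_s=0.16, S_s=20, k=0.01, R_s=1_000,
                         W_s=0.625, Q_m=16, dlwa=lambda S: 1.5)
```

With these inputs the implied load factor is exactly 1, i.e., the simulation models one server of the original cluster; varying \( Q_\text{m} \) or \( S_\text{s} \) sweeps other configurations.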
This methodology lets us run short simulations to estimate the behavior of a wide range of modeled caching systems, while obeying constraints faced by production servers.
7 EVALUATION
This section presents experimental results from Kangaroo and prior systems. We find that: (i) Kangaroo reduces misses by 29% under realistic system constraints. (ii) Kangaroo improves the Pareto frontier when varying constraints. (iii) In a production deployment, Kangaroo reduces flash-cache misses by 18% at equal write rate and reduces write rate by 38% at equal miss ratios. We also break down Kangaroo by technique to see where its benefits arise.
7.1 Main Result: Kangaroo Significantly Reduces Misses vs. Prior Cache Designs Under Realistic Constraints
Kangaroo aims to achieve low miss ratios for tiny objects within constraints on flash-device write rate, DRAM capacity, and request throughput. This section compares Kangaroo against SA and LS under these constraints.
Miss ratio: Figure 10 shows that Kangaroo reduces cache misses by 29% vs. the next-best design.
Fig. 10. Miss ratio for all three systems over a 7-day Meta trace. All systems are run with 16 GB DRAM, a 1.9 TB drive, and with write rates less than 62.5 MB/s.
Kangaroo uses 93% of flash capacity, increasing cache size by 15% vs. the next-best design.
Request latency and throughput: Kangaroo achieves reasonable throughput and tail latencies. When measuring flash cache performance without a backing store, Kangaroo’s peak throughput is 158 K gets/s, competitive with prior designs.
In any reasonable caching deployment, request tail latency will be set by cache misses as they fetch data from backend systems. However, for completeness and to show that Kangaroo has no performance pathologies, we present tail latency at peak throughput. Kangaroo’s p99 latency is 736 µs, comparable to prior designs.
7.2 Kangaroo Performs Well as Constraints Change
Between different environments and over time, system constraints will vary. Using our simulator, we now evaluate how the cache designs behave when changing four parameters: device write budget, DRAM budget, flash capacity, and average object size.
Device write budget. Device write budgets change with both the type of flash SSD and the desired lifetime of the device. To explore how this constraint affects miss ratio, we simulate the spread of miss ratios we can achieve at different device-level write rates. To change the device-level write rate, we vary both the utilized flash capacity percentage and the admission policies for all three systems while holding the total DRAM and flash capacity constant.
Figure 11 shows the tradeoff between device-level write rate budget and miss ratio. At 62.5 MB/s (the default) on both the Meta and Twitter workloads, Kangaroo consistently performs better than both SA and LS.
Fig. 11. Pareto curve of cache miss ratio at different device-level write rates under 16 GB DRAM and 2 TB flash capacity. At very low write rates, LS is best, but it is limited by DRAM from scaling to larger caches. Thus, for most write rates, Kangaroo outperforms both systems because it can take advantage of the entire cache capacity, has a lower write rate than SA, and has a better eviction policy than the other two systems.
DRAM capacity. Over time, the typical ratio of DRAM to flash in data-center servers is decreasing in order to reduce cost [64]. Figure 12 compares miss ratios for DRAM capacities up to 64 GB, holding flash-device capacity fixed at 2 TB and device-write rate at 62.5 MB/s. DRAM capacity does not greatly affect SA or Kangaroo, which are write-rate-constrained, but it has a huge effect on LS by increasing its cache size.
Fig. 12. Pareto curve of cache miss ratio as DRAM capacity varies from 5 to 64 GB. Flash capacity is fixed at 2 TB and device write rate is capped at 62.5 MB/s. The amount of DRAM does not greatly affect SA or Kangaroo, which are both write-rate-constrained, but has a huge effect on LS by increasing its cache size.
Larger flash capacities will shift the lines right as the DRAM : flash ratio decreases. Assuming write budget and request throughput scale with flash capacity, a 4 TB flash device requires 60 GB DRAM to achieve the same miss ratio as a 2 TB flash device with 30 GB DRAM. This makes the left side of the graph particularly important when comparing flash-cache designs.
Flash-device capacity. As stated in the previous section, the size of the flash device greatly impacts miss ratio and the significance of write-rate and DRAM constraints. Looking forward, flash-device capacities are likely to increase while DRAM capacity is unlikely to grow much and may even shrink [64]. Figure 13 shows the miss ratio for each system as the device capacity changes. Each system can use as much of the device capacity as it desires while staying within 16 GB DRAM and 3 device writes per day.
Fig. 13. Pareto curve of cache miss ratio at different device sizes. The DRAM capacity is limited to 16 GB and the device write rate to 3 device writes/day (e.g., 62.5 MB/s for a 2 TB drive).
Except at smaller flash capacities, Kangaroo is Pareto-optimal across device capacities. At smaller device capacities (\( \lt \)1.2 TB for the Meta trace and \( \lt \)1 TB for the Twitter trace), Kangaroo and LS perform similarly, since at these sizes LS’s DRAM index fits within the 16 GB budget.
Object size. The final feature that we study is object sizes. Figure 14 shows how miss ratio changes for each system as we artificially change the object sizes. For each object in the trace, we multiply its size by a scaling factor, but constrain the size to [1 B, 2 KB]. To study the impact of cache design as object sizes change, we keep the working-set size constant by scaling up the sampling rate (Section 6.6).
Fig. 14. Pareto curve of cache miss ratio vs. average object size. Object sizes are limited to \( [1\,B, 2048\,B] \) . The write rate is constrained to 62.5 MB/s for a 2 TB flash drive with 16 GB of DRAM.
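The size-scaling step of this methodology can be sketched as follows (a hypothetical helper, not the authors' actual tooling):

```python
# Sketch of the object-size scaling described above: each object's size is
# multiplied by a scaling factor and clamped to the [1 B, 2048 B] range.
def scale_object_size(size_bytes: int, factor: float) -> int:
    scaled = round(size_bytes * factor)
    return max(1, min(scaled, 2048))  # clamp to [1 B, 2 KB]

sizes = [100, 320, 1500]
print([scale_object_size(s, 2.0) for s in sizes])   # → [200, 640, 2048]
print([scale_object_size(s, 0.25) for s in sizes])  # → [25, 80, 375]
```

Note that the clamp makes scaling non-uniform at the extremes, which is why the working-set size must be held constant separately via the sampling rate.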
The cache designs are affected differently as object size scales. Kangaroo is also affected as objects get smaller, but not as much as prior designs: KSet's application-level write amplification increases with smaller objects, but less than SA's.
7.3 Parameter Sensitivity and Benefit Attribution
We now analyze how much each of Kangaroo's techniques contributes to its performance and how each should be parameterized. Figure 15 evaluates Kangaroo's sensitivity to four main parameters on the Meta trace: the pre-flash (KLog) admission probability, the number of RRIParoo bits, the KLog size, and the KSet admission threshold. All setups use the full 2 TB device capacity and 16 GB of DRAM. We build up each technique's contribution to miss ratio and application-level write rate starting from a basic set-associative cache with FIFO eviction.
Fig. 15. Miss ratio vs. application-level write rate based on various design parameters in Kangaroo: (a) KLog admission probability, (b) RRIParoo metadata size, (c) KLog size (% of flash-device capacity), and (d) KSet admission threshold.
Pre-flash admission probability. Figure 15(a) varies the admission probability from 10% to 100%. As the admission probability increases, the write rate increases because more objects are written to flash. However, the miss ratio does not decrease linearly with admission probability: admission has a smaller effect on miss ratio when the probability is high than when it is low. Since Kangaroo’s other techniques already significantly reduce application-level write amplification, we use a pre-flash admission probability of 90%.
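Probabilistic pre-flash admission amounts to an independent coin flip per object. A minimal sketch (illustrative only; the class and names are ours, not Kangaroo's code):

```python
import random

# Each object that would be written to flash is admitted only with
# probability p; the flash write rate therefore scales roughly with p.
class ProbabilisticAdmission:
    def __init__(self, p: float, seed: int = 0):
        self.p = p
        self.rng = random.Random(seed)  # seeded for reproducibility

    def admit(self, key) -> bool:
        return self.rng.random() < self.p

policy = ProbabilisticAdmission(p=0.9)
admitted = sum(policy.admit(k) for k in range(100_000))
print(admitted / 100_000)  # ≈ 0.9
```

The trade-off in Figure 15(a) follows directly: rejecting 10% of objects cuts writes by roughly 10%, but costs fewer misses than that because skipped objects may simply be admitted on a later request.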
Number of RRIParoo bits. Figure 15(b) shows miss ratios for FIFO and for RRIParoo with one to four bits. Although changing the eviction policy slightly changes the write rate (because there are fewer misses), we show only the miss ratio for readability. RRIParoo with one bit has 3.4% fewer misses than FIFO, and RRIParoo with three bits has 8.4% fewer. With four bits, the miss ratio increases slightly, a phenomenon also noted in the original RRIP paper [41]. Since three-bit RRIParoo uses the same amount of DRAM as one-bit RRIParoo (Section 5.4), we use three bits, which achieves the best miss ratio.
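For readers unfamiliar with RRIP [41], the following is a minimal sketch of n-bit RRIP-style eviction within a single set. It is an illustrative reimplementation of the base RRIP idea, not RRIParoo itself (which, per Section 5.4, keeps most of its bits on flash rather than in DRAM):

```python
# n-bit RRIP: each object carries a re-reference prediction value (RRPV).
# Hits reset the RRPV to 0; insertions use a "long" prediction; eviction
# picks an object at the maximum RRPV, aging everyone until one exists.
class RRIPSet:
    def __init__(self, capacity: int, bits: int = 3):
        self.capacity = capacity
        self.max_rrpv = (1 << bits) - 1  # 7 for 3 bits
        self.rrpv = {}                   # key -> re-reference value

    def access(self, key):
        if key in self.rrpv:
            self.rrpv[key] = 0           # hit: predict near re-reference
            return
        if len(self.rrpv) >= self.capacity:
            # Age all objects until some object reaches max_rrpv, then evict it.
            while not any(v == self.max_rrpv for v in self.rrpv.values()):
                for k in self.rrpv:
                    self.rrpv[k] += 1
            victim = next(k for k, v in self.rrpv.items() if v == self.max_rrpv)
            del self.rrpv[victim]
        self.rrpv[key] = self.max_rrpv - 1  # insert with "long" prediction

s = RRIPSet(capacity=2)
s.access("a"); s.access("b"); s.access("a")  # "a" is reused
s.access("c")                                # evicts the unreused "b"
print(sorted(s.rrpv))  # ['a', 'c']
```

With more bits, the policy distinguishes more reuse levels, which explains the gains from one to three bits; past that, aging takes longer and stale predictions linger, consistent with the slight regression at four bits.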
KLog size. Figure 15(c) shows that, as the KLog size increases, the flash write rate decreases significantly while the miss ratio is essentially unaffected (\( \lt \)0.05% maximum difference). However, a bigger KLog needs more DRAM for its index. Thus, KLog should be large enough to substantially reduce write amplification, but its index must neither exceed the available DRAM nor crowd out the DRAM cache. Flash writes can be reduced further via admission policies or by over-provisioning flash capacity as needed. We use 5% of flash capacity for KLog.
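A back-of-the-envelope calculation shows why the index bounds KLog's size. The per-entry index cost below is an illustrative assumption, not a measured value from the paper:

```python
# Rough DRAM cost of KLog's index at the configuration used above.
flash_capacity = 2 * 10**12          # 2 TB device
klog_fraction = 0.05                 # KLog = 5% of flash capacity
avg_object_size = 100                # ≈100 B tiny objects
index_bytes_per_object = 4           # assumed per-entry DRAM index cost

klog_objects = flash_capacity * klog_fraction / avg_object_size
index_dram = klog_objects * index_bytes_per_object
print(f"{index_dram / 10**9:.0f} GB of DRAM for the index")  # 4 GB
```

Even at these optimistic constants, a 5% KLog on a 2 TB device consumes a quarter of a 16 GB DRAM budget, so growing KLog much further quickly becomes untenable.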
KSet admission threshold. Figure 15(d) shows the impact of threshold admission to KSet. Thresholding reduces the flash write rate by up to 70.4%, but increases misses by up to 72.9% at the most extreme setting. Note that these results assume rejected objects are re-admitted to KLog if they have been hit, since re-admission reduces misses without significantly increasing flash writes. We use a threshold of 2, which reduces flash writes by 32.0% while increasing misses by only 6.9%.
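The flush-time decision can be sketched as follows (hypothetical function and field names, not Kangaroo's code):

```python
# When KLog flushes objects destined for one KSet set, the set is rewritten
# only if at least `threshold` objects map to it, amortizing the set rewrite.
# Rejected objects that were hit while in KLog are re-admitted to KLog.
def flush_to_set(pending: list, threshold: int):
    """pending: list of (key, was_hit_in_klog) tuples mapping to one set."""
    if len(pending) >= threshold:
        return pending, []               # one set rewrite admits the batch
    readmit = [(k, hit) for (k, hit) in pending if hit]
    return [], readmit                   # drop cold objects, keep hot ones

admitted, readmitted = flush_to_set([("x", True), ("y", False)], threshold=2)
print(admitted)  # both admitted: the batch meets the threshold
```

The threshold thus trades misses (dropped cold objects) for flash writes (fewer, better-amortized set rewrites), matching the curve in Figure 15(d).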
Benefit breakdown. In this configuration, Kangaroo reduces misses by 2% and decreases the application-level write rate by 67% vs. a set-associative cache that admits everything. Most of the miss-ratio benefit over this baseline comes from RRIParoo’s eviction policy, while most of the write-rate reduction comes from KLog together with threshold admission.
7.4 Production Deployment Test
Finally, we present results from two production test deployments on a small-object workload at Meta, comparing Kangaroo to SA.
To find appropriate production configurations, we chose seven configurations for each system that performed well in simulation: four with probabilistic pre-flash admission and three with a machine-learning (ML) pre-flash admission policy. The first production deployment ran all probabilistic admission configurations and the second ran all ML admission configurations. Since these configurations ran under different request streams, their results are not directly comparable.
Figures 16(a) and 16(b) present results over a six-day request stream for configurations with similar write rates (“equivalent WA”) as well as configurations that admit all objects to the flash cache (“admit-all”). Kangaroo reduces misses by 18% vs. SA at an equivalent write rate; in the admit-all configurations, Kangaroo instead reduces the application flash write rate by 38% at a similar miss ratio.
Fig. 16. Results from two production test deployments of Kangaroo and SA, showing (a) flash miss ratio and (b) application flash write rate over time using pre-flash random admission and (c) application flash write rate over time using ML admission. With random admission at equivalent write-rate, Kangaroo reduces misses by 18% over SA. When Kangaroo and SA admit all objects, Kangaroo reduces write rate by 38%. With ML admission, Kangaroo reduces the write rate by 42.5%.
We also tested both systems with the ML pre-flash admission policy that Meta uses in production [16]. Figure 16(c) presents results over a three-day request stream. The trends are the same: Kangaroo reduces application flash writes by 42.5% while achieving a similar miss ratio to SA.
8 CONCLUSION
Kangaroo is a flash cache for billions of tiny objects that handles a wide range of DRAM and flash-write budgets. Kangaroo leverages prior log-structured and set-associative designs, together with new techniques, to achieve the best of both designs. Experiments using traces from Meta and Twitter show DRAM usage close to the best prior DRAM-optimized design, flash writes close to the best prior write-optimized design, and miss ratios better than either. Kangaroo has been implemented in CacheLib [16] and is open-sourced for use by the community. Kangaroo shows that flash caches can support tiny objects, an adversarial workload for DRAM usage and write amplification, while maintaining flash’s cost advantage.
ACKNOWLEDGMENTS
We thank Meta, Twitter, and the members and companies of the PDL Consortium (Alibaba, Amazon, Datrium, Google, HPE, Hitachi, IBM, Intel, Meta, Microsoft, NetApp, Oracle, Pure Storage, Salesforce, Samsung, Seagate, Two Sigma and Western Digital) and VMware for their interest, insights, feedback, and support.
Footnotes
1 Processor caches reduce tag size vs. a fully associative cache similarly; each index table in KLog corresponds to a “set” in the processor cache.
REFERENCES
- [1] [n. d.]. Amazon DynamoDB. https://aws.amazon.com/dynamodb/features/. Accessed: 5/5/21.
- [2] [n. d.]. Apache Traffic Server. https://trafficserver.apache.org. Accessed: 2019-04-22.
- [3] [n. d.]. Azure Cache for Redis. https://azure.microsoft.com/en-us/services/cache/#what-you-can-build. Accessed: 5/5/21.
- [4] [n. d.]. Big Data Statistics, Growth, and Facts 2020. https://saasscout.com/statistics/big-data-statistics/. Accessed: 5/6/21.
- [5] [n. d.]. Fatcache. https://github.com/twitter/fatcache.
- [6] [n. d.]. LevelDB. https://github.com/google/leveldb.
- [7] [n. d.]. Redis on Flash. https://docs.redislabs.com/latest/rs/concepts/memory-architecture/redis-flash/.
- [8] [n. d.]. RocksDB. http://rocksdb.org.
- [9] 2020. Facebook reports first quarter 2020 results. investor.fb.com (Apr. 2020).
- [10] 2021. Twitter first quarter 2021 results. investor.twitterinc.com (May 2021).
- [11] 1971. Principles of optimal page replacement. J. ACM (1971).
- [12] 2018. LHD: Improving hit rate by maximizing hit density. In USENIX NSDI.
- [13] 2015. Talus: A simple way to remove cliffs in cache performance. In IEEE HPCA.
- [14] 2015. Scaling distributed cache hierarchies through computation and data co-scheduling. In IEEE HPCA.
- [15] 2019. Small refinements to the DAM can have big consequences for data-structure design. In ACM SPAA.
- [16] 2020. The CacheLib caching engine: Design and experiences at scale. In USENIX OSDI.
- [17] 2018. Practical bounds on optimal caching with variable object sizes. In ACM SIGMETRICS.
- [18] 2018. Robinhood: Tail latency aware caching–dynamic reallocation from cache-rich to cache-poor. In USENIX OSDI.
- [19] 2017. AdaptSize: Orchestrating the hot object memory cache in a content delivery network. In USENIX NSDI.
- [20] 2017. Hyperbolic caching: Flexible caching for web applications. In USENIX ATC.
- [21] 2016. Application Data Caching Using SSDs. https://netflixtechblog.com/application-data-caching-using-ssds-5bf25df851ef.
- [22] 2018. Evolution of Application Data Caching: From RAM to SSD. https://netflixtechblog.com/evolution-of-application-data-caching-from-ram-to-ssd-a33d6fa7a690.
- [23] 2010. Write endurance in flash drives: Measurements and analysis. In USENIX FAST.
- [24] 2013. TAO: Facebook’s distributed data store for the social graph. In USENIX ATC.
- [25] 2018. FASTER: An embedded concurrent key-value store for state management. VLDB (2018).
- [26] 2008. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS) 26, 2 (2008), 1–26.
- [27] 2020. Could cloud storage be disrupted in the next decade? In USENIX HotStorage.
- [28] 2016. Cliffhanger: Scaling performance cliffs in web memory caches. In USENIX NSDI.
- [29] 2017. Memshare: A dynamic multi-tenant key-value cache. In USENIX ATC.
- [30] 1973. Operating Systems Theory. Prentice Hall Professional Technical Reference.
- [31] 1990. An approximate analysis of the LRU and FIFO buffer replacement schemes. In ACM SIGMETRICS.
- [32] 2018. 2020: Life with 50 billion connected devices. In IEEE International Conference on Consumer Electronics. 1–1.
- [33] 2011. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In ACM SIGMOD.
- [34] 2019. Flashield: A hybrid key-value cache that controls flash write amplification. In USENIX NSDI.
- [35] 2013. MemC3: Compact and concurrent memcache with dumber caching and smarter hashing. In USENIX NSDI.
- [36] 2004. Distributed caching with memcached. Linux Journal 2004, 124 (2004), 5.
- [37] [n. d.]. Microsoft. Personal communication.
- [38] 2012. Performance evaluation of the random replacement policy for networks of caches. SIGMETRICS Perform. Eval. Rev. 40, 1 (June 2012), 395–396.
- [39] [n. d.]. McDipper: A Key-value Cache for Flash Storage. https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920/.
- [40] 2017. The unwritten contract of solid state drives. In ACM EuroSys.
- [41] 2010. High performance cache replacement using re-reference interval prediction. In ISCA-37.
- [42] 1999. Performance of the move-to-front algorithm with Markov-modulated request sequences. Oper. Res. Lett. 25, 3 (Oct. 1999), 109–118.
- [43] 1994. A comparison of trace-sampling techniques for multi-megabyte caches. IEEE Trans. Comput. 43, 6 (1994), 664–675.
- [44] 2015. F2FS: A new file system for flash storage.
- [45] 2017. Workload analysis and caching strategies for search advertising systems. In ACM SoCC.
- [46] 2018. Better caching in search advertising systems with rapid refresh predictions. In WWW. 1875–1884.
- [47] 2017. Pannier: Design and analysis of a container-based flash cache for compound objects. ACM Transactions on Storage 13, 3 (2017), 24.
- [48] 2011. SILT: A memory-efficient, high-performance key-value store. In ACM SOSP.
- [49] 2016. WiscKey: Separating keys from values in SSD-conscious storage. In USENIX FAST.
- [50] 2009. Modular data storage with Anvil. In ACM SOSP.
- [51] 2016. Whirlpool: Improving dynamic cache management with static data classification. In ASPLOS.
- [52] 1994. Opportunistic log: Efficient installation reads in a reliable storage server. In USENIX OSDI.
- [53] 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
- [54] 2016. Online coded caching. IEEE/ACM Transactions on Networking (2016).
- [55] 2018. Twitter’s doubling of character count from 140 to 280 had little impact on length of tweets. Tech Crunch (2018). https://techcrunch.com/2018/10/30/twitters-doubling-of-character-count-from-140-to-280-had-little-impact-on-length-of-tweets/. Accessed: 5/4/2021.
- [56] 2017. PebblesDB: Building key-value stores using fragmented log-structured merge trees. In ACM SOSP.
- [57] 1991. The design and implementation of a log-structured file system. In ACM SOSP.
- [58] 2010. Approximate models for general cache networks. In IEEE INFOCOM.
- [59] 2013. Reuse-based online models for caches. In ACM SIGMETRICS.
- [60] 2016. Optimizing flash-based key-value cache systems. In USENIX HotStorage.
- [61] 2018. Didacache: An integration of device and application for flash-based key-value caching. ACM Transactions on Storage (TOS) 14, 3 (2018), 1–32.
- [62] 2020. Learning relaxed Belady for content distribution network caching. In USENIX NSDI.
- [63] 2007. Anywhere the eye can see, it’s likely to see an ad. The New York Times 15, 1 (2007). https://www.nytimes.com/2007/01/15/business/media/15everywhere.html. Accessed: 9/6/2020.
- [64] 2020. Twine: A unified cluster management system for shared infrastructure. In USENIX OSDI.
- [65] 2015. RIPQ: Advanced photo caching on flash for Facebook. In USENIX FAST.
- [66] 2011. The anatomy of the Facebook social graph. arXiv preprint arXiv:1111.4503 (2011).
- [67] 2016. The Varnish Book for Varnish 4.0. Varnish Software AS.
- [68] 2017. Cache modeling and optimization using miniature simulations. In USENIX ATC.
- [69] 2013. Using set cover to optimize a large-scale low latency distributed graph. In USENIX HotCloud.
- [70] 2015. LSM-trie: An LSM-tree-based ultra-large key-value store for small data items. In USENIX ATC.
- [71] 2016. Bluecache: A scalable distributed flash-based key-value store. VLDB 10, 4 (2016), 301–312.
- [72] 2021. A large-scale analysis of hundreds of in-memory key-value cache clusters at Twitter. ACM Transactions on Storage (TOS) 17, 3 (2021), 1–35.
- [73] 2020. A large scale analysis of hundreds of in-memory cache clusters at Twitter. In USENIX OSDI.
- [74] 2021. Segcache: A memory-efficient and scalable in-memory key-value cache for small objects. In USENIX NSDI.
- [75] 2020. Taming Tail Latency and Achieving Predictability. https://twitter.github.io/pelikan/2020/benchmark-adq.html.