An Empirical Analysis on Memcached's Replacement Policies

The performance of large-scale web services heavily relies on the hit ratio of their key-value caches. One core component of a high-performance key-value cache is the replacement policy. The right replacement policy can help the caching system achieve a better hit ratio with no extra space cost, thereby improving the system's throughput and end-to-end latency. Memcached and Redis are two in-memory key-value caching systems widely used in production. Both are simple to use and capable of meeting the end-to-end latency requirements of latency-critical services, yet they use different policies for cache replacement: Memcached uses LRU or a variant of segmented LRU (SegLRU), while Redis uses KLRU, a random-sampling-based LRU policy that evicts the LRU object among K randomly selected samples. This naturally leads to the question: how does one compare to the other in actual production usage? To answer it, we implement the KLRU policy in Memcached and evaluate the effectiveness of these three policies using both synthetic and actual production workloads. Our empirical analysis shows that both SegLRU and KLRU outperform LRU in scalability for write-intensive workloads. However, despite the fact that SegLRU and KLRU differ considerably in their heuristics and implementations, they yield very similar cache hit ratios, throughput, and scalability, with the random-sampling-based LRU slightly ahead on write-heavy workloads. KLRU also offers simpler data structures and the flexibility to adjust the sampling size K to adapt to different workloads.


INTRODUCTION
Modern large-scale web services rely on caching extensively; in-memory key-value (KV) caches are placed between front-end services and back-end storage systems to achieve high throughput and bridge the long latency gap. In-memory KV caches are widely used and discussed in industry and the research community; Memcached [29] and Redis [23] are two in-memory caching solutions commonly deployed in production environments. Large web service providers like Facebook and Twitter have also developed their own general-purpose caching frameworks, Cachelib [4] and Pelikan [42], respectively, to handle their caching use cases.
The performance of these caching systems is largely determined by the replacement/eviction policy, i.e., the algorithm that decides whether an item should be cached or evicted. Least Recently Used (LRU) is one of the most widely known replacement policies; it evicts an item based on the item's access recency. Despite its simplicity, the LRU policy has proven quite effective in many caching systems [4,9,32,41]. There are also many more advanced eviction policies, such as [2,6,27,33,43], which make eviction decisions based on a combination of item metadata. The effectiveness of an eviction policy primarily depends on two factors. First, the caching workload: with the rise of cloud and data-driven services, the diversity of in-memory caching workloads has grown drastically [1,4,10,41]. Many existing studies have shown that no heuristic-based eviction policy consistently outperforms the others in every caching use case [27,33,43]. As a result, many special-purpose caching frameworks have adopted different replacement policies to accommodate different use cases [13,26,39,42].
Second, the underlying cache's structure and storage medium also constrain the choice of replacement policy. For example, flash-based caches suffer from application-level write amplification (ALWA); even though FIFO does not deliver a good hit ratio, using FIFO avoids metadata updates on reads, which reduces ALWA [15]. For this reason, FIFO is often the top choice for replacement in flash-based caches.
Memcached and Redis use replacement policies based on more conservative heuristics to avoid poor performance under unpredictable, extreme workload patterns. Early versions of Memcached used the traditional doubly-linked-list implementation of LRU as the default replacement policy [12]. The LRU policy helps the cache retain the most recent data; however, the serialized LRU update procedures severely hindered Memcached's multithreaded scalability, especially on write-heavy workloads. To address the thread contention problem, later versions of Memcached (after 1.5.0) added a multi-queue LRU with asynchronous updates, named Segmented LRU (SegLRU), which significantly improved Memcached's performance on write-heavy workloads [12]. Similar multi-list approaches are found in other caching systems [4,35,43].

On the other hand, Redis uses random sampling to approximate true LRU replacement. On cache eviction, the random-sampling-based LRU, or KLRU for short, randomly selects K objects from the cache and evicts the least recently used among them [22]. Such sampling techniques are commonly used in priority-function-based replacement policies [2,6,22,31]. Although Memcached and Redis use similar replacement heuristics, their implementations take very different approaches, which motivates a thorough comparison of their impacts on an in-memory caching system's performance.

In this work, we implement Redis-like KLRU in Memcached. We then present a detailed comparison of their impact on Memcached's hit ratios, scalability, throughput, and latency. More specifically, this paper is organized as follows: (1) Section 2 gives a high-level overview of Memcached and its cache replacement mechanism. (2) Section 3 presents our design and implementation of the KLRU replacement, with support for handling expired items. (3) Section 4 describes the workloads used in our evaluation. (4) Section 5.1 compares the miss ratios of Memcached configured with SegLRU and KLRU. Our results confirm that both SegLRU and KLRU yield similar miss ratios, especially under a sufficiently large cache. (5) Section 5.2 presents an empirical performance evaluation of Memcached under SegLRU and KLRU. We observe similar read performance for both policies and slightly higher write performance for KLRU. We also show the impact of network latency and the global slab allocator lock on Memcached's throughput. Based on our evaluation, both SegLRU and KLRU perform noticeably better than the early version of Memcached, with KLRU slightly leading SegLRU on write-intensive workloads.

MEMCACHED'S REPLACEMENTS OVERVIEW
Memcached is a multithreaded key-value cache typically used to reduce latency and increase throughput by caching objects in memory, in front of a slow back-end storage system. Memcached uses a slab-based memory allocator for internal memory management, with a default slab size of 1 MB. Internally, memory is divided into slab classes, with each class storing items of the corresponding size range. Initially, slabs are distributed to slab classes on demand. When a slab class exhausts all of its slabs, it first tries to request a new slab; if no slab remains, it begins evicting items from the slab class according to the replacement algorithm. The replacement policy and its related data structures are thus maintained on a per-class basis.
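As a concrete illustration, the slab-class sizing and selection just described can be sketched as follows. This is a simplified model, not Memcached's actual allocator; the minimum chunk size and growth factor are illustrative defaults (they mirror Memcached's `-f` growth-factor concept, but the code is ours):

```python
SLAB_SIZE = 1024 * 1024  # 1 MB, Memcached's default slab size

def build_slab_classes(min_size=96, growth_factor=1.25, max_size=SLAB_SIZE):
    """Chunk sizes grow geometrically up to one full slab."""
    sizes = []
    size = min_size
    while size < max_size:
        sizes.append(size)
        size = int(size * growth_factor)
    sizes.append(max_size)
    return sizes

def pick_class(item_size, classes):
    """Return the smallest chunk size that fits the item; items are
    rounded up to their slab class, which is why replacement runs
    independently per class."""
    for chunk in classes:
        if item_size <= chunk:
            return chunk
    raise ValueError("item larger than a slab")
```

For example, with these defaults a 100-byte item lands in the 120-byte class, since chunk sizes run 96, 120, 150, and so on.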

LazyLRU Policy
Early versions of Memcached (1.4.x) use a standard doubly-linked list to implement LRU replacement, where every slab class maintains its own LRU list. Memcached uses multiple worker threads to process client requests concurrently. Client requests commonly take the form of GET or SET requests. When handling a GET request, a worker thread first performs a hash table lookup.
If the requested item is found, the worker thread updates the item's metadata and position in the LRU list and then sends the retrieved item back to the client. If not found, Memcached notifies the client of the miss. When handling a SET request, the worker thread first checks whether there is enough free space left to store the item. If space is sufficient, the new item is stored in memory and inserted at the head of the LRU list. In case of insufficient memory, the worker thread triggers the cache replacement/eviction algorithm to remove an old item and make space for the new one.
When the client requests an item, Memcached maintains the LRU list by repositioning the newly referenced item at the head of the list. When the slab class exceeds its capacity, Memcached evicts items from the tail of the LRU list. These manipulations of the LRU list must be serialized to prevent list corruption; a dedicated LRU mutex lock is therefore held during LRU list updates and evictions so that only one worker thread can modify the list at a time. This approach worked at the time, but as Memcached scales to more cores, the LRU lock becomes a bottleneck for Memcached's throughput [37]. One optimization Memcached made to reduce LRU lock contention is the item_update_interval, set to one minute by default. The update interval prevents items on the LRU list from being moved to the head of the list more than once per minute. The idea is that recently used items are always closer to the head of the LRU list and are thus less likely to age out and be evicted, so constantly repositioning these recently accessed items is unnecessary. Setting the update interval to one minute drastically decreases the number of LRU list updates, hence reducing LRU lock contention. For convenience, we use LazyLRU to denote this approach in the following discussion.
The LazyLRU approach addresses the excessive LRU locking issue in most read-intensive scenarios (see Section 5.2), but it does not eliminate the problem completely. Workloads with bursts of re-accesses longer than the item_update_interval are likely to suffer performance degradation. Furthermore, the item_update_interval only alleviates the LRU locking issue on the read path; for write-intensive workloads, Memcached must still hold the LRU lock when inserting a new item into the LRU list.
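The LazyLRU read path can be sketched as follows. This is a minimal single-threaded model with illustrative names, not Memcached's code; in the real system the `move_to_end` step is the mutex-protected list update that the interval check skips most of the time:

```python
import time
from collections import OrderedDict

class LazyLRU:
    """LRU cache where a GET repositions an item to the head only if it
    has not been repositioned within item_update_interval."""

    def __init__(self, capacity, update_interval=60.0, clock=time.monotonic):
        self.capacity = capacity
        self.update_interval = update_interval
        self.clock = clock
        self.items = OrderedDict()   # key -> value; last entry = most recent
        self.last_bump = {}          # key -> time of last repositioning

    def get(self, key):
        if key not in self.items:
            return None
        now = self.clock()
        if now - self.last_bump[key] >= self.update_interval:
            self.items.move_to_end(key)  # the only (serialized) list update
            self.last_bump[key] = now
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)   # evict from the LRU tail
        self.items[key] = value
        self.last_bump[key] = self.clock()
```

The sketch also makes the caveat above visible: an item re-accessed within the interval is not repositioned, so it can still sit near the tail and be evicted despite being recently used.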

SegLRU Policy
The Segmented LRU (SegLRU for short) policy was designed to replace Memcached's original LRU policy (LazyLRU) and became Memcached's default replacement policy starting from version 1.5.0 [12]. Its design is inspired by OpenBSD's variant of the 2Q algorithm [35].
In the Segmented LRU policy, the original LRU list is split into three separate segments: HOT, WARM, and COLD, each protected by its own mutex lock. Unlike LazyLRU, where LRU list updates lie directly on the read request's execution path, Segmented LRU shifts cached items between and within segments asynchronously via a background maintainer thread, thus directly avoiding potential lock contention on read-intensive workloads. Next, we briefly outline the semantics of each segment based on the Memcached site post [12]: (1) HOT behaves like a FIFO queue. A newly arrived item is added to the head of the HOT segment and gradually sinks toward its tail as more items flow in. When the HOT segment reaches its limit, the background maintainer shifts the tail item to the head of the WARM segment if it is active, or to the COLD segment if it is inactive. An item is active if it has been re-accessed at least once while flowing from the segment's head to its tail. (2) WARM only admits active items. When the WARM segment reaches its limit, the background maintainer asynchronously moves overflowed inactive items to COLD and re-admits active items to the head of WARM. (3) COLD admits inactive items from both HOT and WARM.
If an item becomes active in COLD, it is asynchronously moved back to WARM. When the slab class is full, items are evicted from the tail of COLD.
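The segment flow described above can be sketched roughly as follows. This is a synchronous, single-threaded simplification of what Memcached's background maintainer does asynchronously; the per-item active flag is condensed into a set, and all names are illustrative:

```python
from collections import deque

class SegLRU:
    """Schematic HOT/WARM/COLD segment flow (per the Memcached post [12])."""

    def __init__(self, hot_limit, warm_limit):
        self.hot, self.warm, self.cold = deque(), deque(), deque()
        self.hot_limit, self.warm_limit = hot_limit, warm_limit
        self.active = set()  # keys re-accessed since entering a segment

    def insert(self, key):
        self.hot.appendleft(key)   # new items enter HOT like a FIFO
        self._maintain()

    def touch(self, key):
        self.active.add(key)       # reads only mark activity; no list moves

    def _maintain(self):
        while len(self.hot) > self.hot_limit:
            k = self.hot.pop()     # HOT tail sinks out
            (self.warm if k in self.active else self.cold).appendleft(k)
            self.active.discard(k)
        while len(self.warm) > self.warm_limit:
            k = self.warm.pop()
            if k in self.active:   # active items are re-admitted to WARM
                self.warm.appendleft(k)
                self.active.discard(k)
            else:
                self.cold.appendleft(k)

    def evict(self):
        return self.cold.pop() if self.cold else None  # evict from COLD tail
```

Even in this toy form, the scan-resistance property is visible: items that are never re-accessed sink straight from HOT to COLD and are evicted first, while re-accessed items are parked in WARM.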
Here we highlight three main improvements of SegLRU over LazyLRU. (1) Similar to the 2Q replacement [21], SegLRU achieves scan resistance by sinking all inactive items directly to COLD, evicting them from its tail, and using the WARM queue to keep active items protected. (2) It is more tunable: SegLRU allows the sizes of the HOT and WARM queues to be changed at runtime. (3) The LRU mutex lock is wholly removed from the read path, avoiding all potential waits on mutex locks for read requests. However, mutexes on the write/update path still exist, which remains a potential bottleneck when scaling Memcached to more cores under write-intensive workloads.

KLRU IN MEMCACHED
This section presents our design and implementation of KLRU in Memcached and its impact on background crawling of expired items.

KLRU Design and Implementation
The KLRU replacement is a variant of the LRU algorithm in which, at eviction time, the cache randomly samples K items and evicts the LRU item among them. When K = 1, KLRU is equivalent to random replacement. It is easy to see that as K increases, the probability of selecting a less recently used item also increases. In practice, we notice that when K = 16, KLRU behaves nearly identically to exact LRU replacement. To make a fair comparison between Memcached's replacements and KLRU, we implement the KLRU algorithm on top of Memcached, similar to the RankCache design from LHD [2]. On a GET/SET request from the client, KLRU does not maintain a doubly-linked list to track the recency order of cached items; instead, it simply updates the timestamp of the referenced item. With the doubly-linked LRU list removed, there is no longer a need for the LRU mutex lock to safeguard list manipulation, hence avoiding the potential lock contention problem [37]. Algorithm 1 outlines the eviction process in KLRU. When the cache is full, worker threads randomly select K items from the corresponding slab class by generating random indexes (lines 10-12). The selected items are then compared against the current eviction candidate (lines 19-22). Note that only the actual item removal is serialized (line 26) to prevent multiple workers from removing the same item simultaneously; the rest of Algorithm 1 runs in parallel to maximize thread concurrency.
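A rough sketch of this sampling-based eviction, in the spirit of Algorithm 1 (not our actual Memcached patch; the slot representation and field names are illustrative):

```python
import random

def sample_evict(slots, k, now, rng=random):
    """slots: list of item dicts ('key', 'last_access', 'expire_at') or None
    for free chunks. Samples up to k slots, reclaims expired items among
    them, and evicts the least recently used survivor."""
    candidates = rng.sample(range(len(slots)), min(k, len(slots)))
    victim = None
    evicted = []
    for idx in candidates:
        item = slots[idx]
        if item is None:
            continue  # sampled a free chunk; nothing to do
        if item["expire_at"] is not None and item["expire_at"] <= now:
            evicted.append(idx)  # reclaim expired items eagerly
            continue
        if victim is None or item["last_access"] < slots[victim]["last_access"]:
            victim = idx         # track the LRU candidate so far
    if victim is not None:
        evicted.append(victim)
    # In a multithreaded cache, only this removal step needs serialization.
    for idx in evicted:
        slots[idx] = None
    return evicted
```

Note that the loop touches no shared ordering structure: the comparison runs entirely on sampled timestamps, which is why the sampling phase can proceed in parallel across worker threads.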
Compared to LazyLRU and SegLRU, the most noticeable distinction of the sampling-based replacement policy (KLRU) is that it does not rely on any data structure to maintain the ordering of cached items. This unique characteristic has both advantages and drawbacks. We summarize three major advantages of KLRU as follows: (1) KLRU uses random sampling for eviction, completely eliminating LRU locks from both the read and write paths.
(2) The per-item overhead is much smaller. KLRU only uses item timestamps during eviction to compare recency; hence, it saves the 16 bytes of the two extra pointers for the doubly-linked LRU list. (3) Random-sampling eviction provides great flexibility: two tunable factors can optimize cache performance across different use cases. First, the eviction process described in Algorithm 1 is orthogonal to the item's priority; even though KLRU uses only recency information to find the lowest-priority item on eviction (line 19), the priority function can easily be changed to adapt to even more diverse cache use cases [2,6,22]. The second factor is the sampling size K: when a workload favors LRU replacement, one can dynamically increase K (up to 32); when a workload favors random replacement, a smaller K can be chosen to increase eviction randomization. However, this also points to the drawback of sampling-based eviction: it does not guarantee that popular or recently accessed items live longer in the cache, especially for small values of K, and might thus lead to occasional latency spikes on popular items.

Handling Item Expiration
Similar to other in-memory caching software [3,23,42], Memcached supports item expiration. The client can specify an item's time-to-live (TTL) on a SET request; the TTL limits how long an item remains valid in the cache before it expires. Memcached employs a separate background crawler that periodically walks the LRU list starting from its tail, removing expired items and reclaiming their memory. However, this approach is not feasible for Memcached with KLRU: the implementation described earlier completely removes the LRU lists from the system, so background crawling of expired items can no longer be done by walking LRU lists. Instead, we scan for expired items directly at the slab level. Since each slab is a contiguous 1 MB memory region, scanning for expired items at the slab level is more cache-friendly than scanning via the LRU list. After the system runs for a while, items end up allocated randomly across a slab class's slabs. One downside of scanning at the slab level is that it wastes resources scanning unused memory chunks; however, this should be rare since in-memory caches are usually full when crawling is triggered. Secondly, Memcached also supports a passive expired-item reclaim mechanism triggered whenever an eviction occurs: on eviction, Memcached checks a fixed number of items from the tail of the LRU list and removes any expired ones. Note that this passive mechanism operates on the lock-protected LRU list and lies directly on the eviction execution path; thus, it could limit scalability due to heavy locking of the LRU list. To add passive expired-item reclaim support to KLRU, we adopt a passive reclaim mechanism similar to the one used in Redis [23]: when KLRU randomly selects K items from the slab class on eviction, we directly remove all selected items that have expired (Algorithm 1, lines 14-17).
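The slab-level crawl can be sketched as follows. This is an illustrative model of one slab's chunk array, not Memcached's memory layout; the point is that the crawler walks a contiguous array sequentially instead of chasing LRU-list pointers:

```python
def crawl_slab(chunks, now):
    """chunks: one slab's fixed-size chunks, each an item dict with
    'expire_at' (None = no TTL) or None for a free chunk.
    Reclaims expired items in place and returns how many were reclaimed."""
    reclaimed = 0
    for i, item in enumerate(chunks):
        if item is None:
            continue  # free chunk: the wasted-scan case noted above
        if item["expire_at"] is not None and item["expire_at"] <= now:
            chunks[i] = None  # reclaim the chunk
            reclaimed += 1
    return reclaimed
```

A background thread would apply this to each slab of each class in turn, which is the sequential, cache-friendly access pattern the text argues for.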

TRACES DESCRIPTION
We use three publicly available production workloads and well-known synthetic workloads to evaluate the differences between LazyLRU, SegLRU, and KLRU.
MSR. The MSR Cambridge traces [34] are a collection of one-week block I/O traces from 13 different enterprise servers, such as a web proxy (prxy), media server (mds), and source control (src1, src2). The MSR workload consists of a diverse set of access patterns that have previously been used in many caching system evaluations [5,30,36,38,40].
Twitter. The Twitter cache traces [41] are a collection of one-week-long caching traces from 54 of Twitter's in-memory caching clusters. The trace suite is around 14 TB in file size and comprises various caching use cases. For our evaluation, we choose sub-traces from 9 different caching clusters, each with approximately 100 million requests and various read/write ratios.
IBM COS. The IBM COS traces [14] consist of 99 traces, each collected over a week from IBM's public cloud-based object storage service. Each IBM trace has a different requests-to-distinct-objects ratio. To ensure that enough items are stored in the cache, we use IBM COS traces containing at least 1 million distinct items and with a steady miss ratio of less than 30%.
Our evaluation uses all four workloads to test the miss ratio (or, conversely, hit ratio) performance of the different cache replacements. When evaluating the impact of the replacement policy on Memcached's throughput and scalability, we mainly use Twitter's caching workloads, as they contain traces with a variety of read/write ratios.

EVALUATION
This section presents the miss ratio differences and an empirical throughput and latency evaluation for Memcached configured with different cache replacement policies, namely LazyLRU, SegLRU, and KLRU. Our experiments run on a server with two NUMA nodes, each with a 2.20 GHz Intel(R) Xeon(R) E5-2650 v4 CPU containing 12 cores (24 threads with hyper-threading enabled) and 250 GB of DRAM. The operating system is Fedora 32 with Linux kernel 5.6.15.
In the following evaluation, we use Memcached 1.6.10 with modifications to support the KLRU replacement. Moreover, we deploy the Memcached instance on only one NUMA node to eliminate possible memory access latency discrepancies. We use a modified version of mc-crusher [28] on the second node to generate requests for Memcached; we add support for setting items back on cache misses, as well as for loading and replaying existing traces. In Section 5.1, we compare the effectiveness of KLRU and SegLRU in terms of cache miss ratio and conclude that both implementations have their advantages (Figure 3); there is no clear winner from the perspective of miss ratio. Memcached's replacement policies are built on the doubly-linked LRU list, where any modification of the list is protected by a mutex lock; in contrast, KLRU is completely lock-free and only requires a timestamp update. This section focuses on the impact of the data structures and locks of the three policies, so we would like to eliminate differences caused by miss ratio. Unless otherwise specified, we over-provision Memcached's memory size so that the entire workload fits in the cache; therefore, there are no capacity misses. Additionally, we always warm the cache for a sufficient time to eliminate discrepancies from cold misses in our results.

Miss Ratio Comparison
One of the most important metrics to consider when evaluating different cache replacement policies is the cache's miss (or hit) ratio.This section presents the cache's miss ratio results of the workloads mentioned in Section 4. Since Memcached manages eviction independently on each slab class using the same replacement policy, we fix item size for all items in a workload so that all items can fit into exactly one slab class for easier understanding and analysis of the cache's miss ratio.
We generate miss ratio curves (MRCs, plots of the cache's miss ratio against cache size) for every trace under the different replacement policies. When generating MRCs under the LazyLRU and SegLRU policies, we use Memcached's default settings for all associated tunable parameters. For Memcached with the KLRU policy, we do not tune for the optimal sampling size K either; instead, we conservatively choose K = 16. Figure 1 illustrates the MRCs of 9 representative traces from MSR, Twitter, and IBM COS. When the item_update_interval for LazyLRU is set to a high value, LazyLRU behaves similarly to FIFO; therefore, we also plot FIFO MRCs for the traces in Figure 1 for reference. Based on the MRCs, we observe that KLRU, LazyLRU, and SegLRU perform very similarly in most instances; they approach steady miss ratios within roughly the same memory size. The largest gap appears in ibm.029, where SegLRU results in much higher miss ratios at certain cache sizes. For a more detailed illustration, Figure 2 depicts the miss ratio differences between SegLRU and KLRU for all traces tested under three cache sizes of 95%, 75%, and 50% of the working set size, respectively. As expected, with larger cache sizes, the miss ratio differences between SegLRU and KLRU are close to zero. We observe more deviation between the two policies when the cache size is small relative to the working set (50%). As the cache size increases, more items, especially popular ones, fit into the cache, resulting in minor differences.
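For reference, one point on an MRC can be obtained by trace-driven simulation along the following lines. We use an exact-LRU simulator here purely for brevity; our actual curves are measured against Memcached running each real policy:

```python
from collections import OrderedDict

def miss_ratio(trace, cache_size):
    """Replay a key trace against an exact-LRU cache of the given size
    and return the fraction of requests that miss."""
    cache, misses = OrderedDict(), 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)       # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the LRU item
            cache[key] = True
    return misses / len(trace)

def mrc(trace, sizes):
    """One MRC: miss ratio at each candidate cache size."""
    return {s: miss_ratio(trace, s) for s in sizes}
```

Sweeping `sizes` over a range of cache capacities yields the curve; for LRU-like policies the miss ratio is non-increasing in cache size, which is the shape visible in Figure 1.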
Next, Figure 3 illustrates the impact of cache misses on application performance. This experiment uses two traces to show the latency change over time with Memcached configured with SegLRU and KLRU. Miss penalties, i.e., the time to fetch missed data from the back-end database, are randomly distributed between 100 and 300 us for both tests. To reflect the impact of cache misses on request latencies, we include the cache hit ratio alongside the latency plot in Figure 3. The first trace, Figure 3(a), is a synthetic trace that follows a Zipfian popularity distribution with alpha = 0.99, interrupted by a long scanning request pattern in the middle. As expected, we observe that SegLRU is much better at protecting the cache from long scanning requests. KLRU appears to recover from the scan more slowly than SegLRU, which is reasonable given that we use a sampling size of K = 16; at this sampling size, KLRU behaves similarly to an LRU cache. The second trace, Figure 3(b), is IBM COS 029. As shown in Figure 1, the KLRU cache's overall miss ratio is significantly lower than the SegLRU cache's under the IBM COS 029 trace. We observe a significant latency reduction during the interval from 2000 to 3000 seconds, where the KLRU cache's hit ratio exceeds the SegLRU cache's.
Based on our miss ratio comparison results, it is clear that no policy consistently yields the lowest miss ratio. The replacement policy's effectiveness largely depends on the workload's request pattern. Thus, it is crucial that caching systems like Memcached provide runtime tuning capability for their replacement algorithm so that users can manually tune the replacement to better suit their use cases.

5.2.1 Throughput. To eliminate miss ratio differences among the policies, we use the same size (96 bytes) for every KV pair so that all items fit into a single slab class. Figure 4 depicts Memcached's throughput under naive LRU and the three other LRU variants. In the read-intensive case (Fig. 4a), Memcached configured with LazyLRU, SegLRU, and KLRU achieves similar throughput. These LRU variants relax LRU locking on read requests and, as a result, outperform the naive LRU implementation (which locks the LRU queue on every read request) by 45%. In the write-intensive case (Fig. 4b), we do not observe any major performance difference between naive LRU and the three variants. Although Memcached configured with KLRU is completely LRU-lock-free on write requests, its throughput improvement over naive LRU is insignificant. The similar throughput between KLRU and naive LRU implies that the LRU lock is not the primary limiting factor of Memcached's performance under write-intensive workloads. Furthermore, we observe that LazyLRU underperforms naive LRU by 35% in the write-intensive case. Inspecting Memcached's source code, we notice that under LazyLRU, Memcached always attempts to reclaim memory before allocating memory for a new item; the reclamation walks up a few elements from the LRU tail and removes expired/invalidated items along the way. This LRU-lock-protected reclamation process severely limits LazyLRU's throughput and scalability (Sec. 5.2.3). After removing the reclamation from the write request's execution path, the throughput of LazyLRU recovers to the same level as the other variants.

Throughput, Latency, and Scalability
5.2.2 Latency. Figure 5 illustrates Memcached's latency cumulative distribution under read-intensive, write-intensive, and read/write-mixed cases, and Table 1 shows the tail latencies of the three corresponding cases. Memcached shows similar latency distributions and tail behavior for read-intensive workloads under the different policies. As the fraction of write requests increases, KLRU and SegLRU show lower end-to-end latency than LazyLRU. For Memcached configured with limited memory, one should expect better tail latency from KLRU, as both SegLRU and LazyLRU require a mutex lock on eviction, which could hurt tail behavior as Memcached scales.

5.2.3 Scalability. We present the scalability results in Figure 6, which shows the change in Memcached's throughput as the number of worker threads increases from 1 to 20. Our experiment shows that Memcached scales nearly linearly up to 20 worker threads for all three replacement policies under read-intensive workloads. Memcached's scalability decreases as the workloads shift toward the write-intensive end. Even though KLRU is entirely LRU-lock-free, throughput is still capped at 1 MQPS beyond 12 worker threads. SegLRU, which locks the LRU queue on writes, achieves only slightly lower throughput than KLRU for a given number of worker threads. The similar scalability between KLRU (LRU-lock-free writes) and SegLRU (LRU-locked writes) indicates that the LRU lock on writes is not the dominant factor limiting Memcached's write capability. For LazyLRU, throughput degrades when Memcached scales beyond 12 worker threads; as mentioned before, the long LRU-locked reclamation process on LazyLRU's write path creates heavy lock contention as the number of worker threads increases, bottlenecking Memcached's performance. In summary, for read-intensive workloads, our experiment shows that all three LRU variants achieve significantly higher throughput than the naive LRU implementation, and all three scale close to linearly. For write-intensive workloads, Memcached with KLRU performs slightly better than with SegLRU. Nonetheless, we find that Memcached's write capability stops scaling past 12 worker threads regardless of the replacement policy.

Metadata Overheads.
The KLRU design scrapes off the entire LRU list layer from Memcached, which leads to two apparent benefits. First, it simplifies the read/write execution paths and lowers the overall system complexity. Second, it saves 16 bytes per item by removing the two pointers used for the doubly-linked LRU list. For workloads with large item sizes, a 16-byte saving in metadata is not significant, but for small-item workloads (< 100 bytes), reducing metadata overhead can save a significant portion of memory. For example, under SegLRU, the trace used in Figure 7 takes 96 bytes (including metadata) per item, but under KLRU it takes only 80 bytes per item, representing a 17% reduction in total memory consumption.
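The arithmetic behind this example is simply the following (assuming 8-byte pointers, as on a 64-bit build):

```python
# Per-item savings from dropping the doubly-linked LRU list in KLRU.
seglru_item = 96            # bytes per item under SegLRU, metadata included
pointer_saving = 2 * 8      # prev/next pointers of the doubly-linked list
klru_item = seglru_item - pointer_saving   # 80 bytes per item under KLRU

saving_fraction = pointer_saving / seglru_item   # 16/96 ~= 16.7%, i.e. ~17%
```

The smaller the item, the larger this fraction: the 16 bytes are fixed, so metadata overhead dominates precisely in the small-object workloads where in-memory caches hold the most items.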

Eviction Overheads. Upon eviction, KLRU randomly samples K elements from the slab class and evicts the oldest item among the K selected items; the sampling cost grows as K increases. Fortunately, it has been shown that KLRU with a sampling size as small as 16 can approximate exact LRU very well [2,40]. In our experiments, we use 16 as the default sampling size. To compare eviction overheads with SegLRU, we configure Memcached with three different memory sizes such that the cache miss ratio is the same for KLRU and SegLRU. Figure 7 shows that the throughput for SegLRU and KLRU is nearly the same in this setting, indicating that random sampling up to sample size 16 does not negatively impact Memcached's throughput. Next, we compare the slab-level expired-item crawling described earlier with the original Memcached implementation. To compare their effectiveness in handling expired items, we use a read-intensive trace with all items' TTLs set to 60 seconds, so that items staying in the cache longer than one minute are considered expired. Figure 8(a) shows Memcached's throughput change over an 8-hour interval, and Figure 8(b) shows Memcached's memory consumption over the same period. We observe that Memcached with KLRU and SegLRU yields very similar throughput, with KLRU slightly ahead. In terms of memory consumption, both crawling mechanisms bound the memory usage compared to Memcached with expired-item reclaim disabled.

Impacts of Network Latency and Slab Allocation Lock
In this section, we further compare Memcached replacements by peeling off other performance-limiting factors on Memcached.
We first consider the impact of network latency. Network latency can account for a significant portion of the overall end-to-end request latency, and high network latency can mask the performance differences between replacement policies. To focus the evaluation on replacement policies, we bypass the network by buffering the entire workload into Memcached and then replaying requests directly within the process. In Figure 9, we compare Memcached's throughput with and without network bypass enabled. After bypassing the network stack, we observe no major throughput change for naive LRU under the read-intensive workload, as it is bottlenecked by heavy LRU lock contention, but the three other LRU variants show significant throughput increases; KLRU's throughput is notably higher than SegLRU's and LazyLRU's. For the write-intensive workload, Memcached with and without network latency yields similar performance differences among replacement policies, which backs our previous claim (Sec. 5.2.1) that the LRU lock is not the major bottleneck on Memcached's write path. Lastly, besides the LRU lock, Memcached's write path is also guarded by a global slab allocation lock, which ensures that all internal memory management operations, such as memory allocation and reclamation, are serialized. To examine the impact of the slab allocator lock, we temporarily change the global slab lock to different levels of fine-grained slab locks. Figure 10 shows the throughput of Memcached running the write-intensive workload with the first eight default slab classes sharing 1, 4, and 8 slab locks, respectively. As the number of slab locks increases, the contention pressure on each lock decreases, leading to higher throughput.
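The idea of sharding the global slab lock can be sketched as follows. This is an illustrative Python model, not our C modification to Memcached; in particular, mapping a slab class to a lock by simple modulo is an assumption:

```python
import threading

class ShardedSlabLocks:
    """Replace one global slab-allocation lock with a small array of locks,
    so fewer threads contend on each one (the effect measured in Figure 10)."""

    def __init__(self, num_locks):
        self.locks = [threading.Lock() for _ in range(num_locks)]

    def lock_for(self, slab_class_id):
        # Hypothetical mapping: slab classes are hashed onto lock shards.
        return self.locks[slab_class_id % len(self.locks)]

# Usage sketch: guard each allocation with its class's shard lock.
#   with sharded.lock_for(cls_id):
#       allocate_chunk(cls_id)   # hypothetical allocator call
```

With `num_locks = 1` this degenerates to the original global lock; raising it to 4 or 8 spreads the eight slab classes across shards, matching the configurations compared in Figure 10.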

RELATED WORK
There is a great amount of research on cache eviction policies. In this section, we review policies used in state-of-the-art production caching systems and in academia.
Production caching. Facebook - Cachelib [4] is a general-purpose caching engine developed by Facebook to balance the generality and specialization of a wide variety of caching systems, including CDN caches, KV caches, media caches, and social-graph caches. Cachelib is a hybrid cache engine that supports caches composed of DRAM and flash, and its eviction policy is configurable for the two underlying storage media. For the DRAM cache, each separate DRAM cache pool or slab class can apply a different recency- or frequency-based eviction policy, including LRU, LRU with multiple insertion points, 2Q [21], and TinyLFU [13]. For the flash cache, FIFO and a pseudo-LRU policy are used in the Large Object Cache to amortize the computational cost of flash erasures, while the Small Object Cache only supports policies with no state updates on hits, such as FIFO. Twitter - Segcache [42], a new storage back-end dedicated to small objects developed by Twitter, argues that macro-management over contiguous blocks of object segments can improve both throughput and scalability by reducing the CPU cycles spent maintaining object indexes for eviction and other operations. It uses a merge-based algorithm to perform eviction by segments: multiple consecutive, un-expired object segments in the same TTL range are combined into one, and per-object eviction decisions are made while traversing those segments by evaluating each object's frequency-over-size ratio.
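Segcache's per-object retention criterion during a segment merge can be illustrated with a small sketch. The function names and the cutoff-based retention rule are simplifications assumed here for illustration; Segcache's actual scoring and merge logic is more involved.

```c
#include <assert.h>

/* Score an object by its frequency-over-size ratio: objects that are
 * accessed often relative to the space they occupy score higher and
 * are more worth retaining when segments are merged. */
static double merge_score(unsigned freq, unsigned size_bytes)
{
    return size_bytes ? (double)freq / (double)size_bytes : 0.0;
}

/* Hypothetical retention rule: keep an object during a merge if its
 * score clears a cutoff chosen so the merged segment fits. */
static int retain_on_merge(unsigned freq, unsigned size_bytes, double cutoff)
{
    return merge_score(freq, size_bytes) >= cutoff;
}
```

The key property is size-awareness: a large, rarely hit object scores below a small object with the same hit count, so merging reclaims space where it buys the least hit ratio.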
Caffeine [3] is a high-performance caching library for Java. It adopts the Window TinyLFU eviction policy [13], which involves two cache areas: a main cache and a window cache. The main cache uses the Segmented LRU eviction policy with a TinyLFU admission policy, where the Segmented LRU space is partitioned into two regions holding 80% hot items and 20% non-hot items. The window cache uses LRU eviction and no admission policy. The sizes of the main cache and window cache can be determined adaptively by hill-climbing optimization. Their evaluation shows that Caffeine's implementation achieves hit rates near Belady's theoretical optimal upper bound across a range of workloads.
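The capacity split described above can be sketched as a small sizing helper. Only the 80/20 split of the main cache comes from the description; the window share is a parameter here precisely because Caffeine tunes it adaptively by hill climbing, and the struct and function names are assumptions for illustration.

```c
#include <stdlib.h>
#include <assert.h>

/* Hypothetical W-TinyLFU capacity split (in entries or bytes). */
typedef struct {
    size_t window;         /* LRU window, no admission policy   */
    size_t protected_seg;  /* main cache: hot items (80%)       */
    size_t probation;      /* main cache: non-hot items (20%)   */
} wtlfu_sizes_t;

/* Split a total capacity: window_pct percent goes to the window
 * cache; the remainder (main cache) is partitioned 80/20. */
static wtlfu_sizes_t wtlfu_partition(size_t total, unsigned window_pct)
{
    wtlfu_sizes_t s;
    s.window = total * window_pct / 100;
    size_t main_sz = total - s.window;
    s.protected_seg = main_sz * 80 / 100;
    s.probation = main_sz - s.protected_seg;
    return s;
}
```

New entries are admitted into the window; on window eviction, TinyLFU's frequency sketch decides whether the candidate displaces the probation victim in the main cache.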
Research caching. MemC3 [16] uses Cuckoo hashing and removes Memcached's LRU chain pointers and locks by implementing an approximate LRU cache based on the CLOCK replacement algorithm to improve concurrency and throughput, at the cost of memory efficiency; it only works well for workloads with targeted characteristics. MICA [24] adopts an append-only circular log, a write-friendly data structure that places new objects only at the end of the log. It maps accesses to specific CPU cores and partitions data to improve scalability and throughput, but it only supports FIFO and an approximated LRU policy. HotRing [8] is a key-value store with an ordered-ring hash structure that is lock-free, supporting massive concurrent accesses and improving system throughput.

CONCLUSION
This paper presents the results of an empirical study of the performance impacts of two popular LRU implementations on Memcached. Our results reveal that the KLRU implementation, which is free of LRU lists and locks, yields slightly better write performance and less metadata overhead. Our evaluation demonstrates that both implementations of Memcached exhibit close-to-linear scalability under read-intensive workloads. However, Memcached's throughput under write-intensive workloads stops scaling beyond 12 threads, even with the LRU lock-free implementation, which implies that the LRU lock is not currently Memcached's write-performance bottleneck. Moreover, we demonstrate that relaxing the global slab allocator lock improves write performance, but Memcached's write performance still does not scale well with more threads. We believe these findings offer valuable insights for the development of future in-memory cache designs.

Figure 3: Request Latency vs. Time. Latency is measured as the average latency over each second. The cache's hit ratios at selected time points are indicated in the corresponding colors. (a) shows a workload that favors SegLRU; (b) shows a workload that favors KLRU.

Figure 8: (a) Memcached's throughput over time, measured every 300 seconds. (b) Memcached's memory consumption over time, measured every 300 seconds. Memcached is configured with 10 worker threads, and the trace used here is the Twitter 034 trace.

Figure 9: Throughput Differences after Bypassing the Network

Table 1: Memcached Request Tail Latency.

Table 2: Cost of sampling K items from a slab class, with K from 1 to 32. The sampling cost grows linearly as K increases.

Figure 6: Memcached Thread Scalability under Workloads with Different R/W Ratios. The four traces used are Twitter 034, 026, 041, and 032, respectively.

Figure 10: Impacts of the Slab Allocator Lock on Memcached with 20 worker threads running Twitter trace 032.