MC-ELMM: Multi-Chip Endurance-Limited Memory Management

Non-volatile memories (NVMs) have become a staple of architectural research. NVMs naturally enable techniques such as crash consistency; persistent snapshots, logs, and heaps; security-enhanced memories; and even near- or in-memory processing. However, all current NVM technologies suffer from limited write endurance. Monolithic 3D integration (M3D), an NVM-enabled, near-memory technique, drastically increases compute-to-memory connectivity, improving the energy-delay product (EDP), especially for data-intensive workloads. However, M3D systems have another constraint: scaling M3D memory capacity adds multiple compute-plus-memory chips in a NUMA arrangement. In response, we first develop a lifetime extension mechanism for endurance-limited memories (ELMs) that extends chip lifetime from mere minutes to several years. Our page-based scheme minimizes execution disruption. Second, we extend our single-chip scheme to multiple M3D chips. We show that NUMA policies for DRAM systems are ill-suited for M3D because they either incur too many costly off-chip accesses or sacrifice lifetime. Our technique preserves NUMA locality benefits while significantly improving overall system lifetime. For homogeneous multi-threaded workloads running across multiple NUMA nodes (chips), our multi-chip scheme increases lifetime over our single-chip scheme by a geometric mean of 48%. For a heterogeneous mixture of workloads running on a multi-monolithic-chip cluster, we increase cluster lifetime by 4.7×, bounded by a 6% energy and 6% runtime overhead (commonly with no runtime overhead at all).


INTRODUCTION
Recently, non-volatile memories (NVMs) have given rise to a host of new architectural techniques. Like flash memory, NVMs retain their contents without power, but add byte-level addressing to enable techniques such as crash consistency; persistent snapshots, logs, and heaps; security-enhanced memories; and even near- and in-memory processing. However, all current NVM technologies suffer from limited write endurance: they are endurance-limited memories ("ELMs").
Due to fabrication constraints, M3D systems must use ELMs instead of DRAM. ELMs include phase-change RAM (PCRAM), resistive RAM (RRAM), and spin-torque transfer magnetic RAM (STT-MRAM). Bit cells within these memories can be written only $10^5$ to $10^9$ times before failure (see Sec. 3.1). A system that naïvely attempts to treat ELMs just like unlimited-write-endurance DRAM will fail within one day's time (see Sec. 4).
M3D systems achieve large energy and execution time benefits by keeping memory accesses on-chip whenever possible. However, in order to scale compute and memory capacity, we must also support multiple M3D chips in a cluster. Many past papers have looked at extending ELM lifetime within a single chip. However, we show that in a multi-chip cluster environment, extending lifetime individually within chips leaves around 80% of cluster lifetime unrealized (see Sec. 8.4). This is because different workloads wear the chips at different rates. Thus, to extend lifetime, we should efficiently remap workloads across chips. Existing work either (i) assumes the presence of off-chip DRAM, which penalizes EDP [8][9][10], (ii) does not consider multiple chips, or (iii) in a multi-chip system, assumes that all execution must pause while the entire contents of chips' memories are swapped at once. In today's availability- and performance-critical datacenters, pausing execution in such a way is unacceptable.
We are thus tasked with improving lifetime not just on a single ELM chip, but across an entire cluster composed of ELM chips, with minimal execution disruption. To accomplish this, we modify the virtual memory system to add first-class support for endurance tracking in units of pages, instead of entire chip memories. In Secs. 5 and 6.4, we show that our mechanism increases single-chip lifetime by hundreds of times for a 1% energy penalty and 1% runtime penalty, commonly with no runtime penalty at all. Then, in Sec. 8, we extend our single-chip scheme to a multi-M3D cluster, showing that it extends lifetime a further 4.7× over our single-chip scheme, bounded by an additional 6% energy and 6% runtime penalty (and commonly with 0% runtime penalty).
Fig. 1: Multiplicative lifetime benefits from stacking our techniques.

OBJECTIVE
We define lifetime as the minimum amount of runtime required to make any single bit in a chip's memory fail. For simplicity, we do not consider complementary, compatible techniques such as backup arrays, error-correcting codes, and ChipKill [16,17]. We use the term "frame" to refer to the smallest unit of physical memory manageable by the operating system (for example, 4 KiB frames, though we use larger 1 MiB "jumbo" frames to reduce overhead). Similarly, a "page" denotes a unit of virtual memory which maps onto a physical frame. A "line", or "cache line", denotes the smallest unit of memory tracked by a cache. In later sections, in our distributed multi-chip scheme, an individual chip forms a "node".
In Sec. 4, we use a standard LLC to enhance single-chip lifetime and show that meeting lifetime goals with an LLC alone quickly becomes impractical. In Sec. 5, we introduce a new page-level abstraction with bit-level rotation and write elimination. In Sec. 6, we use our new page abstraction and ELMs' byte-level addressability to massively enhance single-chip lifetime without interrupting execution. In Sec. 8, we extend our abstraction further to function across multiple chips, also while minimizing execution interruption. Fig. 1 shows the cumulative effects of these stacked techniques using geometric means across workloads. Table 1 shows array-level bit cell write endurance values for ELMs from literature. Throughout this work, we conservatively assume a bit cell write endurance of $10^6$, with which we show multi-year system lifetimes for CPU-based systems (see Sec. 6.4). Achieving our lifetime goals even with a pessimistic $10^6$ limit gives us more freedom to select memory technologies for their other favorable properties, such as energy and latency. Device advances beyond $10^6$ allow our technique to meet multi-year lifetimes for architectures which generate much more write traffic, such as GP-GPUs (see Sec. 7.2).

Simulated M3D System
Prior work has shown significant EDP benefits for M3D systems [2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Using workloads from Table 3, we performed independent simulations of an M3D system and compared against an off-chip DDR4 baseline, an HBM2 [27] system, and an all-SRAM-main-memory system, and found EDP benefits in line with prior publications. With past work having established EDP benefits, this paper focuses on system lifetime. The M3D system we use for lifetime analyses is shown in Table 2. We use DESTINY [28,29] to derive the monolithically-integrated RRAM [30] memory access characteristics, including wire and interconnect. First, we run DESTINY using 28nm monolithic RRAM technology parameters from a major manufacturer. These parameters include bit cell area and aspect ratio; bit cell resistances and capacitances for the low- and high-resistance states; SET and RESET voltages, pulse durations, and energies; access transistor width; and access transistor voltage drop. We then scale the 28nm results to 7nm using contacted gate pitch, supply voltage, gate capacitance, and drive current formulas. DESTINY assumes that all write-verify logic resides off of the critical path.
Throughout the paper, we simulate workloads using the zsim [31] architectural simulator, GCC 10 compiler at optimization level -O3, and Debian 11 operating system.For all workloads, we simulate 50 billion instructions while the workload is running at steady state (i.e., past any initialization phases).

Workload Selection
Our evaluations include the SPEC CPU 2017 [32] suite, as it contains workloads spanning business, scientific, creative, and technical domains. Some SPEC workloads are single-threaded ("spec_st"), and some are multi-threaded ("spec_mt"). All other workloads are multi-threaded. We also evaluate several machine learning workloads for both inference and training: the ResNet-152 [34] convolutional neural network (CNN), a long short-term memory (LSTM) [35] network, a Transformer attention-mechanism [36] natural language processing (NLP) network, and a generative adversarial network (GAN) [37]. For graph analytics, we run breadth-first search (BFS), connected components (CC), maximal independent subset (MIS), PageRank [40], and graph radii algorithms using the Ligra [41] suite. For these workloads, execution characteristics rely heavily not just on the algorithm being run, but on the dataset being analyzed. For this reason, we evaluate each graph algorithm on two distinct inputs: (i) the real-world LiveJournal dataset [42,43], and (ii) a randomized R-MAT [44] graph.
We use Linux perf to measure last-level cache misses per kilo instruction (LLC MPKI), as well as memory read and write bandwidth. We use GNU time to measure maximum resident set size (RSS): the maximum amount of physical memory a process has mapped throughout its runtime. All profiling runs were performed on a dual-socket, 8-cores-per-socket Intel Haswell system with standard hardware prefetchers enabled. To normalize results across our single-threaded and multi-threaded workloads, we express memory traffic in units of bytes per kilo instruction. Our results are in Table 3.
The SPEC workloads span a wide range of memory intensities. The machine learning workloads, while bandwidth-intensive, generally have low MPKI values. This is likely because modern machine learning frameworks use batch processing to enhance dataflow to compute units. The graph analytics workloads are especially memory-intensive, with both high read/write bandwidths and high LLC MPKI values.
In Sec. 7, we analyze general-purpose GPU (GP-GPU) workloads, and in Sec. 8.4 we analyze workloads running across a production HPC cluster. In these, we observe similar trends.

EXTENDING LIFETIME BY BUFFERING WRITES
4.1 First Line of Defense: The Last-Level Cache
Write-back caches inherently prevent many writes from reaching main memory. Thus, before considering purpose-built write buffers, we first explore the effect of last-level cache (LLC) size in aiding system lifetime.
Fig. 2 shows overall system lifetime as LLC size is increased. All other specs are as in the M3D system in Table 2, with a $10^6$ cell write endurance. The LLC's 16-way set associativity is similar to our real Haswell system's 20-way LLC, and its maximum 8-MiB-per-core size is also similar. Increasing LLC size quickly runs into diminishing returns. Our workloads' RSSes (Table 3) are gigabytes in size; thus, at some point, dirty data must be evicted from the megabytes-large cache. The LLC alone cannot meet a multi-year system lifetime.
Fig. 2: Single-chip lifetime vs. LLC size (higher is better). Benchmarks from Table 3.

Aside: An Explicit Write Buffer
We now consider the effect of an explicit write buffer [45][46][47]. Similar to most LLC implementations, we use a single shared write buffer (and not one per core) to coalesce requests across multiple cores. We place the write buffer after the LLC to eliminate writes immediately before they reach memory. The write buffer allocates lines only upon their eviction from another cache (here, the LLC); a read will not fault a line into the write buffer. Like many LLCs, it uses a physically-indexed, physically-tagged (PIPT) scheme to avoid issues with coherence and aliasing, and is 16-way set-associative. For the write buffer's eviction policy, we compare LFU (least frequently used, which counts both reads and writes as a "use"), LFW (least frequently written, which counts only writes), LRU (least recently used), and LRW (least recently written). All four of these policies can be implemented efficiently in hardware: LFU/W by maintaining a small frequency table within each set, and LRU/W by maintaining a small timestamp table. The simulated M3D system is otherwise configured as in Table 2.
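The four eviction policies differ only in which per-line statistic selects the victim. The following is a minimal sketch of victim selection within one set, assuming hypothetical per-line counters and timestamps (`uses`, `writes`, `last_use`, `last_write`); it illustrates the policy definitions above and is not the simulated hardware.

```python
from dataclasses import dataclass

@dataclass
class BufferLine:
    """Hypothetical bookkeeping for one write-buffer line within a set."""
    tag: str
    uses: int        # reads + writes (LFU)
    writes: int      # writes only (LFW)
    last_use: int    # timestamp of last read or write (LRU)
    last_write: int  # timestamp of last write (LRW)

# Each policy evicts the line that minimizes one statistic.
POLICY_KEYS = {
    "LFU": lambda line: line.uses,
    "LFW": lambda line: line.writes,
    "LRU": lambda line: line.last_use,
    "LRW": lambda line: line.last_write,
}

def pick_victim(lines, policy):
    """Return the line that the given policy would evict from this set."""
    return min(lines, key=POLICY_KEYS[policy])

# Illustrative (made-up) state of a 4-line set.
example_set = [
    BufferLine("A", uses=10, writes=1, last_use=100, last_write=10),
    BufferLine("B", uses=3, writes=3, last_use=50, last_write=50),
    BufferLine("C", uses=6, writes=2, last_use=40, last_write=30),
    BufferLine("D", uses=5, writes=4, last_use=45, last_write=5),
]

victims = {p: pick_victim(example_set, p).tag for p in POLICY_KEYS}
print(victims)
```

In this made-up state each policy picks a different victim, which shows how a read-heavy but rarely written line (A) is retained by LFU and LRU yet evicted by LFW.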
Each write buffer configuration is paired with an 8 MiB LLC. For comparison, we also measure the effect of simply increasing LLC size with no write buffer. Fig. 4 shows our results. The write buffer's effectiveness vs. a similarly-sized LLC drops off substantially at larger sizes. For this reason, we elect not to include the write buffer in our simulated system elsewhere in the paper.

WEAR-LEVELING WITHIN A MEMORY FRAME
5.1 Prequel: Redundant Write Elimination
Our within-the-frame scheme uses redundant write elimination (RWE) [48] as a subroutine. Before writing a dirty cache line back to main memory, that cache line's old data is read in from memory. Then, circuitry in the memory controller computes a bitmask, where each mask bit is 1 if that bit differs in the cache line and main memory, and 0 otherwise; i.e., bitmask = (dirty cache line ⊕ line in memory). Finally, when writing the cache line back to memory, only bit cells enabled by the bitmask are actually toggled. Table 2's write energy and latency are inclusive of the extra read for RWE.
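The bitmask computation can be sketched in a few lines. This is an illustrative software model only (the paper performs it in memory-controller circuitry); the 64-byte line size and the function names are assumptions.

```python
LINE_BYTES = 64  # assumed cache line size, for illustration only

def rwe_bitmask(line_in_memory: bytes, dirty_line: bytes) -> bytes:
    """XOR of old and new data: a mask bit is 1 exactly where the bit differs."""
    if len(line_in_memory) != len(dirty_line):
        raise ValueError("lines must have the same length")
    return bytes(old ^ new for old, new in zip(line_in_memory, dirty_line))

def bits_toggled(mask: bytes) -> int:
    """Population count of the mask: bit cells that are actually written."""
    return sum(bin(byte).count("1") for byte in mask)

old_line = bytes(LINE_BYTES)                        # all zeros in memory
new_line = bytes([0x0F]) + bytes(LINE_BYTES - 1)    # only 4 bits differ

mask = rwe_bitmask(old_line, new_line)
toggled = bits_toggled(mask)
toggle_probability = toggled / (8 * LINE_BYTES)
print(toggled, toggle_probability)
```

Here a 512-bit writeback toggles only 4 cells; rewriting identical data would toggle none.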
In Fig. 3, for each workload, we plot the average probability that a bit will be toggled upon a write. Different workloads are shown on the x-axis and are numbered as in Table 3. To collect bit toggle statistics, we use a modified version of DynamoRIO's [49] drcachesim [50] tool. In every workload we observe, bit toggle probability was less than 0.5, with a mean of 0.24.

A Second Observation about Bit Write Patterns
We use RWE to write only those bits that "actually" changed in memory. However, some bits may change much more frequently than others; for example, the low-order bits of a counter change exponentially more often than high-order bits. We therefore desire to "spread out" bit writes via rotation, a form of remapping. The scheme in [48] rotates the contents of a cache line by randomized byte-aligned values as they are written back to memory. We propose a conceptually-similar subroutine, dubbed "randomized rotation" (RR). Our RR differs from [48] in two ways: (i) we allow shifting by any number of bits, not just 8-bit (byte)-aligned, for greater uniformity, and (ii) our RR metadata is compatible with paged virtual memory, which enables our multi-frame (Sec. 6) and multi-node (Sec. 8) techniques. The combined efficacy of RWE & RR over LLC-only is shown in Fig. 1.
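As an illustration of bit-granular rotation, the sketch below rotates a small bit vector by an arbitrary offset and undoes the rotation on read. It models the frame contents as a Python integer of `width` bits, which is an assumption made for brevity; the function names are hypothetical.

```python
def rotate_left(value: int, shift: int, width: int) -> int:
    """Rotate a width-bit value left by shift bits (any bit offset, not byte-aligned)."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask

def rotate_right(value: int, shift: int, width: int) -> int:
    """Inverse of rotate_left: used to recover logical contents on a read."""
    return rotate_left(value, width - (shift % width), width)

WIDTH = 16          # toy "frame" of 16 bits
rr_value = 3        # hypothetical RR value stored as page metadata

logical = 0b1                                   # a counter's hot low-order bit
stored = rotate_left(logical, rr_value, WIDTH)  # physical bit 3 absorbs the wear
recovered = rotate_right(stored, rr_value, WIDTH)
print(bin(stored), recovered)
```

Picking a new RR value moves the frequently toggled logical bit onto a different physical cell, while reads remain correct as long as the current RR value is known.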

Hardware & OS Support for RWE & RR
Our goal is to support redundant write elimination (RWE) and randomized rotation (RR) at the page level, to provide an abstraction on which to base our single-chip and multi-chip schemes. RWE and RR require minor hardware and software modifications. For RWE, computation and application of the write bitmask can be done with small modifications to the memory controller (Fig. 5), without any intervention from processor cores. Existing ELM macros from major suppliers already contain the bitmask-enable circuitry. For RR, page table entries must be modified to support the addition of the bit-level frame rotation field ("RR value"). The field indicates the bit-level offset by which the frame's contents should be shifted before applying a read or write.

Page Table Entry Size.
Fortunately, for RR, the number of bits required to be added to each page table entry is small. In our example system, we use 1 MiB pages. 1 MiB falls in between conventional 4 KiB pages and 2 MiB Linux-x86-64 huge pages [51], which are commonly used to reduce paging overhead. Thus, to represent a bit-level RR value in [0, frame size in bits), we require $\log_2(2^{20} \times 8) = 23$ bits. [...] multiple of 8 bytes). Thus, many bits in the x86-64 PTE are currently simply ignored or reserved. Specifically, there are 26 such bits. Thus, we can fit our 23-bit RR value into just the ignored/reserved bits, with 3 bits to spare. Our RR metadata therefore does not increase PTE size at all.

Bootstrapping Page Table Lookups.
Recall that each PTE now stores the RR value of that page as metadata. However, we cannot look up the RR value of the highest level of the page table ("page directory") from the page containing the page directory itself. To remedy this, for each process, the OS additionally stores the RR value of the page directory. This enables us to bootstrap the virtual-to-physical page table walk.
Of course, not all physical frames are guaranteed to be mapped to a process at any given time. Yet, we still wish to keep track of their RR metadata. For this, the OS additionally maintains a "null page table" of unmapped frames, which acts as a sort of free list. The PTEs within the null page table contain the RR values for all unmapped frames.

Impact on TLB.
As in any modern page table implementation, we use a (multi-level) TLB to cache page table entries to speed lookups for virtual-to-physical translations. Each TLB entry must now additionally include the RR bits. For a 1024-entry TLB, using 1 MiB pages, this is an extra (1024 × 23 bits per entry) ≈ 3 KiB.
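The metadata sizes quoted in this section follow from simple arithmetic, reproduced below. The constants are taken from the text (1 MiB frames, 26 ignored/reserved PTE bits, 1024 TLB entries); the variable names are illustrative.

```python
FRAME_BYTES = 1 << 20    # 1 MiB frames, as in the example system
PTE_SPARE_BITS = 26      # ignored/reserved x86-64 PTE bits, per the text
TLB_ENTRIES = 1024

frame_bits = FRAME_BYTES * 8
# Bits needed to encode any bit offset in [0, frame_bits).
rr_bits = (frame_bits - 1).bit_length()
spare_after_rr = PTE_SPARE_BITS - rr_bits
tlb_extra_kib = TLB_ENTRIES * rr_bits / 8 / 1024

print(rr_bits, spare_after_rr, tlb_extra_kib)
```

The TLB overhead works out to 2.875 KiB, which the text rounds to 3 KiB.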

5.3.4 An RR Cache for LLC Writebacks.
Before performing any memory access (read or write), the memory controller must have the RR offset available. For reads, this information is provided via the aforementioned RR field in the page table or TLB. However, writebacks require a different approach. When the PIPT LLC initiates a writeback, the physical frame address of the evicted line cannot be used for RR lookups on the virtually-addressed page table or TLB.
For this reason, separately from the LLC, we maintain a small RR cache (RRC) that maps physical frame addresses to those frames' RR values. Upon LLC writeback, the RRC is consulted to determine the frame's RR value. Upon an RRC miss, the RR value must be read from memory; specifically, it is read from the frame's header region residing at a fixed offset from the start of the frame itself (see Fig. 8). Besides faulting a fresh RR value into the RRC via a read of the frame header, we must also support a simple update operation, where the RR value corresponding to a given frame address is updated. Sec. 6.2 details the circumstances of this update.
An RRC miss upon LLC writeback incurs an extra memory read, so the RRC must be highly effective. Fig. 6 shows the weighted mean across benchmarks for RRC hit rate. Even at small sizes and low associativities, the RRC's hit rate approaches 100%.
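A behavioral sketch of the RRC's two operations (lookup with fill-on-miss, and update) follows. For brevity it assumes a fully-associative LRU organization and a hypothetical `read_header` callback standing in for the frame-header read; the evaluated RRC is set-associative hardware, so this is a model of the interface only.

```python
from collections import OrderedDict

class RRCache:
    """Toy RR cache: physical frame address -> RR value (fully associative, LRU)."""

    def __init__(self, capacity, read_header):
        self.capacity = capacity
        self.read_header = read_header  # hypothetical: frame address -> RR value in header
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, frame):
        """Return the frame's RR value, reading the frame header on a miss."""
        if frame in self.entries:
            self.hits += 1
            self.entries.move_to_end(frame)
            return self.entries[frame]
        self.misses += 1                 # costs one extra memory read
        rr = self.read_header(frame)
        self._install(frame, rr)
        return rr

    def update(self, frame, rr):
        """Overwrite a frame's RR value, e.g. after the frame is re-rotated."""
        self._install(frame, rr)

    def _install(self, frame, rr):
        self.entries[frame] = rr
        self.entries.move_to_end(frame)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

headers = {0: 5, 1: 9, 2: 17}            # made-up RR values stored in frame headers
rrc = RRCache(capacity=2, read_header=headers.__getitem__)
first = rrc.lookup(0)                    # miss: filled from the header
rrc.update(0, 11)                        # new RR value for frame 0
second = rrc.lookup(0)                   # hit: sees the updated value
rrc.lookup(1)
rrc.lookup(2)                            # evicts frame 0 (least recently used)
print(first, second, rrc.hits, rrc.misses)
```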

WEAR-LEVELING AMONG MULTIPLE FRAMES
We use RWE and RR to "spread out" bit writes within the frame. A similar phenomenon is present at the page level: some pages (e.g., those containing the call stack) will be written more often than others. If we periodically remap these heavily-written virtual pages throughout physical memory, we can increase the lifetime of the system. To that end, we devise a hybrid hardware-software scheme dedicated to managing the wear level of all frames within a chip's main memory. To simplify and reduce its hardware footprint, our scheme leverages standard system main memory for bookkeeping; yet it accelerates some critical-path management operations in hardware to reduce overhead. Crucially, our scheme can extend to multi-chip systems, all while minimally interrupting execution.
Our wear-leveling system operates via a new metadata structure, which resides in the ELM itself. This metadata structure is separate from, and complementary to, our modified page table. In Sec. 6.2, we demonstrate the low endurance overhead of managing metadata directly in ELM.

New Metadata Structure
The OS maintains a multi-level hierarchy of queues. Each queue is implemented as a linked list of descriptors of physical frames. The nodes (frame descriptors) within each queue correspond to frames with a similar (within a specified margin) level of write wear. Specifically, "write wear" is the number of bit toggles that have occurred, anywhere in the frame, since manufacture. Fig. 7 shows the queue hierarchy data structure. F is the number of bits in a frame, C is the cell write endurance, and N is the number of levels (queues).
Upon the first power-up of the system after manufacture, all frames initially reside in the first queue. The queues data structure persists across reboots per the use of non-volatile memory. To implement the queues, we use intrusive doubly-linked lists, where each list node is a "frame descriptor" (see Fig. 8). To improve memory reference locality, each frame descriptor is stored as a header within the frame itself, as in Alloy Cache [53] and Unison Cache [54]. Each frame descriptor contains the following:
(1) A randomized rotation (RR) value, indicating the amount by which the frame's non-header contents have been rotated.
(2) A pointer to a PTE, or list of PTEs (for aliased pages), for the virtual page(s) mapped onto the frame. This enables a fast inverse mapping.
(3) The index of the queue that this frame descriptor is currently a member of. This enables the next-highest queue to be found in constant time.
(4) prev and next pointers for the intrusive linked list.
We also store a "lifetime bit writes" counter for the frame at a variable (randomly-rotated) offset. Unlike other frame metadata fields, this counter changes upon every write, so it is important to spread out its bit writes via RR. Other fields' writes are bounded by N, and N < C, so write wear upon them is not problematic. These fields are therefore stored at fixed offsets within the header.
The objective of the hierarchy of queues is to continuously "bin" frames by their amount of bit write wear. We progressively "promote" frames up the hierarchy as they increase in write wear. When a dirty cache line is evicted from the LLC, the memory controller does the following:
(1) Write the cache line back to main memory, using the RWE subroutine. Though we do require the RR value (via the RR cache) to apply the write, we do not shift/remap the entire frame contents at this point; only later, upon frame promotion.
(2) Compute the popcount (Hamming weight) of the bitmask from the RWE. This value equals the number of bits that were actually toggled in memory.
(3) Read the previous lifetime bit writes field from the frame descriptor, and increment it by the new bits-toggled count.
(4) If the frame's new lifetime bit write count exceeds the threshold value for its current queue, the memory controller fires an interrupt for the OS to promote the frame.
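The four writeback steps can be modeled as follows. This is a simplified sketch: it omits the RR offset applied in step (1), represents memory as a `bytearray`, and assumes that the threshold for queue index k is (k + 1) times a per-level threshold; all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FrameDescriptor:
    """Hypothetical subset of the frame header fields used on writeback."""
    lifetime_bit_writes: int = 0
    queue_index: int = 0

def handle_writeback(desc, memory, offset, dirty_line, level_threshold):
    """Apply one LLC writeback; return True if the frame should be promoted."""
    end = offset + len(dirty_line)
    old = bytes(memory[offset:end])
    # (1) RWE: toggle only the cells where old and new data differ.
    mask = bytes(o ^ n for o, n in zip(old, dirty_line))
    memory[offset:end] = bytes(o ^ m for o, m in zip(old, mask))
    # (2) Popcount of the mask = number of cells actually toggled.
    toggled = sum(bin(b).count("1") for b in mask)
    # (3) Accumulate into the frame's lifetime counter.
    desc.lifetime_bit_writes += toggled
    # (4) Signal promotion once the current queue's threshold is exceeded
    #     (assumed here to be (queue_index + 1) * level_threshold).
    return desc.lifetime_bit_writes > (desc.queue_index + 1) * level_threshold

memory = bytearray(128)
desc = FrameDescriptor()
first = handle_writeback(desc, memory, 0, b"\xff" * 64, level_threshold=1000)
second = handle_writeback(desc, memory, 0, b"\x00" * 64, level_threshold=1000)
print(first, second, desc.lifetime_bit_writes)
```

In hardware, a `True` result corresponds to the memory controller raising the promotion interrupt.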
Threshold values are calculated as follows. First, the system designer chooses a fixed number of queues (levels of the hierarchy), which we call N. Then, the threshold value T between each level is defined as $T = \frac{F \times C}{N}$. Each level of the hierarchy represents $1/N$ of the maximum expected number of bit writes that the frame could sustain with RWE and RR spreading writes uniformly. N should be chosen to be high enough to rebalance often for uniformity, yet low enough to not cause excessive write wear from the swaps themselves. In our experiments, we choose N = 50,000, as this budgets 5% of overall cell write endurance ($C = 10^6$) for swap-instigated writes.
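As a worked instance of the threshold formula, the snippet below plugs in the paper's parameters. It assumes F counts all bits of a 1 MiB frame (ignoring the small header) and that the cumulative threshold of queue index k is (k + 1)·T; both are simplifying assumptions for illustration.

```python
F = (1 << 20) * 8   # bits per 1 MiB frame (header ignored for simplicity)
C = 10**6           # assumed bit cell write endurance
N = 50_000          # number of queues (levels)

T = F * C // N              # bit toggles a frame absorbs per level
swap_budget = N / C         # fraction of cell endurance spent on promotions

def queue_threshold(k: int) -> int:
    """Cumulative bit-toggle count at which a frame leaves queue index k (assumed)."""
    return (k + 1) * T

print(T, swap_budget, queue_threshold(0))
```

With these values each level corresponds to roughly 168 million bit toggles per frame, and a frame that climbs all N levels has been swapped at most N times, which is the 5% budget mentioned above.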

Frame Promotion Algorithm
Upon frame promotion, the OS does the following:
(1) Remove the descriptor for the to-be-promoted frame from its current queue.
(2) Append the promoted descriptor to the tail of the next-higher queue.
(3) Pop the head of the lowest active queue.
(4) Push the descriptor we just popped in (3) onto the tail of the same (lowest) queue.
(5) Swap the memory contents of the frame from (2) with the frame from (4). While performing the swap, shift each frame's contents to its new RR value.
(6) Update the "current queue" field of both frame descriptors to reflect their new queues.
(7) Follow the PTE pointer of both frame descriptors to their PTE(s), and update both frames' PTE(s) + TLB entries to reflect their new virtual-to-physical mappings and RR values.
(8) In the RRC, update the RR values for the two frames.
Fig. 9: Single-chip workload lifetimes (higher is better).
Note that we perform steps 3 and 4 (pop and push) to more uniformly select frames for promotion within the same level. This is why each level of our hierarchy is a queue.
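The promotion steps can be sketched in software as below. This is a simplified model under several assumptions: `collections.deque` stands in for the intrusive linked lists (so removal is not constant-time here), the page table and RRC are plain dictionaries, each frame backs at most one page, and new RR values are drawn from a seeded pseudo-random generator. All names are hypothetical.

```python
from collections import deque
import random

class Frame:
    """Hypothetical stand-in for a frame descriptor plus the page it backs."""
    def __init__(self, fid, page):
        self.fid = fid
        self.page = page
        self.queue = 0
        self.rr = 0

def promote(frame, queues, page_table, rrc, frame_bits, rng):
    """Promote `frame` one level and swap its contents with a least-worn frame."""
    if frame.queue + 1 >= len(queues):
        raise RuntimeError("frame is already at the last level")
    queues[frame.queue].remove(frame)        # (1) leave the current queue
    frame.queue += 1                         # (6) record the new queue index
    queues[frame.queue].append(frame)        # (2) tail of the next-higher queue
    lowest = next(q for q in queues if q)    # lowest non-empty queue
    partner = lowest.popleft()               # (3) pop its head
    lowest.append(partner)                   # (4) push it back onto the tail
    if partner is not frame:
        # (5) swap contents: the hot page moves to the less-worn frame
        frame.page, partner.page = partner.page, frame.page
    for f in (frame, partner):
        f.rr = rng.randrange(frame_bits)     # (5) new rotation for each frame
        if f.page is not None:
            page_table[f.page] = (f.fid, f.rr)   # (7) new mapping and RR value
        rrc[f.fid] = f.rr                    # (8) keep the RR cache coherent
    return partner

frames = [Frame(i, page) for i, page in enumerate("ABCD")]
queues = [deque(frames), deque(), deque()]
page_table = {f.page: (f.fid, f.rr) for f in frames}
rrc = {}
partner = promote(frames[1], queues, page_table, rrc,
                  frame_bits=(1 << 20) * 8, rng=random.Random(0))
print(partner.fid, frames[0].page, frames[1].page)
```

After the call, the heavily written page "B" lives in frame 0 (the head of the lowest queue), while frame 1 sits in the next level holding the colder page "A".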
The promotion process itself does not add significant write wear to the frames. For the most-written frame, promotion occurs only once every $1/N$ fraction of the expected lifetime of the system, and even less often for all other frames. For example, for an expected system lifetime of 3 years, and N = 50,000 queues, we promote (swap) the most-written frame once every half an hour. Our simulations pessimistically assume no special write elision when swapping "clean" pages.
Neither do the frames representing the page table experience much additional wear from the updates. spec_mt.lbm induces frame promotions more frequently than any other workload, at 135 per second, or once every 7.5 ms. Upon promotion, 11 bytes (8 for virtual-to-physical mapping, and 3 for RR value) must be updated for each of the two PTEs participating in the swap. At 135 promotions per second, this results in an extra 3 KiB/s of write traffic. The data written to update the queues structure is only slightly larger: each of the two frames updates its RR (3 bytes), queue index (2 bytes), and 4 × 8-byte pointer fields, for 74 bytes/promotion, which is 10 KiB/s at 135 promotions/s.
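The figures in the two paragraphs above can be reproduced with the following arithmetic. All inputs come from the text; a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumed 365-day year
N = 50_000                           # number of queues
PROMOTIONS_PER_S = 135               # spec_mt.lbm, the worst case observed

# The most-written frame is promoted once per 1/N of the expected lifetime.
promotion_interval_min = 3 * SECONDS_PER_YEAR / N / 60

# Page table updates: two PTEs, each 8 B mapping + 3 B RR value.
pte_bytes_per_promotion = 2 * (8 + 3)
pte_traffic_kib_s = pte_bytes_per_promotion * PROMOTIONS_PER_S / 1024

# Queue structure updates: two descriptors, each RR + queue index + 4 pointers.
desc_bytes_per_promotion = 2 * (3 + 2 + 4 * 8)
desc_traffic_kib_s = desc_bytes_per_promotion * PROMOTIONS_PER_S / 1024

print(round(promotion_interval_min, 1),
      round(pte_traffic_kib_s, 1),
      desc_bytes_per_promotion,
      round(desc_traffic_kib_s, 1))
```

This gives about 31.5 minutes between promotions of the hottest frame, about 2.9 KiB/s of PTE updates, and about 9.8 KiB/s of descriptor updates, matching the rounded values in the text.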

Asymptotics
Step 1 can be done in constant time, as we just wrote the frame, so we have its (fixed-offset) descriptor. Step 2 is also constant-time, as each frame descriptor contains a current queue index. Steps 3 and 4 are constant-time, as we maintain the heads and tails of all queues in an array. The swap in step 5 is constant-time (all frames are the same size). Finally, all frame descriptors contain backpointers to the PTE(s) mapped onto them, with a singly-indirect fast path for non-aliased frames. Thus, page table metadata can be updated in constant time (that of a single pointer dereference plus write) for non-aliased frames, and linear time for aliased frames. The two RRC updates are constant-time.

Single-Chip Lifetime
We simulate the combined RWE, RR, and frame-promotion algorithms using a system-level in-house simulator written in C++. This simulator takes in zsim execution traces from each workload as input, and outputs frame promotion frequency and lifetime statistics. Lifetimes are shown in Fig. 9. For comparison, the effect of within-the-frame-only techniques is shown in blue. For the combined techniques, we report lifetime as a function of main memory gigabytes per core for several reasons. First, server systems found in datacenters and HPC today commonly have a memory-to-core ratio of around 2 to 16 gigabytes per physical core. We also do so to show that lifetime in our endurance scheme scales practically linearly with the total capacity of main memory. If we have twice as much memory to spread writes out over, we can expect twice the lifetime, and our simulations confirm this.

Simulated Overhead
To simulate the overhead of our frame-promotion scheme, we make some pessimistic simplifying assumptions. Chief amongst these is that, upon frame promotion, all execution is paused while the promotion process is occurring. In reality, this need not be the case: promotions can be queued and deferred while the main processor runs uninterrupted.
To support frame promotions, we add a small 128 KiB scratch buffer alongside the LLC for swaps (see Table 2). All data that is swapped flows through this buffer only, which eliminates the issue of caches becoming polluted with swap data upon promotion.
We simulate the time and energy required for each promotion by summing the following: (i) the round-trip cost to trap in and out of the kernel to service the promotion-triggered interrupt; (ii) the cost to update all data structures once in the kernel; and (iii) the cost required to rotate and swap the contents of the frames (two frame reads + two frame writes). Using C microbenchmarks, we measure kernel trap round-trip cost on our real Haswell system at 55 µs, and measure kernel data structure update cost at 10 µs. Core energy is calculated per these latencies and energy per instruction; ELM bandwidth & energy (Table 2) determine the cost for the data swap itself. For our most promotion-intensive workload, spec_mt.lbm, we observe a 0.9% runtime and 1.0% energy overhead, with lower overheads for all other workloads.

Promotions Without Interrupting Execution
Frame promotions may be performed in the background. For this, a small interrupt-handling core/unit should be included in the system, so that the main cores can run uninterrupted. Across all workloads, the highest steady-state promotion rate we observe is 135 promotions per second, or once every 7.5 ms. One frame promotion, inclusive of all data transfer and metadata updates, takes around 70 µs. Thus, promotion requests can be serviced in the background, with a bounded queue depth, at steady-state. Though multiple promotions may be triggered within 70 µs of each other, they can be queued to be serviced sequentially, in the background, by the hardware. We perform another experiment measuring the maximum promotion queue depth at any point in the workload's execution under these assumptions. The highest maximum queue depth we observe for any workload is 15, and the mean maximum queue depth is 5. A system designer can thus pick a relatively small fixed size for such a queue, with execution pausing only if the queue becomes full. For these simulations, we use N = 50,000 queues, with a bit cell write endurance of $C = 10^6$. The full system specifications are per the M3D system in Table 2. We use a 1-MiB-per-core, 16-way LLC only, with no explicit write buffer.
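To see why a small queue suffices, the sketch below computes the fraction of time the promotion unit is busy at the worst-case rate, and models a single FIFO server to find the peak number of outstanding promotions for a given arrival trace. The arrival times are made up for illustration; only the 135/s rate and 70 µs service time come from the text.

```python
PROMOTIONS_PER_S = 135
SERVICE_US = 70.0

# Fraction of time the promotion unit is busy at the worst-case steady-state rate.
utilization = PROMOTIONS_PER_S * SERVICE_US * 1e-6

def max_queue_depth(arrival_times_us, service_us):
    """Peak count of promotions waiting or in service, for one FIFO server."""
    finish_times = []
    server_free_at = 0.0
    peak = 0
    for t in sorted(arrival_times_us):
        start = max(t, server_free_at)        # wait for earlier promotions
        server_free_at = start + service_us
        finish_times.append(server_free_at)
        outstanding = sum(1 for f in finish_times if f > t)
        peak = max(peak, outstanding)
    return peak

# Hypothetical trace: a burst of three requests, then one 7.5 ms later.
burst_trace_us = [0.0, 10.0, 20.0, 7500.0]
peak_depth = max_queue_depth(burst_trace_us, SERVICE_US)
print(utilization, peak_depth)
```

A utilization under 1% is in line with the 0.9% runtime overhead reported when execution is instead paused for every promotion, and bursts drain long before the next steady-state request arrives.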

ALTERNATE ARCHITECTURE: GP-GPU
7.1 Workload Characteristics
GP-GPUs, with thousands of simple parallel cores coupled with wide memory buses, make for an interesting point of comparison with CPU-based systems. We measure write bandwidth (in MiB/s), as well as bit toggle probability, as these, along with overall system memory capacity, are the primary determinants of memory lifetime. CPU workloads are from Table 3, with write bandwidth measured on our Intel Haswell system at 2.0 GHz (as in Sec. 3.3), and bit toggle probability simulated as in Sec. 5.1. For the GP-GPU, we use Accel-Sim [55][56][57][58] to simulate an NVIDIA Quadro GV100 GP-GPU [59,60], featuring 5120 CUDA cores running at 1132 MHz and 32 GB of HBM2 memory with 870 GB/s of combined memory bandwidth.
For GP-GPU, we simulate the CUTLASS [61], DeepBench [62], Parboil [63], Rodinia 3.1 [64,65], and Accel-Sim "µbench" [58] suites, and plot the maximum write bandwidth observed in any kernel within each suite. We were unable to find an easy way to gather bit-toggle statistics for GP-GPU execution. As a remedy, we additionally simulate the Polybench [66] suite, which contains both CPU C and GP-GPU CUDA kernel implementations, and simulate its bit-toggle probability on a CPU (as in Sec. 5.1), and its write bandwidth on the GP-GPU. Fig. 10 shows our results. The GP-GPU workloads' write bandwidths exceed the largest CPU write bandwidth we observe by around 500×. The absolute-highest GP-GPU write bandwidth we observe, 661 GB/s, represents around 76% of the GV100's 870 GB/s peak combined memory bandwidth. Bit-toggle probability was also higher for the Polybench GP-GPU workload. The GP-GPU's architecture is capable of generating much more write traffic than a typical CPU.

Lifetime
We now quantify the effect of the GP-GPU's massively higher write bandwidth on system lifetime. We simulate all previous CPU and GP-GPU workloads, using our endurance techniques from Secs. 5 and 6. The simulated CPU-based system is the M3D one from Table 2. The GP-GPU simulated is the GV100 with endurance-limited memory. To make the comparison more straightforward, the CPU and GP-GPU systems both have the same main memory capacity (64 GiB). For the CPU system, this is equivalent to 8 GiB per core on multithreaded workloads from Sec. 6.4; for the GP-GPU, it is 2× the GV100's memory capacity. Fig. 11 shows our results. The dashed line indicates a 3-year lifetime. In summary, because GP-GPUs generate massively-higher write traffic, they require bit cell write endurances around $10^9$ to achieve multi-year lifetimes, even using our techniques (and are well under one year without them). Because our techniques from Secs. 5 and 6 support general-purpose CPU-facing virtual memory, they can be applied to GP-GPUs as well, with the addition of a small management core (which many GP-GPUs already contain).
Fig. 11: CPU- and GP-GPU-based lifetimes vs. bit cell write endurance, using 64 GiB memory (higher is better).

MULTI-CHIP

Multi-Node Execution: Defining the Problem
Traditional DRAM systems are commonly arranged in a non-uniform memory access (NUMA) fashion. For any given compute core, some regions of memory can be accessed with lower overhead than others. Typically, this involves different regions of memory being "managed" by different groups of cores ("sockets"), with inter-socket links providing connectivity between sockets. When using multiple M3D chips networked together, we similarly have a NUMA setup.
When running workloads on a NUMA system, we must choose which NUMA pool of memory to allocate pages from. There are two common policies for doing so: (1) first-touch and (2) interleave. In the first-touch policy, the page is mapped to the NUMA node of the first core that touches it. This is the default allocation policy of many systems, including the Linux kernel. First-touch assumes that subsequent accesses to a page will be made primarily by the same core that first accessed it. If this is indeed the case, then the first-touch policy minimizes off-NUMA-node accesses, which is usually advantageous for execution time and energy.
On the other hand, interleaving seeks to do the opposite: it distributes new page requests across all possible NUMA nodes in the system, usually in a basic round-robin fashion. While this is often bad for locality, resulting in many off-chip accesses, it can sometimes lead to more-even utilization of memory interconnects. In Fig. 12, we select every multi-threaded workload from our set and simulate them in a NUMA setting. Each workload runs across 8 cores, and each of those 8 cores is assigned to a separate NUMA domain. For each workload, we simulate both a first-touch and an interleave policy. Across the board, we see that first-touch is indeed better at minimizing off-chip accesses. This is particularly crucial in our M3D setting, where off-chip accesses are much costlier than on-chip.

Fig. 12: NUMA on-chip memory access percentage, first-touch vs. interleave (higher is better).
Fig. 13: NUMA write imbalance factor: most-written chip vs. average (lower is better).
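The two allocation policies can be contrasted with a small sketch. It replays a made-up access trace (pairs of accessing node and page name, both hypothetical) under each policy and reports the fraction of accesses served by the local node. It is a toy model for illustration, not the simulator behind Fig. 12.

```python
from itertools import count


def first_touch(accesses):
    """Map each page to the node of the first core that touches it."""
    placement = {}
    for node, page in accesses:
        placement.setdefault(page, node)
    return placement


def interleave(accesses, num_nodes):
    """Map pages to nodes round-robin, in order of first appearance."""
    placement = {}
    next_slot = count()
    for _, page in accesses:
        if page not in placement:
            placement[page] = next(next_slot) % num_nodes
    return placement


def on_node_fraction(accesses, placement):
    """Fraction of accesses whose page lives on the accessing node."""
    local = sum(1 for node, page in accesses if placement[page] == node)
    return local / len(accesses)


# Hypothetical trace: node 0 reuses pages a and b, node 1 reuses pages c and d.
trace = [(0, "a"), (0, "b"), (1, "c"), (1, "d")] * 2

ft_local = on_node_fraction(trace, first_touch(trace))
il_local = on_node_fraction(trace, interleave(trace, num_nodes=2))
print(f"first-touch: {ft_local:.0%} local, interleave: {il_local:.0%} local")
```

When each node keeps reusing the pages it touched first, first-touch keeps every access local, while round-robin placement sends roughly half of them off-node in this two-node toy case.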
In Fig. 13, we now measure the amount of "write imbalance" incurred by the first-touch and interleave policies. "Write imbalance" is defined as the number of bytes written to the most-written chip, divided by the average number of bytes written across chips. A write imbalance of 1 indicates perfectly balanced writes.
Here, first-touch loses substantially to interleave. Consider that multi-threaded workloads commonly have one "master" thread, with all other threads being worker threads. Suppose, for example, that the master thread does much more writing than the worker threads. In this case, the chip hosting the master thread will incur higher write wear than the others. This imbalance presents an opportunity for further extending system lifetime.
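The write-imbalance metric defined above is simple to compute. A minimal sketch follows; the function name and the example byte counts are illustrative, not measured values.

```python
def write_imbalance(bytes_written_per_chip):
    """Bytes written to the most-written chip divided by the per-chip average."""
    if not bytes_written_per_chip:
        raise ValueError("need at least one chip")
    mean = sum(bytes_written_per_chip) / len(bytes_written_per_chip)
    if mean == 0:
        raise ValueError("no writes recorded")
    return max(bytes_written_per_chip) / mean


# Hypothetical 8-chip examples.
balanced = write_imbalance([100] * 8)        # every chip written equally
one_hot = write_imbalance([800] + [0] * 7)   # a single chip takes every write
print(balanced, one_hot)
```

The metric ranges from 1 (perfect balance) up to the number of chips (one chip absorbs all writes).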

Multi-Chip Wear Leveling
Most of the time, to wear-level, we prefer to move pages within the same chip. This lets us avoid the large off-chip bandwidth, latency, and energy penalties. However, as we saw, many multi-node workloads produce an uneven amount of write wear across chips. Therefore, we desire a way to balance the amount of write wear evenly among multiple chips. We combine our insights from Secs. 5, 6, and 6.4 to offer a lightweight scheme for multi-chip wear leveling, featuring the locality benefits of a first-touch policy, but with a lower write imbalance.
Our solution is to adopt a first-touch allocation policy, but to periodically swap the entire contents of one node's memory with another node (while performing process migration [67][68][69] as well).
We must now determine how often to swap nodes' memories and which nodes' memories should be swapped. In our single-chip setup, we mapped virtual pages onto physical frames. Here, we likewise perform a mapping of "jobs" onto physical M3D chips. For this, we use a scheme extremely similar to our single-chip one: a hierarchical system of queues (Fig. 7). Instead of frame descriptors, we now have node descriptors; instead of F, the number of bits in a frame, we now use M, the number of bits in memory for an entire node. We again choose T = 50,000 as 5% of our 10^6 cell write endurance C; this means a swap will be initiated after every 1/50,000 of the node's overall write endurance. Each node's memory controllers maintain a counter, which counts the total number of bits written (after RWE) to that node's memory, for its entire lifetime (since manufacture). Whenever a frame is written anywhere on that node, we increment the counter by the number of bits toggled by that frame write. When the counter exceeds its threshold value, the memory controller fires an OS interrupt to promote the node in the queue hierarchy. Similarly to our single-chip scheme, promotion involves a swap of memory contents.
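The per-node counter can be sketched in software as follows. This is an illustration of the bookkeeping only, not the memory controller's actual logic: `NodeWriteCounter`, `threshold_bits`, and `on_promote` are hypothetical names, the threshold is left as a parameter rather than derived from T, M, and C, and re-arming the trigger by a fixed increment after each promotion is an assumption of this sketch.

```python
class NodeWriteCounter:
    """Counts bits toggled on one node since manufacture and signals promotions."""

    def __init__(self, threshold_bits, on_promote):
        self.lifetime_bits = 0              # never reset
        self.threshold_bits = threshold_bits
        self.next_trigger = threshold_bits  # assumed re-arming scheme
        self.on_promote = on_promote        # stands in for the OS interrupt

    def record_frame_write(self, old: bytes, new: bytes) -> int:
        """Add the bits that differ between old and new contents (post-RWE)."""
        if len(old) != len(new):
            raise ValueError("frame contents must have equal length")
        toggled = sum(bin(a ^ b).count("1") for a, b in zip(old, new))
        self.lifetime_bits += toggled
        while self.lifetime_bits > self.next_trigger:
            self.next_trigger += self.threshold_bits
            self.on_promote()
        return toggled


promotions = []
counter = NodeWriteCounter(threshold_bits=10, on_promote=lambda: promotions.append("promote"))
first = counter.record_frame_write(b"\x00\x00", b"\xff\x0f")   # 12 bits differ
second = counter.record_frame_write(b"\xff\x0f", b"\xff\x0f")  # redundant write
```

A redundant write (identical old and new contents) adds nothing to the counter, which is why the count is taken after redundant write elimination.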
All nodes' OSes maintain a coherent copy of the node descriptor hierarchy. Promotions happen infrequently enough that coherence overhead is not a concern (Sec. 8.5). When a promotion is triggered, the triggered node's OS does the following:
(1) Remove its own node descriptor from its current queue.
(2) Append that descriptor to the tail of the next-higher queue.
(3) Pop the head of the lowest active queue. This chooses the "target" node.
(4) Push the target node descriptor onto the tail of the same (lowest) queue.
(5) Broadcast to all other nodes that the triggered node and target node will swap.
(6) Swap the triggered and target nodes' memory contents.
(7) Update the "current queue" field of both node descriptors to reflect their new queues.
(8) Broadcast to all other nodes that the swap is complete.

Fig. 14: System lifetimes, with multi-chip wear leveling disabled vs. enabled (higher is better).
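The promotion steps listed above can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the class and method names are hypothetical, "lowest active queue" is interpreted as the lowest-indexed non-empty queue, promotion saturates at the top queue, the two broadcasts are omitted, and the swap is skipped if a node would be paired with itself.

```python
from collections import deque


class NodeHierarchy:
    """Hierarchy of queues holding node indices; queue 0 is the least worn."""

    def __init__(self, num_nodes, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.queues[0].extend(range(num_nodes))
        self.queue_of = {n: 0 for n in range(num_nodes)}

    def promote(self, triggered, swap_memory):
        src = self.queue_of[triggered]
        dst = min(src + 1, len(self.queues) - 1)        # assumed saturation at the top
        self.queues[src].remove(triggered)              # step 1
        self.queues[dst].append(triggered)              # step 2
        lowest_idx = next(i for i, q in enumerate(self.queues) if q)
        target = self.queues[lowest_idx].popleft()      # step 3: pick the target
        self.queues[lowest_idx].append(target)          # step 4: rotate it to the tail
        # Steps 5 and 8 (broadcasts to other nodes) are omitted in this sketch.
        if target != triggered:
            swap_memory(triggered, target)              # step 6
        self.queue_of[triggered] = dst                  # step 7
        self.queue_of[target] = lowest_idx
        return target


swaps = []
hierarchy = NodeHierarchy(num_nodes=4, num_queues=3)
chosen = hierarchy.promote(2, lambda a, b: swaps.append((a, b)))
```

Rotating the target to the tail of the lowest queue means successive promotions pair hot nodes with different lightly worn nodes rather than repeatedly with the same one.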
Each node descriptor contains the following:
(1) The index of the node it represents.
(2) The index of the queue that this node descriptor is currently a member of.
(3) prev and next pointers for the intrusive linked list.
Because each node maintains its own lifetime bit write counter, this value need not be included in the node descriptors.
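A node descriptor with these three fields, together with an intrusive doubly linked queue, might look like the sketch below. The type and method names are hypothetical. The point of the intrusive layout is that a descriptor can be unlinked from the middle of its queue in constant time, since the links live in the descriptor itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class NodeDescriptor:
    node_index: int                          # (1) node this descriptor represents
    queue_index: int = 0                     # (2) queue it currently belongs to
    prev: Optional["NodeDescriptor"] = None  # (3) intrusive list links
    next: Optional["NodeDescriptor"] = None


class IntrusiveQueue:
    """Doubly linked queue whose links are stored inside the descriptors."""

    def __init__(self, index):
        self.index = index
        self.head = None
        self.tail = None

    def push_tail(self, d):
        d.queue_index = self.index
        d.prev, d.next = self.tail, None
        if self.tail is not None:
            self.tail.next = d
        else:
            self.head = d
        self.tail = d

    def unlink(self, d):
        """Remove d in O(1); d must currently be a member of this queue."""
        if d.prev is not None:
            d.prev.next = d.next
        else:
            self.head = d.next
        if d.next is not None:
            d.next.prev = d.prev
        else:
            self.tail = d.prev
        d.prev = d.next = None

    def pop_head(self):
        d = self.head
        if d is not None:
            self.unlink(d)
        return d


q0, q1 = IntrusiveQueue(0), IntrusiveQueue(1)
descs = [NodeDescriptor(i) for i in range(3)]
for d in descs:
    q0.push_tail(d)
q0.unlink(descs[1])     # remove from the middle of queue 0
q1.push_tail(descs[1])  # append to the next-higher queue
```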
Since the number of nodes in a system is generally assumed to be far smaller than the number of pages on a single node, each node can maintain a copy of the node descriptor hierarchy in memory with low overhead compared to the single-chip scheme. Alternatively, a centralized controller (or consensus-elected leader) and a single canonical database can be used to store the hierarchy. In this case, nodes must still message the controller/leader (which is responsible for serializing those requests).

Multi-Node Lifetime Benefits
Fig. 14 shows the benefits of our multi-node wear-leveling scheme. Here, we select every multi-threaded workload from our set and simulate them in NUMA mode on our simulator from Sec. 6.4. Each workload is configured to run with 8 cores, and each of those 8 cores is assigned to a separate NUMA domain. Each NUMA domain (chip) has 8 GiB of main memory and uses a first-touch allocation policy. We simulate expected lifetime with our multi-node endurance mechanism disabled, and again with it enabled. For homogeneous multi-threaded workloads running across multiple NUMA domains, our multi-chip scheme increases lifetime over our single-chip scheme by a geometric mean of 48%.

Heterogeneous Mixture of Applications
Finally, we simulate our multi-chip endurance scheme using data from a real HPC cluster. We collected memory read and write statistics for jobs running on NERSC's "Cori" Intel Haswell nodes over the span of ten days [80,81]. The CDF in Fig. 15 shows the read and write bandwidths across all of Cori's Haswell nodes, sampled once per second. These write bandwidth statistics are a key input to our multi-node endurance simulation. We simulate using 8 nodes with write bandwidth values drawn uniformly from the CDF.
The other two inputs to our multi-node simulation (besides write bandwidth) are memory capacity per node and bit toggle probability (using redundant write elimination as in Sec. 5.1). Each Cori Haswell node has 32 physical cores and 128 GiB RAM. Because measuring bit toggles requires intensive dynamic binary instrumentation and is not supported by built-in architectural performance counters, we were not able to gather bit toggle probabilities on Cori. In this experiment, we therefore assume a bit toggle probability of 0.5. We justify this conservative assumption by noting that 0.5 is higher than any probability we observe for any workload from Sec. 3.3, and, furthermore, that an average bit toggle probability of 0.5 can be achieved for arbitrary data by passing the contents to be written through a pseudorandom function (PRF), such as a block or stream cipher, before they are physically written. With multi-node wear leveling disabled, the simulated cluster achieved a 2.5-year lifetime. With wear leveling enabled, it achieved an 11.7-year lifetime, a 4.7× gain.
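The claim that a PRF yields an average toggle probability of about 0.5 can be illustrated with a short sketch. It uses SHA-256 in a counter construction as a stand-in keystream generator, which is an assumption made only to keep the example dependency-free; a real design would use a proper block or stream cipher. The key, the per-write nonce, and all function names are hypothetical.

```python
import hashlib


def toggle_probability(old: bytes, new: bytes) -> float:
    """Fraction of bits that differ between the old and new contents."""
    if len(old) != len(new) or not old:
        raise ValueError("contents must be non-empty and of equal length")
    flipped = sum(bin(a ^ b).count("1") for a, b in zip(old, new))
    return flipped / (8 * len(old))


def keystream(key: bytes, nonce: int, length: int) -> bytes:
    """Stand-in PRF: SHA-256 over key, nonce, and a block counter."""
    out = bytearray()
    block = 0
    while len(out) < length:
        msg = key + nonce.to_bytes(8, "big") + block.to_bytes(8, "big")
        out += hashlib.sha256(msg).digest()
        block += 1
    return bytes(out[:length])


def whiten(data: bytes, key: bytes, nonce: int) -> bytes:
    """XOR data with the keystream; applying it twice restores the data."""
    ks = keystream(key, nonce, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))


page = bytes(4096)  # an all-zero page, rewritten with identical contents
stored_v1 = whiten(page, b"demo-key", nonce=1)
stored_v2 = whiten(page, b"demo-key", nonce=2)
p_plain = toggle_probability(page, page)
p_whitened = toggle_probability(stored_v1, stored_v2)
print(f"plain: {p_plain:.3f}, whitened: {p_whitened:.3f}")
```

Note the trade-off this exposes: whitening with a fresh nonce makes even a redundant write toggle about half of the cells, so 0.5 acts as a data-independent figure rather than a best case.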

Simulated Overhead
To simplify our overhead analyses, we again pessimistically assume that application execution on all nodes stops during a node promotion. To swap node contents, the swap buffer (Table 2) is again used for temporary space. As in our single-chip scheme, promotion delay is simulated as interrupt, plus data structure update, plus data transfer cost. We use the same 55 µs round-trip interrupt cost, with an added 50 µs data structure update cost. For multi-node swaps, off-chip bandwidth (not intra-chip memory bandwidth) is the bottleneck for data transfer. For this reason, we simulate using two different node-to-node interconnects: 400 Gbit/s Ethernet at 200 pJ/bit, and a 1.6 Tbit/s silicon interposer at 50 pJ/bit (see Table 2).
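The delay model above (interrupt, plus data structure update, plus data transfer) can be written out as a short calculation. The sketch assumes 128 GiB of memory per node, as in the Cori-based experiment, and a full-duplex link so that both directions of the swap overlap. The full-duplex assumption and the function name are not from the paper.

```python
GIB = 2**30
INTERRUPT_S = 55e-6     # round-trip interrupt cost
BOOKKEEPING_S = 50e-6   # data structure update cost


def promotion_delay_s(mem_bytes, link_bits_per_s, full_duplex=True):
    """Interrupt + bookkeeping + time to move one node's memory over the link."""
    transfer_s = mem_bytes * 8 / link_bits_per_s
    if not full_duplex:
        transfer_s *= 2  # the two directions of the swap are serialized
    return INTERRUPT_S + BOOKKEEPING_S + transfer_s


ethernet_s = promotion_delay_s(128 * GIB, 400e9)      # 400 Gbit/s Ethernet
interposer_s = promotion_delay_s(128 * GIB, 1.6e12)   # 1.6 Tbit/s interposer
print(f"Ethernet: {ethernet_s:.2f} s, interposer: {interposer_s:.2f} s")
```

Under these assumptions the transfer term dwarfs the 105 µs of interrupt and bookkeeping cost, so the interconnect choice determines the promotion delay almost entirely.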
Swap energy is modeled by summing several quantities across the two participating nodes. For both nodes, we pessimistically assume that during swap kernel bookkeeping (105 µs trap + data structure update, as above), all cores are spinning at IPC = 1. (In reality, only one core/thread need perform the update.) To calculate data energy, each node (i) reads its entire 128 GiB memory and (ii) transmits that data over the wire, where it is (iii) received into the other node's swap buffer and then (iv) written into that node's memory. Thus, each of (i-iv) occurs twice, and all these energies are added to the total. Finally, leakage energies for the duration of the swap (bookkeeping latency plus data transfer latency) are added in.
We simulate multiple independent copies of our most swap-intensive application (spec_mt.lbm) running on Cori. The application invokes a node-to-node swap every 71 seconds. For the 400 Gbit/s interconnect, we observe a 5.4% runtime and 5.4% energy overhead, and for the 1.6 Tbit/s interconnect, a 2.7% runtime and 2.8% energy overhead, inclusive of all overheads for our single-chip scheme.
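The energy accounting described above can be sketched as a sum of per-node terms that is then doubled for the two participating nodes. Only the link energies (200 pJ/bit and 50 pJ/bit) come from the text; the memory read, buffer, and write energies and the core and leakage powers are placeholder parameters that default to zero here, and the function name is hypothetical.

```python
GIB = 2**30


def swap_energy_j(mem_bytes, link_pj_per_bit, read_pj_per_bit=0.0,
                  buffer_pj_per_bit=0.0, write_pj_per_bit=0.0,
                  core_power_w=0.0, leakage_power_w=0.0,
                  bookkeeping_s=105e-6, transfer_s=0.0):
    """Total energy of one node-to-node swap, summed over both nodes."""
    bits = mem_bytes * 8
    # (i) read, (ii) transmit, (iii) receive into swap buffer, (iv) write back
    per_bit_pj = (read_pj_per_bit + link_pj_per_bit
                  + buffer_pj_per_bit + write_pj_per_bit)
    data_j = bits * per_bit_pj * 1e-12
    bookkeeping_j = core_power_w * bookkeeping_s          # cores spinning
    leakage_j = leakage_power_w * (bookkeeping_s + transfer_s)
    return 2 * (data_j + bookkeeping_j + leakage_j)       # both nodes


# Link-only contribution for a 128 GiB swap on each interconnect.
ethernet_link_j = swap_energy_j(128 * GIB, link_pj_per_bit=200)
interposer_link_j = swap_energy_j(128 * GIB, link_pj_per_bit=50)
print(f"Ethernet link: {ethernet_link_j:.1f} J, interposer link: {interposer_link_j:.1f} J")
```

With every placeholder left at zero, the result isolates the interconnect's share of the swap energy; supplying array and core parameters (for example, from Table 2) would fill in the remaining terms.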

Promotions Without Interrupting Execution
As with single-chip frame promotions, multi-chip cross-node promotions (memory swaps) can be performed in the background. As in VM live migration, we recommend (but do not require) that each node participating in the swap has around half of its memory free, to enable concurrent buffering of the other node's sent memory contents and any updates. For two nodes with 128 GiB of memory each, connected by 400 Gbit/s Ethernet, the swap delay is around 2.5 seconds. On the cluster, we observe a swap every 71 seconds, so swaps can be performed in the background at steady state, with a bounded promotion queue depth. In simulation, the promotion queue depth never exceeded the number of nodes (eight).

RELATED WORK
Many works have investigated the use of alternative memory technologies as a supplement to, or a substitute for, DRAM. [48, 84, 85, 89, 91, 94, 96-98, 100, 103, 106-109] propose PCRAM or RRAM as either partial or total replacements for DRAM, but do not consider multi-node NUMA systems. [110] analyzes ELM-NUMA systems at the filesystem, rather than the main memory, level. [101] takes an alternate approach to ELM-NUMA lifetime via node bandwidth sharing and mellow writes [99]. [10] and [106] detail wear-leveling schemes that use a very small amount of metadata, but require that every single byte in main memory be shifted periodically. [108] maintains a hierarchy of frames, but does not integrate RWE and RR, and is single-node only.
Flash-based systems also employ wear-leveling schemes [111]. These can be broadly categorized as either static [112][113][114][115][116] or dynamic [117][118][119][120][121][122]. Static schemes attempt to move cold data to more-worn blocks, whereas dynamic schemes repeatedly reuse blocks with lesser erase counts, but do not attempt to move cold data. Our technique encompasses all frames in memory, and thus incorporates benefits of both static and dynamic wear-leveling. Unlike flash, ELMs such as PCRAM, RRAM, and STT-MRAM do not require high-overhead erase cycles before data can be reused (and consequently do not require garbage-collection routines).
[123] thoroughly analyzes non-ELM NUMA systems and draws the conclusion that, in comparison to then-current wisdom, contention and queueing delays, not wire delays, were responsible for the bulk of NUMA performance degradation. M3D is a drastic-enough paradigm change to warrant revisiting these assumptions, since, in M3D, on-chip communication is vastly more efficient than off-chip. Hardware-assisted page placement [124,125], Carrefour [123], and AutoNUMA [126] can be thought of as first-touch/interleave hybrids, but all consider DRAM systems only, and do not attempt to co-optimize NUMA locality with ELM endurance.
Past work on M3D memory endurance has primarily focused on special-purpose machine learning accelerators, and not general-purpose CPU cores [105]. CPU-focused M3D work has thus far not deeply investigated the NUMA locality effect, and how best to preserve lifetime within that constraint. In contrast, our added hardware and software (Sec. 5.3) are structured to multiplicatively "stack" (Fig. 1) all techniques from Table 4, providing millions-of-times higher lifetime over naïve execution, while preserving runtime and energy benefits (Sec. 6.5, Sec. 8.5).

CONCLUSION
All NVM-enabled techniques, including monolithic 3D integration, must respect the issue of write endurance. M3D significantly improves upon the energy-delay product, but imposes the additional NUMA scaling constraint. Our single-chip scheme delivers hundreds-of-times-higher endurance, achieving multi-year lifetimes even with a small bit cell write endurance of 10^6. Our multi-chip scheme ensures that the performance-favorable first-touch allocation policy does not result in degraded lifetime. In fact, we see the opposite: the use of multiple chips can enhance overall system lifetime, because they provide more "surface area" over which to wear-level writes. Our combined schemes, with runtime and energy overheads bounded in the single digits, thereby preserve the EDP benefits of monolithic 3D while addressing some of its most challenging limitations.

Fig. 5: LLC and memory controller which support RR, RWE, and frame promotion.

Fig. 15: CDF of node-wide memory read/write bandwidth from NERSC's Cori Haswell nodes over ten days.

Table: Array-level bit cell write endurances.

Table 2: M3D system specification for lifetime analyses.

Table 3: Workloads and their characteristics.

Table 4: Comparison with related work. Our hardware & algorithms enable multiplicative lifetime benefits from the techniques below, with low runtime and energy cost.