Persistent Processor Architecture

This paper presents PPA (Persistent Processor Architecture), simple microarchitectural support for lightweight yet performant whole-system persistence. PPA offers fully transparent crash consistency to all sorts of programs across the entire computing stack, even legacy applications, without any source code change or recompilation. As a basis for crash consistency, PPA leverages so-called store integrity, which preserves store operands during program execution, persists them on impending power failure, and replays the stores when power comes back. In particular, PPA realizes the store integrity in hardware by keeping the operands in the physical register file (PRF) even after the stores are committed. Such store integrity enforcement leads to region-level persistence, i.e., whenever the PRF runs out, PPA starts a new region after ensuring that all stores of the prior region have already been written to persistent memory. To minimize the pipeline stall across regions, PPA writes back the stores of each region asynchronously, overlapping their persistence latency with the execution of other instructions in the region. The experimental results with 41 applications from SPEC CPU2006/2017, SPLASH3, STAMP, WHISPER, and DOE Mini-apps show that PPA incurs only a 2% average run-time overhead and a 0.005% area cost, while the state-of-the-art work suffers a 26% overhead along with prohibitively high hardware and energy costs.


INTRODUCTION
Nonvolatile memory (NVM) technologies such as ReRAM [3,13], 3D XPoint [39], PCM [68,121,125,139], and STT-MRAM [14,47,66,74,114] have emerged as alternatives to DRAM. Thanks to their byte-addressability, high areal density, and in-memory persistence, they are expected to be used as nonvolatile main memory (NVMM)-also known as persistent memory (PMEM). That is, they can transparently replace DRAM to accommodate persistent applications with large memory footprints and obviate the need for serializing data to a block device to survive power failure. However, it is not easy to realize this obvious use case (i.e., transparent NVMM) in practice. For example, while Intel Optane persistent memory (PMEM) [22,62,69,144,151] provides a transparent way to use PMEM called memory mode, where DRAM is used as the last-level cache atop PMEM, the Optane manual states that the PMEM then works as volatile memory [50]. The Optane persistent memory is not persistent at all; this is mainly due to the difficulty of maintaining crash consistency in the memory mode. As a result, under the memory mode, users have no choice but to risk the loss of all PMEM data in case of power failure.
Although PMEM offers app-direct mode, where DRAM is used as main memory and PMEM serves as a persistent heap [50], it pawns off the hard work of persistent programming on users, trading transparency for in-memory persistence. In this partial-system persistence (PSP) model [11,20,21,25,35,75,128,146], users must delineate the part of the code that requires persistence, rewrite the data structures used therein with crash consistency and memory persistency [32] in mind, and often devise application-specific recovery code tailored to the data structures [40,49,70,79,106,127]. Besides, PSP requires dedicated PMEM allocation interfaces such as pmalloc [23], rendering already error-prone persistent programming more complex [29,91-93,96,99,110]. While using transactions [10,20,72,84,140] or failure-atomic sections [11,46,51,85] mitigates the programming complexity, the resulting persistent program is slower than the original one due to the undo/redo logging involving persistence barriers (clwb and sfence on x86).
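To illustrate the burden, the following C sketch shows the kind of manual ordering PSP forces on even a simple persistent list insertion. The allocator pmem_alloc, the persistent root head, and the node layout are hypothetical placeholders rather than any particular library's API; the intrinsics assume an x86 CPU with CLWB support.

    #include <immintrin.h>   /* _mm_clwb, _mm_sfence */
    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical persistent node and allocator; a real PSP library would
     * provide its own interfaces (e.g., a pmalloc-style allocator). */
    typedef struct node { uint64_t key; uint64_t value; struct node *next; } node_t;
    extern node_t *pmem_alloc(size_t size);   /* assumed PMEM allocator        */
    extern node_t **head;                     /* persistent list head in PMEM  */

    void list_insert(uint64_t key, uint64_t value) {
        node_t *n = pmem_alloc(sizeof(*n));
        n->key = key; n->value = value; n->next = *head;
        _mm_clwb(n);                 /* write the new node back to PMEM ...    */
        _mm_sfence();                /* ... and order it before publishing     */
        *head = n;                   /* publish: a single 8-byte store         */
        _mm_clwb(head);
        _mm_sfence();                /* make the head update durable           */
    }

Every persistent update must be hand-ordered this way, which is exactly the transparency and performance cost that WSP seeks to remove.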
Given the limitations of PSP and the demand for transparent use of PMEM without sacrificing in-memory persistence and crash consistency, there is an increasing interest in whole-system persistence (WSP) [57,107], which covers all sorts of applications-rather than being limited to a small set of PSP application domains such as in-memory index structures/databases and key-value stores. That is, WSP is agnostic to program semantics yet capable of recovering any kind of program from power failure no matter when it occurs! One naive approach to WSP is flushing all volatile states (register files, SRAM caches, and DRAM cache) to PMEM when power is about to be cut off. For example, Narayanan et al. [107] propose to use residual energy in an uninterruptible power supply (UPS) and persist all volatile data before impending power failure, which requires a considerable amount of energy to be secured for flushing. In a similar vein, Intel's extended asynchronous DRAM refreshing (eADR) flushes the entire cache contents to PMEM upon power failure using a backup battery. However, eADR also leads to significant energy cost, requiring a bulky supercapacitor of 3,400 mm³ [4]; this situation gets even worse for a deeper cache hierarchy driven by the ever-increasing working sets of data-intensive applications [12,71]. Apart from the inability to persist other volatile states such as registers, eADR cannot guarantee crash consistency for PMEM's memory mode-as it is unaffordable to reserve a sufficient amount of energy for flushing the data of a large DRAM cache to PMEM; typical servers in data centers are equipped with more than 1TB of DRAM. Given all this, it has been practically impossible to achieve WSP on the cheap.
To this end, this paper presents the Persistent Processor architecture (PPA), the first of its kind to realize transparent, lightweight, and performant WSP without recompilation for all programs, embracing legacy software whose source code is unavailable. We found that crash inconsistency is caused by unpersisted stores left behind by power failure and can be corrected by replaying (persisting) them in the wake of the power failure. Suppose the program commits 3 stores (st1; st2; st3) in a row, and due to cache replacement, the youngest st3 is persisted in PMEM before the older ones. Although this violates the program semantics if a power outage occurs while the others are still cached, it is possible to fix the inconsistency by replaying st1 and st2-unpersisted before the outage-when power comes back. We can even relax this for simple hardware implementation, i.e., rather than tracking the (un)persistence of each individual store, PPA instead replays all 3 committed stores and resumes the interrupted program following the last committed instruction in the wake of the outage.
To achieve that, it is essential to preserve the registers of stores (for replay) and of other committed instructions (for resumption of the interrupted program) across power failure. The implication is twofold: (1) PPA should prevent store registers from being overwritten; this is the so-called store integrity [152]. (2) Both store registers and other committed instruction registers must be able to survive a power outage, i.e., PPA should save the registers on the outage-using a tiny capacitor whose energy is orders of magnitude smaller than what eADR requires-for the replay and the resumption in the wake of the outage.
In particular, PPA realizes the store integrity in the core microarchitecture at a low cost. The key insight is that the values of store registers are retained in the corresponding physical registers until they are deallocated. For example, once the architectural register r0 of a store is renamed to a physical register p0, PPA can retrieve r0's value by reading p0 unless it is remapped and overwritten by another instruction. To preserve the physical registers to which the architectural registers of stores are renamed, PPA proposes to delay the deallocation of those physical registers-even after the reorder buffer (ROB) commits the store instructions. Recall that out-of-order cores have far more physical registers than architectural ones to minimize the stalls caused by the lack of physical registers [33]; the physical register file (PRF) tends to be underutilized most of the time since only a part of the instructions in the ROB (30% in our experiments), e.g., loads and ALU operations, define new registers. Prior work also observes this phenomenon, which leads to the advent of simultaneous multi-threading (SMT) [102,103,136-138,156], PRF bank switching [118], and physical register inlining [83]. The takeaway is that due to PRF underutilization, PPA can delay the deallocation of store registers with minimal run-time overhead.
Such register-renaming-based store integrity is a building block of PPA enabling region-level persistence, where store integrity is ensured within each region (epoch) [59] for crash consistency as well as lightweight yet performant WSP. PPA dynamically delineates the regions, performing region-level persistence and physical register reclamation across their boundaries; whenever the PRF runs out, PPA starts a new region (epoch) with a persist barrier, which ensures that the committed stores of the prior region have already been written to PMEM and reclaims the physical registers mapped by those stores. To persist the stores of each region efficiently, PPA uses asynchronous writeback, overlapping them with the execution of other instructions in the region as in prior work [9,54,56,60,111,130]. It turns out that the regions are long enough to fully hide the store persistence latency, thanks to the large PRF of modern out-of-order cores. If any region is interrupted by a power outage, PPA checkpoints minimal architectural states, e.g., a part of the PRF and hardware structures related to register renaming [42]. In the wake of the outage, PPA restores those checkpointed states, replays the committed stores of the interrupted region, and resumes the program from the last commit point before the outage-rather than rolling back to the beginning of the interrupted region-for correct and efficient recovery.
To evaluate PPA, we test it with 41 applications from SPEC CPU2006/2017 [8,43], SPLASH3 [123], STAMP [101], WHISPER [105], and DOE Mini-apps [63,135]. The experimental results show that PPA incurs only an average of 2% run-time overhead compared to the baseline (running original applications in PMEM's memory mode, which lacks in-memory persistence and crash consistency support). In summary, PPA makes the following contributions:
• PPA is the first lightweight yet performant whole-system persistence that introduces minor modifications to the hardware, e.g., 2 registers and 1 queue, and only needs a tiny capacitor storing 21.7 µJ, unlike eADR, which requires a supercapacitor of 550 mJ.
• PPA outperforms the complex state-of-the-art compiler and architecture codesign approach [57] in all aspects, such as run-time performance, energy requirement, and hardware cost.
• PPA treats the underlying cache hierarchy as a black box, thus being suitable for current/future caches with an arbitrary depth of the hierarchy, e.g., CXL (Compute Express Link) based far persistent memory [34,53,61,97,98].
• PPA only incurs an average of 2% run-time overhead and a 0.005% area cost, which we believe paves the way to practical whole-system persistence for all, driving the revival of persistent memory production with its cost-effectiveness.

BACKGROUND AND MOTIVATION

Register Renaming
Register renaming eliminates false register dependences and thus enables more instruction-level parallelism (ILP). To efficiently rename architectural registers, out-of-order processors are equipped with a unified PRF, as in the Alpha 21264 [65], MIPS R10K [38], ARM Cortex-A series out-of-order cores [142], RISC-V SonicBOOM [157], and modern Intel processors from the Pentium 4 onwards [131]. For renaming an instruction, the processor picks a register from a Free List (tracking free physical registers) and maintains the mapping from the architectural register to the physical one in a register alias table (RAT), i.e., any access to the architectural register is redirected to the corresponding physical register by consulting the RAT. Once the ROB retires the instruction, the processor puts the mapping into a commit rename table (CRT) to facilitate exception handling and debugging.
In particular, a physical register can only be reclaimed to the Free List when a later instruction redefining the associated architectural register is retired from the ROB-because the physical register's value can no longer be used thereafter.
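A minimal software model of this baseline rename/commit/reclaim flow is sketched below; the structure sizes and helper names are illustrative, not PPA's, and the sketch assumes a free register is always available at rename.

    #define NUM_ARCH  16      /* architectural registers (illustrative)      */
    #define NUM_PHYS  128     /* physical registers (illustrative)           */

    static int rat[NUM_ARCH];        /* arch -> phys, speculative (RAT)       */
    static int crt[NUM_ARCH];        /* arch -> phys, committed (CRT)         */
    static int free_list[NUM_PHYS];  /* stack of free physical registers      */
    static int free_top;

    /* Rename stage: give the destination a fresh physical register. */
    int rename_dest(int arch) {
        int p = free_list[--free_top];   /* assumes a free register exists    */
        rat[arch] = p;
        return p;
    }

    /* Commit stage: the previous committed mapping of the same architectural
     * register can no longer be read, so a conventional core reclaims it. */
    void commit_dest(int arch, int p) {
        int prev = crt[arch];
        crt[arch] = p;
        free_list[free_top++] = prev;    /* reclaim the overwritten mapping   */
    }

PPA's change, described in Section 3, is precisely in the last step: a masked store register is not reclaimed at commit.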

PSP vs WSP
PSP has been a de facto standard for server-class systems backed with Intel Optane persistent memory (PMEM) to ensure the crash consistency of their user applications.However, this paper argues that PSP is inferior to WSP for 3 reasons: high performance overhead, programming/maintenance burden, and the risk of losing all system-level states upon power failure.
First, the app-direct mode of PMEM cannot take advantage of the deep cache hierarchy despite the ever-increasing data footprint of PSP applications. Our experiment (Section 7.2) indicates that due to the inability to leverage DRAM as a cache, even an ideal PSP design is significantly (up to 2.4x, and 1.39x on average) slower than the memory mode of PMEM for memory-intensive applications. Second, PSP is not transparent and requires programmers either to redesign their data structures with persistence and recoverability in mind-incurring severe bugs during development [29,91-93,96,99,110] and maintenance costs in the future [5,124,130,154]-or to leverage transactions for mitigating the programming burden. Third, PSP can only recover the states of user applications and hence puts operating systems at risk of losing their entire states upon power failure, while WSP like PPA can ensure that the entire system states are consistent across power failure; see Section 5 for details.
Not only does WSP eliminate PSP programming and maintenance costs, but it also makes persistent applications faster with the DRAM cache. Of course, for those using PMEM's memory mode to leverage the deep cache hierarchy, WSP offers them persistence and crash consistency without hurting the transparency and performance. This is particularly beneficial for HPC applications (e.g., Mini-apps) whose states must be saved to storage on a regular basis. We believe that lightweight persistence/recoverability, e.g., PPA, can enable performant application-level resilience-related to one of the nation's exascale challenges [58,113,160]-by obviating the need for expensive periodic global checkpointing to storage.

Region-Level Persistence for WSP
Prior techniques [18,152,161] have recently investigated region-level persistence to provide crash consistency in energy harvesting systems (EHSs) [15,19,86,95], where WSP is the norm. These techniques partition the program into a series of regions (akin to recoverable epochs) whose boundaries serve as recovery points. Either the compiler [18,152,161] or hardware (this work) is responsible for the region formation and the persistence of each region. In particular, each region should ensure that all its stores are persisted before the next region starts so that the program can be recovered by restarting the power-interrupted region when power returns.
However, such a region-level persistence scheme incurs a non-negligible performance overhead, since the program must wait at each region boundary for the preceding region to persist its stores, i.e., pausing until they are all written back to nonvolatile memory (NVM). While prior work leverages ILP to overlap the persistence latency with the execution of other instructions, it still causes significant performance degradation-especially in the presence of a deeper cache hierarchy-because its regions are too short to fully hide the long latency with ILP.

Store Integrity for Performant WSP
The key observation PPA builds upon is that we can safely recover the system states by replaying stores that are potentially unpersisted before a power outage. Although this principle has been investigated and adopted by many prior approaches as a concept of atomic stores by logging them all [9,18,56,84,140], the prior schemes suffer from the problem of doubling NVM stores-known as write amplification. To achieve high-performance WSP, we make another observation that crash inconsistency is essentially caused by the mismatch between the program order of committed stores and the order in which their cache blocks are written back to NVM. To be specific, a younger store might be evicted (persisted) to NVM while the older ones are cached; if power failure happens before their persistence, the NVM state becomes inconsistent across the failure, in which the data of the older stores are lost since they have not been persisted. This finding inspires us to recover the inconsistent NVM state by rewriting only those potentially unpersisted stores to NVM in the wake of the power failure-unlike traditional undo logging that checkpoints all stores. The upshot is that no matter in which order stores happen to have been persisted before the power failure, it is always possible to recover correctly by replaying all committed stores left behind by the failure and resuming from the last commit point. Zeng et al. show that store replaying needs the compiler to prevent store registers from being clobbered by following redefinitions, which requires a special register allocator; they call this store integrity in their energy harvesting work, ReplayCache [152], and use compiler-based region-level persistence to divide the program into a series of regions where store integrity is enforced to guarantee crash consistency. Unfortunately, ReplayCache incurs too much performance overhead (5x average slowdown as shown in Figure 1) when used to achieve WSP for server-class cores; see Table 2. The reason is twofold: (1) ReplayCache's regions are so short (12 instructions per region on average) that they cannot achieve enough ILP to hide the region-level persistence latency through multi-level caches. That is mainly due to the inherent issues of ReplayCache's compiler analyses, e.g., function calls/loops, scarce architectural registers, and energy-aware region splitting for avoiding stagnation [17,18] in EHSs. Hence, the short regions lead to frequent pipeline stalls at each region boundary serving as a persist barrier; (2) ReplayCache inserts a clwb after each store to write it back to NVM, which doubles the instruction count and places high pressure on the store queue, whose overflow stalls the pipeline as well. Unlike ReplayCache, PPA achieves performant WSP for server-class cores, causing only a 2% overhead (Section 7.1).
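The C sketch below approximates the per-store write-back pattern described above, i.e., what a ReplayCache-style compiler would emit; the function and array names are illustrative, not ReplayCache's actual code generation, and the intrinsics assume x86 with CLWB support.

    #include <immintrin.h>

    /* Software store integrity: each stored value stays live in a register
     * until the region's persist barrier, and a clwb follows every store,
     * roughly doubling the store-side instruction count. */
    void region_body(long *a, long *b, long x, long y) {
        a[0] = x;          _mm_clwb(&a[0]);   /* write-back after each store  */
        b[0] = x + y;      _mm_clwb(&b[0]);
        /* region boundary: wait for all write-backs to reach NVM */
        _mm_sfence();                         /* persist barrier ends a short region */
    }

PPA removes both costs: the operands stay in the PRF for free, and the write-backs are issued asynchronously by hardware instead of by extra instructions.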

PPA OVERVIEW
PPA aims to achieve lightweight WSP that works for a deep cache hierarchy, where a DRAM cache is used as in PMEM's memory mode, without sacrificing transparency (i.e., keeping the entire software stack as is and obviating the need for recompilation) or performance. PPA adopts store integrity for crash consistency, but its novel hardware design for the integrity enforcement makes it possible to realize performant WSP at a low cost. In particular, PPA leverages the ample physical registers of out-of-order cores to preserve store registers; it dynamically delineates region (epoch) boundaries whenever physical registers run out. In this way, sufficiently long store-integrity regions serve as the basis for failure recovery, effectively hiding the store persistence latency.
Figure 2 depicts how PPA realizes WSP based on the register renaming of a modern out-of-order core. In the figure, the commit rename table (CRT), register alias table (RAT), and Free List are existing microarchitectural components. The CRT keeps the mapping from an architectural register to a physical register for committed instructions, while the RAT records that for in-flight instructions. The Free List maintains free registers for later renaming. PPA proposes MaskReg, a bit vector, to record which physical registers are used by prior committed stores and therefore should not be remapped (overwritten) by following redefinitions.
In Figure 2, upon renaming the destination architectural register r0 of the addition (i.e., r0 = r0 + 1, marked △), the processor removes a physical register p0 from the Free List and puts the mapping from r0 to p0 into the RAT as usual. Thus, for renaming the following store (○), i.e., st r0, [100], the reference to r0 is replaced by p0. Once the addition instruction commits (▲), making the defined value of r0 architecturally visible, the processor puts the mapping r0 → p0 in the CRT as usual. In particular, on the commit of the store (○), PPA starts to track p0 in MaskReg, watching it for store integrity. When the following redefinition of r0 is renamed (♢), the multiplication instruction obtains p1-not p0, since it is already masked-from the Free List, with the RAT updated accordingly. Additional pipeline details are deferred to Section 3.3.

Dynamic Region Formation
Similar to prior techniques [18,152], PPA also provides region-level persistence. However, what makes PPA stand out from them is its ability to build regions dynamically without user intervention, recompilation, or significant performance loss. PPA instead leverages an existing microarchitectural feature to deliver the region formation with store integrity enforced at a low cost. In particular, PPA considers the number of free physical registers to decide when to place a region boundary (persist barrier). As shown in Figure 2, PPA places the boundary (barrier) when no free physical register is available at the renaming stage of the out-of-order pipeline. Once PPA ensures at each region boundary that the committed stores of the finished region are all persisted, it reclaims their physical registers with MaskReg cleared-before starting the next region, as shown at the bottom left of the figure.

HW-Based Asynchronous Store Persistence
Although prior software-logging-based PSP techniques guarantee a consistent NVM state across power failure, they incur significant performance overhead because of persist barriers (e.g., clwb and sfence in x86). In contrast, PPA does not block the pipeline execution while stores are being persisted to NVM. That is, once the data being stored is merged into the L1 data cache (○ in Figure 2), the L1 data cache controller immediately issues an asynchronous write-back of the resulting dirty cache line to NVM in the background, keeping the pipeline busy with other instruction executions in the meantime.
To ensure all stores prior to the end of a region are already persisted in NVM before committing following instructions, PPA treats every region boundary (the last instruction of each region) as a special persist barrier. Therefore, the core pipeline waits until the acknowledgment that all prior stores of the region have been persisted in NVM is received by the core before entering the next region. While stalling the pipeline can lead to a slowdown due to the wasted cycle time, our experimental results show that our hardware-based store persistence has a minimal impact on the pipeline performance, because the regions are long enough (see Section 7.5) to result in negligible stall cycles at the end of regions (see Section 7.3).

Dynamic Enforcement of Store Integrity
Figure 2 shows how PPA ensures store integrity on the fly during the pipeline execution. Upon retiring st r0, [100] (○ in the figure), whose r0 was renamed to p0, PPA masks p0 in MaskReg to indicate it is occupied by the store, which makes the target register of the following multiplication instruction renamed to p1 (♢) instead of p0. Unlike conventional cores, upon retiring the multiplication (♦ r0 = r0 * 2) and updating the CRT with r0 → p1, PPA does not reclaim the physical register p0, which is associated with r0's prior definition r0 = r0 + 1-even though its value can no longer be used after the retirement of the multiplication overwriting r0. That is because p0 is masked as a committed store register in MaskReg, and it should be preserved in case of power failure so that the store can be replayed in the wake of the failure. In this way, PPA not only guarantees store integrity in each region but also achieves performant WSP with a much longer region size than the compiler-based prior work [152], thus hiding the store persistence latency.

Checkpoint and Recovery Protocol
To achieve correct program execution across a power outage, all the store registers preserved by our register renaming trick must survive power failure. For this reason, PPA should maintain the necessary microarchitectural state, such as the CRT, across the outage. Also, in the wake of power failure, PPA should be able to resume the program right after the last commit point before the outage.
In light of this, PPA exploits just-in-time (JIT) checkpointing to save minimal architectural states-e.g., the physical register p0, the CRT, and the last committed PC as shown in Figure 2 (①)-to designated checkpoint storage in NVM when power is about to be cut off. Owing to its simplicity, PPA only requires a tiny capacitor to secure the energy for JIT checkpointing, while Narayanan et al.'s approach [107] and eADR demand a bulky Li-thin battery or supercapacitor [4] (Section 7.13). When the power comes back, PPA first replays all committed stores left behind by the failure, e.g., st r0, [100] in Figure 2 (②), and restores other checkpointed states such as the CRT (③). Then, PPA resumes the interrupted region from the uncommitted instruction immediately following the last committed PC to continue program execution. More details are deferred to Section 4.5 and Section 4.6.

PPA IMPLEMENTATION DETAILS
Figure 3 shows PPA's microarchitecture with its 3 newly added components: the Last Committed Program Counter (LCPC), the Store Operands Mask Register (MaskReg), and the Committed Store Queue (CSQ). The LCPC register keeps the PC of the last committed instruction so that a power-interrupted program can resume thereafter in the wake of power failure. Note that PPA does not save or recover architectural state related to speculation, such as in-flight instructions in the ROB. The MaskReg comprises as many bits as the PRF size. Each set bit of MaskReg indicates that the corresponding physical register has been used as an operand of a committed store in the current region and thus prevents that physical register from being updated by the following instructions of the region. Finally, the CSQ is a circular FIFO queue for tracking committed stores per region. When a store retires from the ROB, a pair of (1) the index of its source physical register and (2) its destination physical address is inserted at the rear of the CSQ.
Actions of PPA across an Outage: Upon a power outage, PPA has 5 components (shaded in Figure 3) JIT-checkpointed in NVM: the CSQ, LCPC, CRT, MaskReg, and the physical registers tracked by the CSQ/CRT. When power comes back, PPA (1) restores the checkpointed registers, MaskReg, CRT, LCPC, and CSQ from NVM, (2) scans the CSQ entries from front to rear, re-executing the stores committed before the outage, (3) populates the RAT with the restored CRT, and (4) resumes the power-interrupted program right after the LCPC.
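For concreteness, the C declarations below sketch the added per-core state; the sizes follow the defaults reported in Section 7 (348 physical registers, 40 CSQ entries), and the field layout is our own illustration rather than the actual RTL.

    #include <stdint.h>

    #define PRF_SIZE  348    /* physical registers (Table 2)                 */
    #define CSQ_SIZE  40     /* committed store queue entries (Section 7.9)  */

    /* One committed, possibly unpersisted store of the current region. */
    typedef struct {
        uint16_t src_preg;   /* 9-bit index of the source physical register  */
        uint64_t paddr;      /* 48-bit destination physical address          */
    } csq_entry_t;

    typedef struct {
        uint64_t    lcpc;                             /* last committed PC    */
        uint64_t    mask_reg[(PRF_SIZE + 63) / 64];   /* MaskReg bit vector   */
        csq_entry_t csq[CSQ_SIZE];                    /* circular FIFO        */
        uint16_t    csq_head, csq_tail;
    } ppa_state_t;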
Note that once a region is persisted at its boundary, i.e., the pipeline receives an acknowledgment that all the committed stores of the region have been persisted to NVM, PPA clears both the CSQ and the MaskReg-reclaiming the store registers masked therein-before starting the next region. Thanks to the long region size (Section 4.1) and the asynchronous writebacks (Section 4.3), PPA effectively hides the store persistence latency at each region end.
At first glance, forming store-integrity-preserving regions seems easy, i.e., placing a region boundary right before the redefinition of a store register to preserve its value within a region. For example, placing a region boundary (persist barrier) right after a store in Figure 4 ensures store integrity and post-crash consistency but yields short regions because of a write-after-read (WAR) dependence on the store's register. A sophisticated compiler approach might form relatively longer regions by renaming the redefinitions of previous store registers-unless architectural registers run out-as in ReplayCache [152]. However, the prior approach [152] to store integrity still generates short regions due to the limited number of architectural registers, e.g., 16 general-purpose registers in x86. The crux of the problem is that ReplayCache pays persist barrier overheads at the end of each of these short regions. Fortunately, out-of-order cores already have the ability to eliminate WAR dependences with register renaming [42], e.g., renaming the store register to a fresh physical register for the later subtraction in Figure 4. That way, the WAR dependence no longer forces a region boundary.
With the above observation in mind, PPA proposes a minimal change to the instruction pipeline so that it enforces store integrity during the execution of each region. That is, PPA dynamically partitions the program into a series of regions by placing a region boundary, i.e., a persist barrier, upon a pipeline stall at the renaming stage; if a register-defining instruction cannot be renamed due to the lack of a free physical register in the Free List, PPA injects a persist barrier right before the instruction (see Figure 6 for details). Once the pipeline retires the persist barrier at the boundary of a region, i.e., all its stores are guaranteed to have been persisted, PPA signals the renaming stage to reclaim the physical registers masked by MaskReg, clears it, and resumes the pipeline to start the next region.

Enforcing Store Integrity Efficiently
Figure 6 shows a step-by-step example of how to perform dynamic region formation while preserving the registers of stores. We assume a 4-bit MaskReg for a total of 4 physical registers p1-p4. Initially, MaskReg is empty, p1-p3 are occupied by previous definitions of registers r1-r3, and the free list contains only p4. When renaming r1 of the addition instruction I1 at step ①, PPA maps r1 to the only free physical register p4 and updates the RAT and the free list accordingly. Then, at step ②, when renaming the store instruction I2, the references to architectural registers are replaced by physical registers as usual. At the same time, the pipeline retires I1, deallocating p1 associated with r1's previous definition (not shown in the figure) and updating the CRT with r1 → p4. At step ③, the pipeline renames r2 of I3 to p1, with the RAT and the free list updated accordingly, and commits the store I2, setting the bits of p2-p4 in MaskReg (while MaskReg could record all operand registers of each store, it can be optimized to keep only the data register; see Section 4.6) and populating a CSQ entry at the rear position for the committed store I2.
In particular, at step ④, where the pipeline commits I3 and renames I4, PPA takes different actions from a traditional out-of-order pipeline, which would allow p2 to be remapped. PPA does not deallocate the physical register p2 associated with r2's previous definition (not shown in the figure), even though I3 commits, redefining r2. This is because p2 is masked in MaskReg as a store operand. However, at this moment, there is no physical register in the free list, which makes PPA fail to rename the register r1 of I4. Thus, PPA injects a persist barrier as a region boundary right before instruction I4. Once the barrier retires, PPA reclaims all masked physical registers p2-p4 to the free list, clears MaskReg, and starts a new region, allowing them to be reused therein.
Full CSQ as an Implicit Region Boundary: If the CSQ becomes full, PPA cannot accommodate more stores and would be unable to replay them for power failure recovery. Thus, PPA treats this event as a virtual region boundary where it waits for all prior stores to be persisted. Once the core receives the acknowledgment that all prior stores are persisted, PPA starts a new region with the CSQ and MaskReg cleared. Our experiment (Section 7.9) shows that a 40-entry CSQ rarely overflows, thus incurring a minimal performance impact.
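The C sketch below summarizes how the rename and commit stages described above interact with MaskReg, the CSQ, and the persist barrier. It is a software model under our own naming, not the actual pipeline logic, and all helper functions are assumed to exist.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_PHYS 348                      /* PRF size (Table 2)            */
    extern int  free_list[NUM_PHYS], free_top;
    extern int  crt[];                        /* committed arch -> phys map    */
    extern bool mask[NUM_PHYS];               /* MaskReg                       */
    extern bool still_mapped_in_crt(int p);
    extern int  pick_free_reg(void);          /* pops one entry off free list  */
    extern void csq_push(int src_preg, uint64_t paddr);
    extern void csq_clear(void);
    extern void wait_until_region_persisted(void);  /* counter == 0, Sec. 4.3  */

    /* Rename stage: inject a persist barrier (region boundary) when the free
     * list is empty, then reclaim the masked store registers of the region. */
    int ppa_rename_dest(int arch) {
        if (free_top == 0) {
            wait_until_region_persisted();
            for (int p = 0; p < NUM_PHYS; p++)
                if (mask[p]) {
                    mask[p] = false;
                    if (!still_mapped_in_crt(p))   /* keep live CRT mappings   */
                        free_list[free_top++] = p;
                }
            csq_clear();                           /* a new region starts here */
        }
        return pick_free_reg();                    /* conventional renaming    */
    }

    /* Commit of a store: preserve its operand register and log it for replay. */
    void ppa_commit_store(int src_preg, uint64_t paddr) {
        mask[src_preg] = true;
        csq_push(src_preg, paddr);
    }

    /* Commit of a register-defining instruction: unlike a conventional core,
     * the overwritten mapping is NOT reclaimed if it is masked as a store
     * operand of the current region. */
    void ppa_commit_dest(int arch, int new_preg) {
        int prev = crt[arch];
        crt[arch] = new_preg;
        if (!mask[prev])
            free_list[free_top++] = prev;
    }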

Region-Level Asynchronous Persistence
In addition to preserving store registers for their integrity in each region, PPA also ensures that a region's stores are persisted to NVM before moving on to the next region. Instead of leveraging the cache-line write-back instruction (clwb in x86), which has a number of drawbacks as shown in Table 1-e.g., occupying a store queue entry for each store, requiring inter-core snooping, and being unable to flush data from the core to NVM main memory through the DRAM cache above it-PPA leverages asynchronous store write-back, which effectively takes persistence off the critical path [9,54,56,60,111,130]. That is, when the data being stored is merged into the L1 data cache after the cache coherence transactions are completed, an asynchronous store persistence operation is generated in the write buffer (WB) of the L1 data cache-a buffer that already exists in Intel processors between the L1D and L2 caches for buffering dirty cache-line evictions-and then issued by its controller. The implication is twofold: (1) the store persistence happens in the background while the core continues the execution of following instructions, achieving ILP; (2) once a store persistence operation is issued, all other cores already have up-to-date memory data.
Unlike clwb, i.e., a cache-line write-back instruction that occupies a store buffer entry, PPA uses a counter register in the L1 data cache controller to record the number of stores being persisted, rather than tracking each individual store; the counter increases for each store performed and decreases every time the controller receives the acknowledgment of a write-back completion. In particular, to lower the write traffic toward NVM, PPA performs persist coalescing [130] on the WB for the data being persisted. That is, a younger store being persisted is merged with an older unpersisted one to the same address sitting in the WB. This is correct because persist barriers ensure that the WB's to-be-persisted data are from the same region, and the stores of the following regions have not been performed yet.
When the counter hits zero, the controller tells the core that all prior stores in the region are persisted to NVM, allowing both the CSQ and MaskReg to be cleared. In this way, PPA determines whether the pipeline needs to stall at the region boundary by simply comparing the counter with zero. Although such a stall might slow down the pipeline by waiting for the counter to reach zero at each region boundary, it turns out that the performance impact is not significant. The reason is that the region-level persistence latency is fully overlapped with the execution of other instructions in the long regions dynamically formed by PPA (Section 7.3). Moreover, PPA's asynchronous store write-back does not generate coherence traffic-since each core is responsible for its own write-backs-thus reducing the persistence latency further.
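A small C model of the L1D-side bookkeeping may make the counter and persist-coalescing behavior concrete; the buffer size and function names are illustrative assumptions, not the actual controller design.

    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 16
    typedef struct { uint64_t line_addr; bool valid; } wb_entry_t;

    static wb_entry_t wb[WB_ENTRIES];     /* existing write buffer (WB)        */
    static uint32_t   pending_persists;   /* counter of in-flight write-backs  */

    void on_store_merged_into_l1d(uint64_t line_addr) {
        for (int i = 0; i < WB_ENTRIES; i++)        /* persist coalescing:     */
            if (wb[i].valid && wb[i].line_addr == line_addr)
                return;                             /* same-region line already queued */
        for (int i = 0; i < WB_ENTRIES; i++)        /* allocate a WB entry and */
            if (!wb[i].valid) {                     /* issue the write-back    */
                wb[i].valid = true;
                wb[i].line_addr = line_addr;
                pending_persists++;
                return;
            }
        /* a full WB would back-pressure the pipeline (not modeled here) */
    }

    void on_writeback_ack(uint64_t line_addr) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].valid && wb[i].line_addr == line_addr) { wb[i].valid = false; break; }
        pending_persists--;
    }

    /* Region boundary (persist barrier): the core may enter the next region
     * only when every store of the finished region has reached NVM. */
    bool region_fully_persisted(void) { return pending_persists == 0; }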

Lightweight Hardware for Recovery
To achieve highly energy-efficient checkpointing and recovery, PPA checkpoints only the essential architectural state upon power failure, e.g., the part of the physical registers used by committed stores or linked with committed instructions in the interrupted region, the committed stores of the region, the CRT, and the program counter (PC) of the latest committed instruction. By checkpointing such minimal states, we can still restore a consistent memory state by re-executing those stores and then resume the program execution following the latest committed instruction.
To facilitate this, PPA proposes a simple yet hardware-efficient FIFO queue called the committed store queue (CSQ) and the Last Committed PC (LCPC). Each CSQ entry keeps the source physical register index and the destination (physical) address of a committed store in program order, and the LCPC gets updated with the PC value after committing an instruction. Note that the CSQ and LCPC do not affect the existing pipeline's timing logic at all because they are off the critical path. More importantly, the CSQ is organized with a read/write port, eliminating an expensive CAM structure and making a large CSQ realistic-though we only need at most 40 CSQ entries, as shown in Section 7.9. During normal program execution, the port is used to populate a CSQ entry at its rear position and to checkpoint the entire CSQ to NVM upon power failure. Finally, PPA clears the CSQ at each region boundary, as MaskReg is emptied, i.e., when all committed stores in the finished region become persistent in NVM, before moving on to the next region.

Just-In-Time (JIT) Checkpointing on Power Failure
To ensure correct program recovery across power failure, PPA should checkpoint the necessary states when power is about to be cut off. Figure 7 shows how such just-in-time checkpointing works with its circuitry implementation; upon the delivery of the Power_Fail signal, PPA saves the contents of its 5 structures to NVM, i.e., MaskReg, the commit rename table (CRT), the committed store queue (CSQ), a part of the PRF, and the last committed PC (LCPC). Note that PPA only checkpoints those physical registers marked by CRT or CSQ entries, in that neither free physical registers nor registers defined by uncommitted instructions (as illustrated in the figure) affect correct program recovery. Similarly, PPA does not have to checkpoint any other state of in-flight instructions, e.g., their RAT and ROB entries. This is because PPA can resume the execution of the power-interrupted program from the uncommitted instruction immediately following the LCPC when power comes back.
As with prior work on JIT checkpointing [36,94,95,120,126,133,143] developed for energy-harvesting systems [6,16,19,152] to realize power failure recovery, PPA implements a controller that governs checkpointing and recovery operations according to the signals delivered on power failure and wake-up. As shown in the middle of Figure 7, the controller consists of 3 components: (1) a Control Finite State Machine (FSM), (2) a Source Index Generator (SIG), and (3) an NVM Address Generator (NAG). The FSM is responsible for generating control signals to checkpoint PPA's 5 structures, i.e., MaskReg, CRT, CSQ, PRF, and LCPC, into their storage in NVM. During the checkpointing process, the FSM triggers the SIG and NAG, which share the same logic-shown at the bottom right of the figure-summing the inputs Base and Offset to determine (1) what is to be checkpointed and (2) where to save it in NVM, respectively.
It is worth noting that PPA activates its checkpointing controller only on power failure, and therefore it is off the critical path as long as power is on, i.e., PPA does not have to optimize the controller's circuitry for latency. This allows PPA to keep the controller's hardware design simple by sequentially checkpointing PPA's 5 structures one entry at a time. To illustrate, as shown at the bottom left of Figure 7, the FSM is triggered upon Power_Fail to transition from the Idle state to the Stop_Pipeline state, where PPA stops the core pipeline to preserve the contents of the 5 structures. Then, the FSM moves to the Read state, raising the read signal Core_Rd on the control path so that the entry indexed by the SIG can be read from each of the 5 structures, across which Base and Offset are properly updated. Upon the delivery of the Read_Finish signal, the FSM enters the Write state, enabling the write signal NVM_Wr to write the data to the NVM address generated by the NAG. Once the writing is done, the FSM either goes back to the Read state or exits to Idle if Ckpt_All is asserted, i.e., all 5 structures have been completely checkpointed.
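The loop below is a compact software rendering of that FSM walk; the states appear as comments, and read_entry, nvm_write8, and entries_of are assumed helper functions standing in for the SIG/NAG datapath, not the synthesized logic itself.

    #include <stdint.h>

    extern uint64_t read_entry(int structure, int offset);     /* SIG-indexed read   */
    extern void     nvm_write8(uint64_t nvm_addr, uint64_t v); /* NAG-addressed write*/
    extern int      entries_of(int structure);

    void jit_checkpoint(uint64_t ckpt_base) {
        /* Power_Fail: Idle -> Stop_Pipeline (the core pipeline is frozen).   */
        uint64_t nvm_addr = ckpt_base;
        for (int s = 0; s < 5; s++)              /* MaskReg, CRT, CSQ, PRF, LCPC */
            for (int off = 0; off < entries_of(s); off++) {
                uint64_t v = read_entry(s, off); /* Read:  Core_Rd with SIG index */
                nvm_write8(nvm_addr, v);         /* Write: NVM_Wr at NAG address  */
                nvm_addr += 8;                   /* 8-byte granularity            */
            }
        /* Ckpt_All: all 5 structures saved; the FSM returns to Idle.          */
    }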
To realize the above sequential checkpointing while maintaining low hardware complexity, PPA exploits the existing non-temporal path [24] in x86 processors to deliver data to NVM, rather than introducing a new data path. This means that PPA checkpoints its 5 structures at an 8-byte granularity, matching their entry size (Section 7.12). Likewise, the FSM reads the PRF and CRT at an 8-byte granularity, which is possible given that they are implemented with SRAM [38,132,157]. The takeaway is that the aforementioned JIT-checkpointing logic is lightweight, i.e., a few hundred logic gates, keeping the overall hardware cost of PPA minimal (Section 7.12).

Power Failure Recovery Protocol
To achieve correct program recovery in the wake of power failure, PPA restores MaskReg, the CRT, and the checkpointed physical registers by reloading their data from NVM as the inverse of the JIT checkpointing. PPA then re-executes the potentially unpersisted stores by reading the CSQ entries checkpointed in NVM. To be specific, for each CSQ entry, PPA obtains the destination address and the data value-by indexing the restored PRF with the checkpointed physical register index-and writes the data value to the target address. Finally, PPA resets the PC to the instruction following the LCPC to continue the program execution.
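In C, the recovery path amounts to the following sketch; every helper (restore_from_checkpoint, nvm_store, populate_rat_from_crt, resume_after) is an assumed placeholder for the corresponding hardware action.

    #include <stdint.h>

    typedef struct { uint16_t src_preg; uint64_t paddr; } csq_entry_t;
    #define CSQ_SIZE 40

    extern uint64_t    prf[];                 /* restored physical register file */
    extern csq_entry_t csq[CSQ_SIZE];         /* restored committed store queue  */
    extern int         csq_front, csq_count;
    extern uint64_t    lcpc;
    extern void restore_from_checkpoint(void);   /* MaskReg, CRT, PRF, CSQ, LCPC */
    extern void nvm_store(uint64_t paddr, uint64_t data);
    extern void populate_rat_from_crt(void);
    extern void resume_after(uint64_t pc);       /* fetch restarts after this PC */

    void ppa_recover(void) {
        restore_from_checkpoint();
        for (int i = 0; i < csq_count; i++) {       /* front-to-rear replay      */
            csq_entry_t *e = &csq[(csq_front + i) % CSQ_SIZE];
            nvm_store(e->paddr, prf[e->src_preg]);  /* rewrite possibly lost store */
        }
        populate_rat_from_crt();
        resume_after(lcpc);                         /* continue past the last commit */
    }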

INTERACTION WITH OS
This section describes how PPA interacts with the rest of the computing stack, such as the operating system (OS), to enable system-level crash consistency.
Handling I/O Operations: To the best of our knowledge, supporting irrevocable operations such as I/O remains an open problem. PPA can be extended with a battery-backed buffer for crash-consistent I/O operations. In this way, PPA considers any store to the buffer as persisted.
Context Switching: PPA treats context switching as is, without any special consideration. In particular, PPA does not differentiate between kernel code and user programs thanks to the benefits of WSP. While keeping context switching as is, PPA still guarantees correct process (de)scheduling and resumption. That is because PPA ensures that the architectural states, e.g., stores and architectural registers, of a descheduled process are crash-consistent by following PPA's JIT checkpoint and recovery protocol. That being said, PPA might have an indirect impact on performance if a region boundary is introduced during context switching. In reality, such a case rarely occurs because PPA forms reasonably long regions (see Section 7.5), keeping the frequency of encountering region boundaries low. Even if the case occurs, i.e., the PRF runs out in the middle of the context switch and the resulting region boundary incurs the region-level persistence overhead, PPA can still minimize the stall cycles at the boundary by leveraging the asynchronous store persistence, e.g., only a few stall cycles occur on average (see Section 7.3). It turns out that they are negligible compared to the typical context switching overhead (e.g., 5-20 µs) [134,141,145]. Consequently, context-switching performance is practically unchanged with PPA.
Interrupt Handling and System Calls: PPA requires no special treatment for interrupt handling (we use the term interrupt to describe both software exceptions and hardware interrupts) and system calls-which rely on trap instructions (syscall in x86_64)-for the same reason as above. That is, PPA guarantees that any architectural state is consistent across power failure. As such, PPA can resume interrupt handlers and system calls exactly from the power failure point without rollback. For an interrupt handler that encounters a power failure in the middle of its execution, PPA can recover all committed but unpersisted stores and architectural registers and resume the handler from the last commit point in the wake of the failure.

DISCUSSION
Recovery for Multi-Cores: To guarantee correct recovery for multi-threaded applications on multi-core processors, we assume data-race-free (DRF) applications as required from C/C++11 onward. DRF implies that conflicting accesses must be explicitly ordered by a synchronization primitive, e.g., serializing them in a lock-protected critical section or leveraging an RMW (read-modify-write) instruction. PPA treats all synchronization primitives, including atomics and fences, as region boundaries so that their actions comply with PPA's original recovery protocol in case of power failure; a synchronization primitive running on a core cannot be committed until all stores of its region are guaranteed to have been persisted to NVM with the CSQ of the core emptied. For example, the data stored before a lock release can exist in the CSQ of at most one core. The implication is twofold: (1) there cannot be multiple pending stores to the same address in the CSQs of different cores due to the absence of data races; (2) thus, we may replay the stores in the cores' CSQs in an arbitrary order, which still achieves correct recovery-because each core's CSQ entries are disjoint from any other core's CSQ entries. That is, PPA can restore consistent NVM states of DRF applications-though it lets each core perform the recovery protocol (Section 4.6) individually-without maintaining a recovery order among the cores.
Memory Consistency Model: Although PPA is evaluated with the x86 ISA (total store ordering), it works well for other consistency models, e.g., relaxed memory ordering (RMO) in ARM and RISC-V, because PPA leaves the load/store unit (LSQ) as is by proposing a tiny CSQ. One might think of gating retired stores in the store buffer (SB) without merging them into the L1 cache as an alternative. However, this complicates the hardware design and limits the performance optimizations of RMO for 3 reasons: (1) region-level persistence prohibits inter-region store coalescing and out-of-order store write-back from the SB to the L1 data cache; (2) it is hard to enlarge the SB for hiding long memory latency, because the SB's CAM search structure is expensive and it must provide data within the L1-hit time, which would otherwise complicate the scheduling of loads with variable latency; (3) the data being stored would exist in both the SB and the PRF, wasting energy by checkpointing the same data twice.
In-Order Cores and ROB-Style Register Renaming: Our design can easily be extended to provide WSP for both kinds of cores by accommodating data values (rather than indexes into the PRF as in the current PPA) and destination addresses of committed stores in the CSQ as usual. Across power failure, the CSQ entries can be checkpointed and then restored to recover an inconsistent NVM state via replaying.
Multiple Memory Controller (MC) Support: PPA naturally supports multiple memory controllers without any hassle. This is because PPA only moves on to the next region once all stores of the prior region are persisted in NVM with the help of region-level persistence (Section 4.3); this makes it impossible to persist a younger store (in program order) destined to a near MC before an older one to a far MC, if the two stores are separated into different regions. Even if the stores exist in the same region and a power failure exposes a possible ordering violation, PPA replays them all together with the other stores of the power-interrupted region in the wake of the failure. Consequently, either way, PPA prevents crash inconsistency from occurring in the presence of multiple MCs.

EVALUATION AND ANALYSIS
All programs are compiled with the -O3 flag and are statically linked. We use the Clang/LLVM 13.0.1 compiler [76,77] to build the baseline binaries with default compilation flags. We implement the same ReplayCache region formation in the same compiler to build store-integrity binaries, disabling ReplayCache's energy-aware region splitting to enlarge the region size as much as possible. We use the cycle-accurate simulator gem5 [7] to model an 8-core (one thread per hardware core) x86_64 Skylake-X processor with two integrated memory controllers, each of which manages a DRAM as an off-chip direct-mapped cache, as in PMEM's memory mode. Table 2 shows the details of the microarchitectural parameters.
The multi-threaded benchmark configurations in Table 3 include (description; threads, operations, memory footprint):
PC [73]: update in hash table; 8 threads, 100000 ops, 196MB
RB [73]: insert/delete nodes in a red-black tree; 8 threads, 100000 ops, 166MB
SPS [73]: swap random entries of an array; 8 threads, 200000 ops, 264MB
TATP [73]: update_location transaction; 8 threads, 100000 ops, 287MB
TPCC [73]: add_new_order transaction; 8 threads, 100000 ops, 110MB
r20w80 [100]: Memcached with 20% reads and 80% writes; -m 1000 -t 8, 189MB
r50w50 [100]: Memcached with 50% reads and 50% writes; -m 1000 -t 8, 189MB
We simulate the entire SPLASH3/STAMP/WHISPER programs in the full system (FS) mode of gem5 with 8 cores by default. To stress the memory system and demonstrate the benefits of enabling DRAM as a cache, we use reference inputs to simulate the SPEC CPU applications and the data inputs specified in Table 3 for Mini-apps and WHISPER. Additionally, we modify the source code of the WHISPER applications to increase the key/value sizes, keeping their data footprint large enough; see Table 3. Similarly, we follow the prior work [105] using Memcached 1.6.18 [100] as a server and memaslap from libMemcached 1.0.18 [2] as a client to initiate 8 threads sending 10000 requests to the server. For each memaslap request, we test two ratios of read-to-write operations: 20/80 and 50/50, for write-intensive and read-intensive workloads, respectively. In particular, we set the key and value sizes of Memcached to 64 bytes and 1KB, respectively. We follow the same methodology as prior work [27,28,80,89,122,129,153] to fast-forward the first 5 billion instructions and then simulate the next 1 billion instructions with a detailed CPU model. As a comparison, Figure 8 presents the run-time overheads of PPA and the state-of-the-art WSP, Capri [57], which incurs high hardware costs due to its separate FIFO persist path between the core and NVM and its complex undo+redo logging structures; see Table 6 for the comparison. To be practical, we set the persist path bandwidth of Capri to 4GB/s instead of its originally assumed 32GB/s (we obtained Capri's source code and found that its default persist path bandwidth is 32GB/s). PPA incurs an average of 2% overhead, while Capri incurs a 26% overhead due to its regions being 11x shorter than PPA's; see Section 7.5. Note that PPA only incurs a slightly higher overhead for rb of WHISPER due to its relatively higher write traffic toward NVM, as confirmed in Figure 15 and Figure 18.

Run-time Overhead Analysis
We also compare PPA and PMEM's memory mode to a DRAM-only system with 32GB of DRAM. Figure 9 depicts that PPA and the memory mode are 16% and 14% slower, respectively, than the system with only a 32GB DDR4 DRAM. The results are encouraging in that PPA's cost of making the DRAM-only system persistent is comparable to the run-time overhead of PMEM's memory mode, which does not offer persistence. In particular, lbm and pc incur 44% and 58% overheads, respectively. That is because they have poor locality, and thus the DRAM cache only lengthens the critical path of their memory accesses with a lot of misses. To demonstrate the benefits of enabling DRAM as a cache for the applications with high L2 miss rates (ranging from 18% to 100%), we compare PPA to an optimized version of BBB [4] whose performance is close to that of eADR, representing the upper-bound performance of a PSP scheme. Figure 10 shows that PPA incurs only an average of 3% run-time overhead for these programs, while BBB/eADR slows down the programs by 1.39x on average and up to 2.4x for libquantum. Notably, PPA underperforms BBB/eADR slightly for rb. The reason is twofold: (1) PPA leads to higher contention in the WPQ (Section 7.7) due to the store persistence; (2) rb exhibits high locality (4% L2 miss rate) and thus has less write traffic toward NVM for the baseline. Figure 11 shows the average ratio of the stall cycles occurring at the end of each region to the execution cycles of that region. Thanks to the sufficiently long region size (i.e., high ILP for hiding the store persistence latency), PPA only increases the stall cycle ratio of the baseline (PMEM's memory mode) by 0.21% on average, showcasing why PPA incurs a low run-time overhead, i.e., 2% on average. Figure 11 also shows why PPA incurs a relatively higher overhead for water-ns and water-sp; as shown in the figure, these two applications have more stall cycles, i.e., 6.1% and 8.1%, respectively, due to their shorter regions and more stores therein (see Figure 13).

Impact on PRF
For both the baseline (PMEM's memory mode) and PPA, we measure the number of stall cycles due to the lack of physical registers in the renaming stage of the simulated core. Figure 12 highlights that PPA incurs negligible extra stall cycles (0.07%) on average compared to the baseline. The reason is twofold: (1) the core pipeline stall caused by running out of free registers rarely occurs due to the sufficient amount of free registers (see Figure 5); (2) even when the stall happens, PPA tends to spend minimal cycles at the end of regions (see Figure 11) and thus quickly deallocates the reserved registers for later use. To demonstrate why PPA incurs such a low run-time overhead, we also measure the number of stores and other instructions in each region. As shown in Figure 13, each region has 301 other and 18 store instructions on average thanks to the abundant free registers, while Capri's average region size is only 29. As a result, PPA has enough room to keep the pipeline busy while asynchronously persisting the stored data to NVM without waiting at each region boundary. Note that some applications, e.g., bzip2 and libquantum, have smaller region sizes due to their heavy register usage.

Sensitivity to Deeper Cache Hierarchy
To evaluate the sensitivity to a deeper cache hierarchy, i.e., 3 levels of SRAM caches atop the DRAM cache, we add a set-associative L3 cache with a 44-cycle hit latency to both PPA and the baseline (PMEM's memory mode). We also alter the existing shared L2 cache in Figure 2 to a private 1MB L2 with a 14-cycle hit latency. Figure 14 shows that PPA incurs a negligible overhead (1%) even when the L3 cache is used atop the DRAM cache, thanks to PPA's sufficiently long regions (see Section 7.5) that can cover the extended store persistence latency through the hierarchy. To see the impact of the NVM write pending queue (WPQ) on the performance of PPA, we vary the WPQ size from 8 to 24 for memory-intensive applications of CPU2006/Mini-apps and the multi-threaded applications. As shown in Figure 15, PPA still incurs a low overhead (8%) even when the WPQ size decreases to 8. This is because many applications exhibit high L2 write miss rates, indicating already high pressure on the WPQ for the baseline. As such, the negative effect of the extra write traffic caused by PPA's store write-back is amortized. Note that PPA incurs a higher overhead for some applications, e.g., rb, water-ns, and water-sp, when setting the WPQ size to 8. The reason is twofold: (1) they have low L2 miss rates, indicating low execution time for the baseline; (2) the store write-back leads to high pressure on the WPQ due to the extra write traffic it generates. Fortunately, the extra write traffic can be absorbed by enlarging the WPQ size to the default (16).

Sensitivity to PRF Size
As shown in Figure 16, PPA incurs less overhead with a larger PRF. Note that even with the smallest PRF size of 80/80, PPA still forms sufficiently long regions and thus incurs an average of only 12% overhead owing to the PRF's underutilization. Interestingly, the benefit of a large PRF diminishes once its size increases beyond the default. This is because the default PRF setting already has a sufficient number of free registers to form long regions covering the persistence latency. Notably, with a PRF size of 80/80, PPA incurs about 30% run-time overhead for some programs, e.g., hmmer, lbm, lu-cg, and tpcc, since (1) PPA requires at least 65/68 integer/floating-point registers for their normal execution, and (2) the programs have intensive memory writes, ending up putting high pressure on the PRF. To investigate the proper size of the CSQ, we vary the CSQ size from 10 to 50. As shown in Figure 17, the CSQ size has a minimal impact on PPA's performance since there are on average only 18 stores in each region (see Figure 13). In light of this, we set the CSQ size to 40 by default such that the core pipeline encounters as few pipeline stalls as possible caused by CSQ overflow; it is cheap to enlarge the CSQ to 40 entries because of its simple structure. To show how the PMEM write bandwidth affects PPA's performance, we vary the NVM write bandwidth from 1GB/s to 6GB/s for memory-intensive CPU2006/Mini-apps, SPLASH3, and WHISPER benchmarks. To be practical, PPA sets the default bandwidth to 2.3GB/s according to an empirical Intel PMEM analysis [148]. As shown in Figure 18, PPA still incurs an average of only 7% overhead even with 1GB/s write bandwidth. Once the write bandwidth goes beyond the default, PPA keeps its performance overhead as low as 2% thanks to the long regions hiding the potential pipeline stalls upon a full WPQ. It is worth noting that PPA incurs a relatively higher overhead for SPLASH and WHISPER programs with 1GB/s bandwidth. This is because different threads of these multi-threaded applications always compete for the shared WPQ, and the lower bandwidth exacerbates the competition. Note that some applications, e.g., water-ns, water-sp, and rb, are more sensitive to the write bandwidth due to their inherently low memory write-back traffic in the baseline (i.e., they exhibit high locality). To study the impact of PPA on cache coherence, we vary the thread count and scale up the NVM WPQ/shared L2 size proportionally. Figure 19 shows that the resulting performance impact is quite small; PPA still maintains high performance, i.e., an average of 2%-6% overhead for 8-64 threads. PPA incurs slightly higher overheads for water-ns, water-sp, and Memcached (r20w80) with more threads due to the increasing stall cycles taken for thread synchronization.

Hardware Cost Analysis
PPA introduces a 64-bit LCPC register, a 348-bit vector register MaskReg with one bit per physical register (PRF size 348), and a 40-entry CSQ. Each CSQ entry records a pair of a 9-bit (⌈log2 348⌉) index to a physical register and a 48-bit physical address. To facilitate JIT checkpointing, we round the entries of PPA's proposed structures up to the nearest multiple of 8 bytes, so that each entry is 8 bytes. We then use these numbers to calculate their hardware overheads (see Table 4). We use CACTI 7.0 [104] to estimate the hardware cost of PPA's proposed hardware structures with a 22 nm process technology node. Table 4 showcases PPA's low hardware costs in terms of chip area, access latency, and power consumption. In summary, PPA's proposed hardware structures occupy only 0.005% of the chip area of an Intel Xeon server core (11.85 mm² after excluding its shared L2 cache); the core area is calculated with McPAT [81].
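For concreteness, the following sketch tallies the raw storage implied by the numbers above and the 8-byte rounding; it is purely illustrative, and the authoritative area, latency, and power figures are the CACTI results in Table 4.

import math

# Tally of the raw bits added by PPA's structures, per the description above.
PRF_SIZE    = 348
LCPC_BITS   = 64
MASK_BITS   = PRF_SIZE                          # 1 bit per physical register
CSQ_ENTRIES = 40
IDX_BITS    = math.ceil(math.log2(PRF_SIZE))    # 9-bit index into the PRF
PA_BITS     = 48                                # physical address width

csq_entry_bits = IDX_BITS + PA_BITS             # 57 bits, padded to 8 bytes
total_raw_bits = LCPC_BITS + MASK_BITS + CSQ_ENTRIES * csq_entry_bits
print(IDX_BITS, csq_entry_bits, total_raw_bits) # 9 57 2692 (~337 raw bytes)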

Energy and Latency for JIT Checkpointing
Upon an impending power loss, PPA checkpoints the CSQ, LCPC, CRT, MaskReg, and the part of the PRF marked by entries of the CSQ or CRT into NVM. We assume 16 architectural integer registers and 32 architectural floating-point registers. Therefore, we need to checkpoint at most 88 physical registers (40 in the CSQ and 48 in the CRT).

Energy Consumption: We assume that the checkpointed hardware structures are based on SRAM. To estimate the energy consumption, we leverage prior work [4,109,117], which measures the energy cost per memory operation with an external power meter while executing carefully designed microbenchmarks. These microbenchmarks observe the energy consumption of only the data movement between core and memory and minimize the impact of other architectural optimizations and non-memory operations. It turns out that 11.839 nJ/byte is necessary for accessing data in SRAM cells and moving it from the core to NVM. Therefore, we need to secure 21.7 μJ to JIT-checkpoint 1838 bytes of data, considering the worst case that each physical register holds 128-bit data. In contrast, the ideal PSP scheme BBB [4] and Intel's eADR require a supercapacitor securing 775 μJ and 550 mJ, which are 36.5x and 25943x larger than ours, respectively. We leverage the prior work [4] to calculate the required size of a supercapacitor [162] or Li-thin battery [119]; these two techniques have energy densities of 10⁻⁴ mWh/mm³ and 10⁻² mWh/mm³, respectively. Table 5 shows that PPA needs a 0.06 mm³ supercapacitor or a 0.0006 mm³ Li-thin battery, which occupies 0.5%/0.0005% of an Intel server core (11.85 mm²), respectively.

Checkpointing Time: PPA's JIT-checkpointing controller can persist 8 bytes of data per cycle thanks to its simple structure (see Section 4.5). According to our RTL synthesis results with TSMC 22 nm technology, the controller requires only 144 D flip-flops and 88 two-input logic gates. The controller thus takes 114.9 ns to read the 1838 bytes of data. Given the write bandwidth (2.3GB/s) of PMEM [52] in our simulations, PPA needs only 0.91 μs to flush the 1838 bytes of data to PMEM upon power failure.

Comparison of Energy Consumption: We calculate the energy consumption of a single core equipped with the WSP scheme Capri or the PSP scheme LightPC [78] to highlight PPA's low energy requirement. Upon power failure, Capri flushes the data in its battery-backed redo buffers (54KB per core) to NVM at 11.839 nJ per byte [4], costing 0.6 mJ per core. Likewise, LightPC flushes the volatile data of only user processes in the architectural registers (4224 bytes of 16 GPRs and 32 XMM registers), the L1D cache (64KB), and the L2 cache (16MB) all the way to NVM, leading to a high energy consumption of 189 mJ; LightPC uses PCM as main memory.
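The headline figures above follow from simple arithmetic, reproduced in the sketch below; the 2GHz clock used to convert the controller's 8 bytes/cycle into time is inferred from the 114.9 ns figure rather than stated explicitly in the text.

# Reproduce the JIT-checkpointing arithmetic; the 2GHz clock is inferred, not stated.
BYTES        = 1838          # worst-case checkpoint footprint
ENERGY_PER_B = 11.839e-9     # J per byte, SRAM read + move to NVM [4,109,117]
WRITE_BW     = 2.3e9         # PMEM write bandwidth (bytes/s)
BYTES_PER_CY = 8             # controller read throughput per cycle
CLOCK_HZ     = 2.0e9         # inferred core clock

energy_uJ = BYTES * ENERGY_PER_B * 1e6                # ~21.8 uJ to checkpoint
read_ns   = BYTES / BYTES_PER_CY / CLOCK_HZ * 1e9     # ~114.9 ns to read
flush_us  = (read_ns * 1e-9 + BYTES / WRITE_BW) * 1e6 # ~0.91 us end to end
print(round(energy_uJ, 1), round(read_ns, 1), round(flush_us, 2))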

RELATED WORK
Many prior PSP schemes [1,4,6,9,31,37,45,51,55,64,78,85,112,115,116,122,128,146,149,150,158] have offered persistence for user programs with crash consistency guaranteed. However, they place a substantial programming burden on users, who must understand the underlying memory persistency model [73] and carefully write their code with crash consistency in mind. Moreover, these schemes often incur high run-time overhead (software approaches [140]) or significant logic complexity (hardware approaches [159]).
To this end, Narayanan et al. [107] propose the first WSP scheme, which flushes all volatile data, e.g., architectural registers, caches, and DRAM contents, to NAND flash storage upon an impending power outage. Unfortunately, just-in-time (JIT) checkpointing of all this data requires a considerable amount of energy to be secured at all times, which calls for an expensive uninterruptible power supply (UPS). To lower the energy requirement, Capri [57] proposes a crash consistency mechanism based on hardware-managed redo buffers that require only a capacitor for their JIT checkpointing. In particular, the Capri compiler partitions the input program into a series of recoverable regions so that their stores never overflow the buffer. During region execution, Capri persists the data stored in the region by moving it from the redo buffer to NVM through the non-temporal path [24], bypassing the cache hierarchy completely. However, Capri still suffers expensive chip area and energy overheads due to its per-core capacitor-backed redo buffer (54KB each). On the other hand, ReplayCache [152], another WSP scheme for energy harvesting systems, incurs high run-time overhead because of frequent pipeline stalls at the end of compiler-formed store-integrity regions.
In summary, the overheads of the prior WSP schemes are so significant that they cannot enable lightweight yet performant WSP. With store integrity implemented via a simple register renaming trick, PPA achieves high-performance WSP for all programs at a negligible hardware cost. As shown in Table 6, PPA outperforms all prior WSP schemes across all comparison criteria.

CONCLUSION
This paper proposes PPA, the first microarchitectural approach to WSP. As the basis for crash consistency and lightweight WSP, PPA realizes so-called store integrity in the out-of-order core pipeline. That is, PPA prevents store registers from being overwritten and dynamically partitions the program into a series of regions whose boundaries are delineated whenever the physical register file runs out. Upon impending power failure, PPA checkpoints the minimal architectural state, including the preserved store registers, using a tiny capacitor. When power comes back, PPA restores the checkpointed state, replays (persists) the stores of the power-interrupted region, and resumes the program from the latest committed instruction before the failure. Experimental results with 41 applications highlight the benefits of PPA, which incurs only a 2% average run-time overhead and a 0.005% chip area cost. We believe that PPA lays the foundation for WSP and paves the way to realizing it for all.

Figure 2 :
Figure 2: PPA overview; for store integrity, physical register p0 is not recycled even after the multiplication commits

Figure 3 :
Figure 3: PPA with Intel's memory mode; rounded rectangles correspond to new components, while thick lines indicate new signal or data paths; shaded parts are JIT-checkpointed upon an outage (PPA checkpoints only those registers masked by MaskReg/CRT)

Figure 4 :
Figure 4: Impact of register renaming on the region length

At first glance, forming store-integrity-preserving regions seems easy, i.e., place a region boundary right before the redefinition of a store register to preserve its value within a region. For example, placing a region boundary (persist barrier) after store r2 in Figure 4 ensures store integrity and post-crash consistency, but it yields short regions because of the write-after-read (WAR) dependence on store register r2. A sophisticated compiler approach might form relatively longer regions by renaming the redefinitions of previous store registers (unless architectural registers run out), as in ReplayCache [152]. However, this prior approach [152] to store integrity still generates short regions due to the limited number of architectural registers, e.g., 16 general-purpose registers in x86. The crux of the problem is that ReplayCache pays persist-barrier overheads at the end of each of these short regions.
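To make the limitation concrete, here is a minimal, illustrative sketch of the compiler-level strategy: end a region whenever an architectural register read by an earlier store in the region is about to be redefined. The instruction encoding and register names are invented for illustration, not taken from the paper.

# Illustrative region partitioning under the compiler-only (architectural-register)
# view of store integrity: end a region before redefining any register that an
# earlier store in the region has read. Instructions are (dest, srcs, is_store).
def partition(instrs):
    regions, region, protected = [], [], set()
    for dest, srcs, is_store in instrs:
        if dest is not None and dest in protected:
            regions.append(region)          # persist barrier goes here
            region, protected = [], set()
        region.append((dest, srcs, is_store))
        if is_store:
            protected |= set(srcs)          # store operands must stay intact
    if region:
        regions.append(region)
    return regions

# Redefining r2 right after "store r2" forces a region break (cf. Figure 4).
trace = [(None, ["r2"], True),   # store r2
         ("r2", ["r3"], False),  # r2 redefined -> new region starts
         (None, ["r2"], True)]   # store r2 again
print([len(r) for r in partition(trace)])   # [1, 2]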

Figure 5 :
Figure 5: (a) CDF of free integer registers; (b) CDF of free floating-point registers

Figure 6 :
Figure 6: Dynamic region partitioning by physical register file size; the free list shows its status after the action

Figure 7 :
Figure 7: JIT Checkpointing logic; gray parts are checkpointed before impending power failure

Figure 10 :
Figure 10: Normalized slowdown of PPA and eADR/BBB (ideal PSP) to the baseline (running original program on PMEM's memory mode); lower is better

Figure 11 :
Figure 11: Stall cycles at the end of regions as a percentage of their execution time; lower is better

Figure 12 :
Figure 12: Increase in stall cycles at the renaming stage when the core is out of physical registers; lower is better

Figure 13 :
Figure 13: Average number of stores and other instructions in regions

Table 1 :
Comparison between PPA and CLWB

Table 3 :
Data inputs for DOE Mini-apps and WHISPER apps

Table 6 :
Comparison of PPA to prior WSP approaches