SweepCache: Intermittence-Aware Cache on the Cheap

This paper presents SweepCache, a new compiler/architecture co-design scheme that equips energy harvesting systems with a volatile cache in a performant yet lightweight way. Unlike prior just-in-time checkpointing designs, which persist volatile data just before power failure and thus dedicate additional energy to the backup, SweepCache partitions a program into a series of recoverable regions and persists stores at region granularity, fully utilizing harvested energy for computation. In particular, SweepCache introduces a persist buffer, a redo buffer resident in nonvolatile memory (NVM), to keep the main memory consistent across power failure while persisting each region's stores in a failure-atomic manner. Specifically, for write-backs during region execution, SweepCache saves their cachelines to the persist buffer. At each region end, SweepCache first flushes dirty cachelines to the buffer, allowing the next region to start with a clean cache, and then moves all buffered cachelines to their corresponding NVM locations. In this way, no matter when power failure occurs, either the buffer contents or their memory locations remain intact, which serves as a basis for correct recovery. To hide the persistence delay, SweepCache speculatively starts a region right after the prior region finishes its execution, as if the prior region's stores were already persisted, with each of the two regions having its own persist buffer, i.e., dual-buffering. This region-level parallelism helps SweepCache achieve the full potential of a high-performance data cache. The experimental results show that compared to the original cache-free nonvolatile processor, SweepCache delivers speedups of 14.60x and 14.86x (outperforming the state-of-the-art work by 3.47x and 3.49x) for two representative energy harvesting power traces, respectively.

CCS CONCEPTS
• Computer systems organization → Embedded systems.


INTRODUCTION
Energy harvesting systems [79] are becoming more prevalent in a wide range of applications, e.g., vehicle tire pressure sensing, health and wellness monitoring [8,18,21,75], and wearable computing [10,46,63,64], just to name a few. However, due to the unstable nature of the energy sources, e.g., radio frequency (RF) and WiFi, these applications experience frequent and unpredictable power failure during program execution; this execution model is called intermittent computation.
To withstand frequent power failure, previous studies proposed a nonvolatile processor (NVP) [57,59,60,87]. It provides an illusion of continuous execution to applications by leveraging both byte-addressable nonvolatile memory (NVM) as the main memory and voltage-monitor-based volatile data checkpointing. Whenever the monitor detects a voltage drop below a predefined threshold, i.e., a sign of impending power failure, NVP is interrupted to checkpoint all registers before the failure; this is so-called just-in-time (JIT) checkpointing. In the wake of the failure, NVP restores the registers and continues to make progress from the interruption point, as if the program had never been power-interrupted.
Nonetheless, the performance of NVP is limited by NVM accesses, which are the most expensive operations in terms of both energy consumption and instruction latency. While caching hot data enables more progress under the same amount of harvested energy, equipping NVP with an SRAM cache puts significant pressure on crash consistency mechanisms [97]. For example, upon power failure, the volatile cache loses all its data, including dirty cachelines; therefore, checkpointing only the registers is not sufficient for correct recovery.
To address this problem, previous designs checkpoint and restore not only the register file but also the cache across power failure; an NVSRAM cache [11,25,48,49,65,83,85] deploys a nonvolatile counterpart to back up the whole SRAM cache right before power failure. Since checkpointing the entire cache consumes too much harvested energy, thus limiting forward progress, Liu et al. devised a partial backup strategy [58], while others exploited a hybrid cache architecture that checkpoints only dirty cachelines [86,94]. Meanwhile, rather than checkpointing cachelines, ReplayCache re-executes any potentially unpersisted stores to restore consistent memory states before resuming the power-interrupted program [97], at the cost of persisting each store during program execution, i.e., giving up on persist coalescing [37,77].
However, all prior cache-enabled designs rely on JIT checkpointing of registers, which incurs nontrivial hardware complexity [26], e.g., the voltage monitor, the backup/restoration signal handling logic, the nonvolatile flip-flops (NVFFs) that should be laid out next to the volatile registers for fast data movement [44,47,71,74,78], and the NVFF controller; NVSRAM approaches require additional complexity for JIT checkpointing of the SRAM cache [58,86,94]. Furthermore, JIT checkpointing requires that the backup be performed in a failure-atomic way without power interruption, which would otherwise fail to achieve crash consistency. Unfortunately, this forces energy harvesting systems to dedicate a large amount of hard-won energy to the failure-atomic backup at all times, since power failure can occur at any time, leaving only a portion of harvested energy for computation. More importantly, JIT checkpointing might suffer from capacitor degradation [13,80], putting the failure atomicity at stake, and from a long voltage detection delay (Section 2.2).
To this end, this paper proposes SweepCache, a novel JIT-checkpoint-free design that achieves lightweight yet performant intermittent computation for cache-enabled energy harvesting systems by using intelligent compiler/architecture interaction. SweepCache's compiler partitions a program into a series of recoverable regions, with their live-out registers checkpointed via store instructions, so that each region boundary serves as a recovery point for power failure. Then, to facilitate power failure recovery, SweepCache's architecture runs the regions with region-level persistence, i.e., all stores of a region, including the checkpoint stores, must be persisted to NVM before the next region starts.
In case a region is power-interrupted, SweepCache holds all data of its stores in an NVM-resident buffer, which we call the persist buffer, before persisting them to the main memory (NVM). More precisely, during region execution, all cache write-backs (i.e., dirty cacheline evictions) are first quarantined in the persist buffer; it thus acts like a redo buffer that protects the main memory against the stores of power-interrupted regions. Hence, no matter when power failure occurs, either the buffer contents or their target NVM locations always remain intact, which serves as a basis for correct power failure recovery. To realize region-level persistence, when program control reaches each region end, SweepCache flushes all dirty cachelines to the persist buffer and then moves them to NVM. This effectively makes each region begin with a clean cache lacking dirty cachelines, which allows any power interruption in the middle of a region to be recovered by simply restarting the interrupted region, without worrying about persisting prior regions.
Although region-level persistence simplifies the recovery protocol, a couple of challenges must be addressed to put SweepCache into practice. First, the persistence delay at each region end incurs significant performance degradation because the next region cannot start until the previous one becomes fully persistent in NVM; this is the case for ReplayCache even though it exploits in-region parallelism via ILP. To minimize the delay across regions, SweepCache introduces region-level parallelism, overlapping the persistence latency with the execution of the following region. In other words, the next region always speculatively starts without delay, as if the prior region were persisted. In addition, an inherent benefit of region-level parallelism is that it hides the latency for persisting the data (e.g., registers and dirty cachelines) that prior schemes [11,25,31,48,49,58,65,85,97,98] must pay for JIT checkpointing when power is about to be cut off.
Second, since the persist buffer sits on the data path, handling load cache misses becomes significantly more complicated. A load miss should search the buffer, in case the latest value is held therein, before accessing NVM, which lengthens the critical path of handling cache misses. This is particularly problematic for performance in that the persist buffer is allocated in NVM and thus suffers from the same latency and bandwidth issues. To overcome this challenge, SweepCache makes an important observation: the persist buffer is empty most of the time due to the relatively short region size and the low cache miss rate of the benchmarks tested. The implication is that the buffer search can often be bypassed to shorten the critical path of load misses.
Taking that into account, SweepCache devises a single bit, called the empty-bit, to indicate whether the persist buffer is currently empty or populated. That is, load misses first consult the empty-bit to decide whether to bypass the buffer search. According to our experimental results, this simple bit consultation allows SweepCache to direct 99% of load misses to NVM without accessing the persist buffer, thereby realizing the full potential of a volatile cache for high-performance intermittent computation. The experiment with 26 benchmarks from MiBench [24] and MediaBench [45] shows that compared to the original cache-free NVP, SweepCache achieves speedups of 14.60x and 14.86x (outperforming the state-of-the-art work, ReplayCache, by 3.47x and 3.49x) for two representative energy harvesting power traces, respectively.

BACKGROUND AND MOTIVATION

Basics of Energy Harvesting Systems
Energy harvesting systems collect ambient energy, e.g., RF and Wi-Fi, in a small capacitor [6,7,13,14,16,39,52,81,82]. However, due to the unstable energy source and the lack of a battery, these systems undergo frequent power failure [3,14,16,61,62,89]. While they employ NVM as the main memory to survive power failure, a power outage still results in the loss of volatile register data.
Thus, prior studies proposed JIT checkpointing, which persists the register values right before power failure [17,31,59,60]. For example, NVP checkpoints all registers in NVFFs closely integrated into the volatile register file [59,60], while QuickRecall [31] writes the registers to NVM. As shown in Figure 1(a), both leverage a voltage monitor to detect impending power failure. To be more specific, if the voltage drops below a predefined threshold (e.g., V_backup), NVP is interrupted and copies the register values to the NVFFs to save the architectural state. Once the voltage recovers to a certain level (e.g., V_on), the register values are restored from the checkpointed values in the NVFFs, and NVP then resumes exactly from where it had been interrupted. Since NVP uses NVM as the main memory without a volatile cache in between, all committed stores are guaranteed to be persistent in NVM, i.e., a store is the granularity of failure atomicity in the architecture. By combining JIT checkpointing and the nonvolatile main memory, cache-free energy harvesting systems such as NVPs can guarantee crash consistency even in the presence of frequent power outages.
However, energy harvesting systems often refrain from exploiting a volatile cache, mainly due to its crash consistency challenge. Nevertheless, equipping them with a cache has a high potential to improve performance and energy efficiency [11,48,49,97]; e.g., a cache-enabled NVP can deliver further forward execution progress by avoiding NVM accesses on cache hits compared to the original cache-free design. Therefore, enabling caches can open new use cases for energy harvesting systems and put them into practice.

Enabling Caches with JIT Checkpointing
Recent work has studied the problem of enabling volatile SRAM caches for energy harvesting systems [4,11,25,48,49,65,83,85,97]. However, the problem turns out to be challenging because volatile caches may lead to crash inconsistency unless they are backed up and restored across a power outage. For example, at the moment of an outage, all cache contents disappear, including dirty cachelines. Thus, merely restoring register values is not sufficient for correct recovery, resulting in inconsistent memory states across the outage.
There have been multiple approaches to dealing with such a memory inconsistency issue for cache-enabled energy harvesting systems. First, a straightforward but naive solution is to leverage a volatile write-through cache, shown in Figure 1(b). Here, the write-through cache and NVM always hold the same value for every committed store instruction, i.e., it is possible to recover the consistent program state without worrying about the loss of volatile cache data. However, the write-through cache pays a high persistence overhead in that each store instruction cannot be committed until the corresponding cacheline is written to NVM. Such a long store latency is particularly harmful to energy harvesting systems since they do not use out-of-order pipelines that could tolerate the latency. Meanwhile, the frequent NVM writes consume a large amount of harvested energy.
As shown in Figure 1(c), the second approach (NVSRAM) uses a volatile write-back cache with a nonvolatile counterpart as backup storage [11,25,48,49,58,65,83,85]. NVSRAM, combined with JIT checkpointing, flushes all SRAM cache contents (or only dirty cachelines) to the nonvolatile counterpart before impending power failure, thus being free from the memory inconsistency problem. However, even backing up only dirty cachelines requires NVSRAM to reserve enough energy to afford a whole-cache backup, to guarantee failure-atomic JIT checkpointing in case all cachelines are dirty. Another problem is that for swift backup/restoration, NVSRAM resorts to parallel data transfer; this results in non-trivial energy consumption and high inrush current, which may cause significant energy/reliability issues. Moreover, the NVM counterpart also incurs extra area costs, e.g., a 32KB NVSRAM cache leads to over a 4.8x larger chip area [58].
The state-of-the-art work, ReplayCache [97], enables volatile caches in a more advanced way than prior work. Unlike NVSRAM, ReplayCache does not need an NVM backup for the SRAM cache, as shown in Figure 1(d). Instead, ReplayCache leverages so-called store integrity; it preserves the operands of each store so that potentially unpersisted stores left behind by power failure can be replayed for recovery. To achieve this, the compiler partitions the program into a series of regions where store integrity is enforced, i.e., no store's registers are overwritten within each region. In the wake of a power failure that interrupted a region, ReplayCache first replays its unpersisted stores to bring NVM states up to date and then resumes the program from the interruption point. In this way, ReplayCache resolves the memory inconsistency problem.
Yet, for each region to fully use the register file without breaking the store integrity of the prior region(s), ReplayCache cannot start a region until all stores of the prior region are persisted. To this end, ReplayCache persists them asynchronously using clwb during region execution, with a store fence placed at the end of each region. Apart from the possible persistence delay between regions, ReplayCache loses persist coalescing [37,77], in that every single store is followed by a 64-byte cacheline flush (clwb), causing high write amplification and energy consumption. Moreover, to ensure correct recovery, ReplayCache must load data from NVFFs (or NVM) to execute a recovery block that replays stores sequentially, which leads to slow recovery.
In particular, a common problem of the aforementioned prior schemes is that they all rely on JIT checkpointing, which incurs the aforestated hardware complexity. More importantly, prior work must set voltage thresholds high to ensure the failure-atomic backup, which leaves less energy for computation and therefore degrades performance. It is also important to note that JIT checkpointing is vulnerable to capacitor degradation, as demonstrated by recent work [13,80]; e.g., the capacitor may deliver only 90% of its original output after roughly 7 days under typical power traces. This implies that the voltage threshold should be set higher than usual for the JIT backup to work safely in case the capacitor degrades over time. Unfortunately, such a high voltage margin incurs a huge performance overhead; e.g., in our evaluation, a 20% threshold increase leads to a 1.4x slowdown while a 40% increase leads to a 2.5x slowdown.
Finally, the voltage monitor for JIT checkpointing usually must detect two different voltage thresholds, one for backup and one for restoration, where V_backup > V_on according to MPPT (maximum power point tracking) [91]; in contrast, SweepCache only needs a single voltage threshold to indicate an appropriate recovery point, i.e., when to reboot. The takeaway is that the prior schemes rely on a more complex voltage detection circuit than SweepCache's single-threshold voltage comparator, thereby incurring longer voltage detection delays, also known as propagation delays. For example, in prior work [23,58,87,97], the voltage detector has 1.5 us (t_hl) and 10.3 us (t_lh) propagation delays with at least a 20 uA current supply, whereas a simple same-year-technology voltage comparator [28] has only 0.88 us (t_hl) and 1.1 us (t_lh) propagation delays with a 12 uA current supply.

SWEEPCACHE APPROACH
SweepCache is JIT-checkpoint-free and lightweight. It pursues performant cache-enabled energy harvesting systems that consume the majority of harvested energy for computation, instead of reserving that energy for JIT checkpointing. However, it is challenging to provide crash consistency for both the register file and the cache without JIT checkpointing, which otherwise facilitates checkpointing and recovery to a large extent.
To address this challenge, SweepCache proposes a compiler and architecture co-design that performs region-level persistence and failure recovery at a low cost. With SweepCache's compiler, the input program is partitioned into a series of regions where live-out registers are checkpointed via store instructions, and the region boundary serves as a recovery point for power failure (Section 3.1). Besides, as shown in Figure 1(e), SweepCache introduces a persist buffer. It acts as a redo buffer not only to protect the main memory from the stores of a power-interrupted region but also to delegate the persistence of each region's stored data at its end, letting the pipeline keep executing the following regions to hide the persistence latency (Section 3.2 & Section 3.3). Across power failure, SweepCache consults the persist buffer, as needed depending on where the program is interrupted (e.g., in-region or between regions), to resume the interrupted program correctly (Section 3.4). Figure 2 shows the design overview of SweepCache.

Compiler-Assisted Register Checkpointing
To eliminate the expensive hardware structures for JIT-checkpointing volatile registers, SweepCache leverages compiler techniques to transform the program so that register values are checkpointed (via stores) to NVM at region granularity and thus can be used for region-level failure recovery.
Unfortunately, it is hard to design such a checkpoint-based power-failure-resilient scheme because of two problems: (1) determining which registers should be checkpointed to NVM and (2) deciding where to put the register checkpoints. To address these problems, SweepCache leverages persist-buffer-directed region formation [16,27,35,99]; it partitions the program into a series of regions (each a sequence of instructions regardless of branches) so that the persist buffer never overflows during the execution of each region with its live-out [2] registers checkpointed. As shown in Figure 2(a), where stores are normal stores and ckpt stores are register-checkpointing stores, the number of stores in each region is guaranteed to be smaller than the buffer size (see more in Section 4.1); a minimal sketch of this partitioning is shown below.
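As a rough illustration, the following sketch partitions a straight-line instruction stream so that no region can hold more stores than the persist buffer capacity; Instr and formRegions are hypothetical simplifications of the actual LLVM pass, which additionally handles branches, loop headers, and the checkpoint stores counted in later passes.

    // A minimal sketch of persist-buffer-directed region formation over a
    // simplified straight-line instruction stream (hypothetical types).
    #include <cstddef>
    #include <vector>

    struct Instr { bool is_store; };

    // Return the indices at which a new region must start so that no region
    // contains more stores than the persist buffer can hold.
    std::vector<std::size_t> formRegions(const std::vector<Instr>& prog,
                                         std::size_t threshold /* e.g., 64 */) {
        std::vector<std::size_t> boundaries;
        std::size_t stores_in_region = 0;
        for (std::size_t i = 0; i < prog.size(); ++i) {
            if (!prog[i].is_store) continue;
            if (stores_in_region == threshold) {  // next store would overflow
                boundaries.push_back(i);          // cut a boundary before it
                stores_in_region = 0;
            }
            ++stores_in_region;
        }
        return boundaries;
    }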

Region-Level Store Persistence
The main obstacle to enabling a volatile cache in energy harvesting systems is that partial updates from the cache to NVM cause inconsistency across power failure. Unlike JIT-checkpoint designs, SweepCache has no way of interrupting a program on an outage and restoring the cache in the wake of the outage, and thus it cannot resume from the interruption point. Without the luxury of JIT checkpointing, SweepCache instead offers persistence and recovery at region granularity by deploying the persist buffer as a safety net that prevents partial updates, for crash consistency.
To be more specific, SweepCache utilizes the persist buffer as an intermediary between the cache and NVM. Spatially, the persistence process of SweepCache is divided into two phases (s-phase1 and s-phase2) depending on the actions made to the persist buffer; as shown in Figure 2(b), SweepCache writes back the cache to the persist buffer (①), and then it flushes the buffer to NVM (②). Temporally, however, the persistence process is split into three phases. During the execution of each region, all write-backs from the cache are piled in the persist buffer (t-phase1), keeping NVM intact and protecting it from such partial updates in case the region is power-interrupted. At the region boundary where the region finishes, SweepCache flushes all dirty cachelines into the persist buffer (t-phase2). Finally, the persist buffer contents are all moved to NVM (t-phase3). Because power failure can occur in the middle of t-phase3, the persist buffer must be NVM-resident; otherwise the buffer contents that have not yet been persisted in NVM would be lost, and such partial NVM updates would make correct power failure recovery impossible. To deal with power failure during t-phase3, SweepCache restarts t-phase3 in the wake of the failure by accessing the NVM-resident persist buffer (more details are deferred to Section 4.2).
In this way, SweepCache ensures correct region-level persistence no matter when power failure happens, in that either the NVM or the buffer always remains consistently available. In particular, for fast data movement from the buffer to NVM, SweepCache leverages direct memory access (DMA), an existing hardware component in commodity energy harvesting systems; e.g., MSP430-series microcontrollers have already adopted DMA [30,88]. Note that, unlike the traditional 2-phase commit, the 3-phase design of SweepCache effectively offloads the data handled by the first phase of the original 2-phase commit to t-phase1. This design improves performance since the in-region write-backs during t-phase1 can be overlapped with regular program execution, entering the last phase faster than the original 2-phase commit. Yet, to simplify the description, the following sections use the two spatial phases (① and ②) to refer to our persistence process. A minimal sketch of the region-end sequence is shown below.
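The following software-level sketch renders the two spatial phases at a region end; all types (CacheLine, Entry, PersistBuffer) and the flat NVM image are hypothetical stand-ins for the hardware structures, and the real design drains the buffer with the DMA engine rather than memcpy.

    // A minimal sketch of the region-end persistence sequence (assumed types).
    #include <cstdint>
    #include <cstring>
    #include <vector>

    constexpr std::size_t kLineBytes = 64;

    struct CacheLine { uint64_t addr; uint8_t data[kLineBytes]; bool dirty; };
    struct Entry     { uint64_t addr; uint8_t data[kLineBytes]; };

    struct PersistBuffer {                  // assumed NVM-resident
        std::vector<Entry> entries;
        bool phase1Complete = false, phase2Complete = false;
    };

    void persistRegion(std::vector<CacheLine>& cache, PersistBuffer& buf,
                       uint8_t* nvm /* flat NVM image for the sketch */) {
        // s-phase1: flush every remaining dirty cacheline into the persist
        // buffer (in-region write-backs were already quarantined there), so
        // the next region starts with a clean cache.
        for (CacheLine& line : cache)
            if (line.dirty) {
                buf.entries.push_back({line.addr, {}});
                std::memcpy(buf.entries.back().data, line.data, kLineBytes);
                line.dirty = false;        // data stays cached, now clean
            }
        buf.phase1Complete = true;         // persistent status bit

        // s-phase2: move the buffered lines to their home NVM locations
        // (the real design uses DMA for this drain).
        for (const Entry& e : buf.entries)
            std::memcpy(nvm + e.addr, e.data, kLineBytes);
        buf.phase2Complete = true;
        buf.entries.clear();               // buffer is empty again
    }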

Region-Level Parallelism
To achieve region-level persistence, a region cannot start executing until the previous region is persisted; see the persistence latency in Figure 3(a). One critical issue here is that the prolonged persistence latency at each region boundary can significantly degrade overall performance. To mitigate this issue, SweepCache introduces region-level parallelism, which allows the next region to speculatively execute as if the prior region were already persisted. This helps hide the persistence latency but ends up with structural hazards, as adjacent regions compete for the persist buffer. Specifically, before the prior region finishes its persistence, the next region's speculative execution can overwrite the buffer, thereby making region-level persistence fail to achieve crash consistency. Ideally, each region would be assigned a separate buffer, which incurs nontrivial hardware costs. Alternatively, the following region could wait for the prior region to complete its persistence, which hurts performance considerably. Fortunately, it turns out that two persist buffers are sufficient to provide high parallelism (Section 6.3), hiding the persistence latency without compromising the crash consistency guarantee. According to our experimental results, the efficiency of SweepCache's parallelism is over 91%, i.e., the actual waiting time is insignificant most of the time. Such effective region-level parallelism is the basis for SweepCache to checkpoint the register values into NVM in a performant way, even though it lacks expensive NVFF/NVSRAM. In contrast, JIT-checkpoint designs cannot hide the latency of persisting both the register file and the cache in their backup stage. A minimal sketch of the dual-buffering schedule is shown below.
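The following sketch renders the dual-buffering schedule, assuming persistence runs asynchronously in hardware; executeRegion and startAsyncPersist are stubs (the latter completes immediately so the sketch terminates), and the stall loop models the rare case where region i's buffer is still draining on behalf of region i-2.

    // A minimal sketch of dual-buffering with speculative region starts.
    #include <cstddef>

    struct PersistBuffer { bool phase1Complete = true, phase2Complete = true; };

    static void executeRegion(std::size_t /*id*/, PersistBuffer& /*buf*/) {
        // Stub: the pipeline executes the region; write-backs fill the buffer.
    }

    static void startAsyncPersist(PersistBuffer& b) {
        // Stub: hardware drains the buffer asynchronously (s-phase1, then
        // s-phase2), overlapping with the next region; completed eagerly here.
        b.phase1Complete = b.phase2Complete = true;
    }

    void runRegions(std::size_t num_regions, PersistBuffer (&buf)[2]) {
        for (std::size_t i = 0; i < num_regions; ++i) {
            PersistBuffer& mine = buf[i % 2];  // regions alternate buffers
            while (!mine.phase2Complete) {     // stall only if region i-2 is
                /* spin */                     // still persisting; rare
            }
            mine.phase1Complete = mine.phase2Complete = false;
            executeRegion(i, mine);            // speculative: region i-1 may
                                               // still persist via the other buffer
            startAsyncPersist(mine);
        }
    }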

Region-Level Failure Recovery
To recover from power failure, SweepCache takes appropriate recovery actions according to where the failure occurs, i.e., before or after the completion of s-phase1, as shown in Figure 2(c).
Since the persist buffer is a nonvolatile intermediary between the cache and NVM, its persistence status indicates whether a region is persisted or not. If the buffer's persistence process is incomplete at the point of power failure, i.e., the failure occurs before s-phase1 completes, the current region is not persisted yet, while the NVM state is unaffected (see Section 4.2 for more details). Thus, in the wake of the power failure, SweepCache discards the buffer contents and rolls back to the beginning of the power-interrupted region for correct recovery. On the other hand, if the buffer's persistence process is already complete upon power failure, i.e., the failure occurs after s-phase1, the region is considered successfully persisted, and therefore SweepCache restarts from the next region's beginning after power comes back, as shown in the figure.

Nevertheless, the region formation is not as simple as sequentially performing region partitioning and live-out register checkpointing, because of a circular dependence. On the one hand, checkpoint stores influence the number of stores that can be accommodated in a region. On the other hand, the number of stores can change the location of the initial region boundary, possibly leading to more live-out registers being checkpointed, in which case the region boundary might move further, thus forming a circular dependence. To break the dependence, SweepCache's compiler leverages the region formation techniques of prior work [16,35,56,99], as follows.
Region Formation: The compiler first partitions the program at callsites and loop headers. Specifically, it inserts a region boundary at all the entry and exit points of functions. Then, to avoid exceeding the store threshold in a loop, a region boundary is also placed at the header of every loop (the only exception being loops that have no store in the body), i.e., each loop iteration starts/ends with the boundary, as shown in Figure 4(a); of course, the loop body may need additional boundaries (not shown in the figure) to keep the store count of its regions under the threshold during the CFG traversal. In this way, the number of stores per region is guaranteed not to exceed the threshold even for a long-running loop with many iterations, no matter how many stores exist in the loop body.
However, for a small loop body comprising a few stores, such a boundary at the header could end up with a limited region size. For example, as shown in Figure 4(a), each iteration forms a region with 5 stores, which is way smaller than it should be, assuming the store threshold is 10. The problem of such a loop-header boundary is that it might generate many small regions, which could in turn increase the number of checkpoint stores due to additional live-outs across more region boundaries. To tackle this issue, SweepCache's compiler leverages loop unrolling to enlarge the region size, as shown in Figure 4(b).

After the initial region formation, the compiler proceeds to the step of checkpoint store insertion. In particular, to facilitate this step, the compiler first splits the basic blocks that have region boundaries inside them, thereby ensuring that regions always start at the beginning of basic blocks. That is necessary because of the granularity mismatch between the two compiler analyses, i.e., liveness analysis is generally conducted at the level of basic blocks whereas checkpoint store insertion is performed at the granularity of regions. After the basic block splitting, the compiler analyzes the regions to identify the live-out variables and inserts their checkpoint stores right after the last update point of the variables in each region.
Then, the compiler traverses the CFG again in topological order, trying to combine initial regions whose store count is smaller than the threshold into larger regions as much as possible. This brings two benefits: (1) it extends the region size, and (2) it often eliminates many checkpoint stores, in that a live-out register of a region is no longer live provided that the following region being merged redefines the register. Because of region combining, the store count of a merged region may exceed the store threshold; if that is the case, the compiler places a new boundary in the middle of the region to guarantee its stores never overflow the persist buffer and recalculates the number of live-out registers of the newly partitioned regions. It is important to note that this merging/repartitioning process is repeated until no region has more stores than the threshold, which resolves the issue of the circular dependence; a minimal sketch of the merge step appears below. Nevertheless, it would be a mistake to expect the resulting regions to have exactly as many stores as the threshold; recall that the threshold indicates the maximum number of stores in each region.
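The following sketch models the merge step with regions reduced to bare store counts; a real pass would also recompute live-outs (merging can delete checkpoint stores) and split any oversized region before iterating again, as described above.

    // A minimal sketch of region combining under a store threshold.
    #include <cstddef>
    #include <iterator>
    #include <list>

    struct Region { std::size_t stores; };   // simplified: store count only

    // Merge neighbors while the combined count still fits the threshold.
    void combineRegions(std::list<Region>& regions, std::size_t threshold) {
        auto it = regions.begin();
        while (it != regions.end()) {
            auto next = std::next(it);
            if (next == regions.end()) break;
            if (it->stores + next->stores <= threshold) {
                it->stores += next->stores;  // merge `next` into `it` and
                regions.erase(next);         // retry from the merged region
            } else {
                ++it;
            }
        }
    }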
Checkpoint Storage Management: To facilitate access to register checkpoints during power failure recovery, the compiler maps all registers to a global array with dedicated slots. For example, register r0 is mapped to index zero, i.e., a checkpoint store uses a fixed destination address depending on which register is checkpointed, so that the checkpoints can be easily accessed through an index into the array. This is feasible since the number of architectural registers is predetermined by the ISA. In the wake of power failure, SweepCache's recovery runtime reloads the values of the checkpointed live-out registers from NVM using the mapping, in order to ensure crash consistency. A minimal sketch of the mapping is shown below.
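The following sketch shows the fixed register-to-slot mapping; the array name ckpt_area and the 32-register count are illustrative assumptions, with the real slots living in NVM.

    // A minimal sketch of checkpoint storage with fixed per-register slots.
    #include <cstdint>

    constexpr int kNumArchRegs = 32;          // fixed by the ISA
    uint64_t ckpt_area[kNumArchRegs];         // stands in for the NVM slots

    // Emitted by the compiler right after the last in-region update of a
    // live-out register: store rN to its dedicated slot (fixed address).
    inline void checkpointStore(int reg_index, uint64_t reg_value) {
        ckpt_area[reg_index] = reg_value;
    }

    // Used by the recovery runtime after power comes back.
    inline uint64_t restoreRegister(int reg_index) {
        return ckpt_area[reg_index];
    }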
Forward Progress and I/O Functions: To guarantee forward progress without stagnation [14,15,82], SweepCache leverages the EH model [82] to estimate the worst-case energy of a region's execution and its recovery. That is, the compiler checks whether some regions are too long to be executed with the underlying capacitor energy and splits such long regions so that they can finish even in the presence of power failure. Finally, supporting non-recoverable operations such as I/O has remained an open problem. However, since SweepCache places region boundaries at all callsites, a function that implements I/O operations is treated as a separate region. Then, SweepCache can leverage the techniques in prior work [15] to guarantee that I/O operations always start with a fully charged capacitor. Thus, I/O operations can always complete successfully without power failure.

Recovery Protocol
To ensure correct recovery, SweepCache leverages the persistence status of the persist buffer to determine the appropriate protocol. To manage the persistence status, SweepCache introduces two additional bits, namely phase1Complete and phase2Complete, for each buffer. These bits indicate whether the corresponding phase is complete, with their initial values set to 0. When a phase is completed, the corresponding bit is set to 1. The bits are stored in a single persistent register that resides in the memory controller and is read/written by controller logic similar to that of prior work [101]. At runtime, depending on the power failure point, SweepCache observes one of three persistence statuses of the (phase1Complete, phase2Complete) bits: (0, 0), (1, 0), and (1, 1).
The first case is (0, 0), meaning a power outage occurred before s-phase1 completed. In such a case, buffer persistence is not complete and the NVM is not affected at all. After power comes back, SweepCache ignores the contents of the buffer, restores the saved registers including the PC register, and jumps to the PC. Note that the PC here points to the beginning of the current region; it was preserved at the end of the preceding region. The PC saved by the current region, which points to the start of the subsequent region, has not yet been written to NVM.
The second case, (1, 0), indicates that s-phase1 is complete but s-phase2 is not. Since the first phase is complete, all the updated data is already in the buffer. Because the buffer is nonvolatile, all the data, including the saved register values, remains in the buffer during the power outage. Thus, SweepCache does not need to roll back to the beginning of the current region. Instead, SweepCache re-executes the second phase. After that, it restores the saved register values and jumps to the PC, which here points to the beginning of the next region. Let us explain why SweepCache needs to flush dirty cachelines at each region end. Upon power failure, the cache loses all its data, including the dirty cachelines. However, during a region, only evicted dirty cachelines are written back. Thus, jumping to the next region during recovery without considering the non-evicted dirty lines (which may include register checkpoint stores) cannot realize correct recovery. Therefore, SweepCache flushes all dirty cachelines at each region boundary. Note that the flushed data still remains in the cache with the dirty bits reset to 0.
In the third case, (1, 1), both phases are complete, and the recovery is simple. SweepCache just restores the saved register values from NVM and jumps to the PC, which points to the beginning of the region interrupted by the power outage. A minimal sketch of this three-way recovery dispatch follows.
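The following sketch dispatches on the two phaseComplete bits as described above; the PC helpers and redoPhase2 are hypothetical stand-ins for the recovery runtime, stubbed so the sketch stays self-contained.

    // A minimal sketch of the recovery dispatch on reboot.
    #include <cstdint>
    #include <vector>

    struct PersistBuffer {
        std::vector<uint64_t> entries;       // simplified redo records
        bool phase1Complete = false, phase2Complete = false;
    };

    // Hypothetical stand-ins for the recovery runtime's primitives.
    static void redoPhase2(PersistBuffer&) {}           // re-drain buffer to NVM
    static uint64_t savedRegionStartPC() { return 0; }  // PC saved by prior region
    static uint64_t savedNextRegionPC()  { return 0; }  // PC saved by this region
    static void restoreRegistersAndJump(uint64_t) {}    // reload ckpt_area, jump

    void recover(PersistBuffer& buf) {
        if (!buf.phase1Complete) {
            // (0, 0): s-phase1 incomplete; NVM is untouched by this region.
            buf.entries.clear();             // discard partial buffer contents
            restoreRegistersAndJump(savedRegionStartPC());  // roll back region
        } else if (!buf.phase2Complete) {
            // (1, 0): the region's data is already safe in the NVM buffer.
            redoPhase2(buf);                 // idempotent: redo the drain
            restoreRegistersAndJump(savedNextRegionPC());
        } else {
            // (1, 1): fully persisted; resume at the interrupted region.
            restoreRegistersAndJump(savedNextRegionPC());
        }
    }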

Write-After-Write
To ensure region-level persistence, SweepCache must be careful about write-after-write (WAW) cases. Since SweepCache utilizes region-level parallelism to hide the persistence latency, a dirty cacheline of the prior region may be overwritten by the current region's stores before being written back to the persist buffer. To solve this problem, SweepCache leverages the phase1Complete bit mentioned in Section 4.2, in conjunction with the cache dirty bits, to indicate whether a specific cacheline still belongs to the s-phase1 of the preceding region. To be more specific, if the prior region's phase1Complete bit is 0 and the dirty bit is 1, meaning that the cacheline is awaiting a flush, a store of the current region that tries to write to the same address as the dirty cacheline must wait until the phase1Complete bit becomes 1. This method can sometimes cause unnecessary waiting: if the previous region's phase1Complete bit is 0 but the dirty cacheline was produced by the current region, the store still waits even though this WAW causes no persistence issue. However, such cases are very rare in our evaluations, so the rare unnecessary waiting is acceptable. The guard condition is sketched below.
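The guard reduces to a single condition, rendered below with hypothetical types; as noted above, it is slightly conservative when the dirty line was produced by the current region.

    // A minimal sketch of the WAW guard consulted on every store.
    struct LineState { bool dirty; };

    // A store from the current region stalls only while its target cacheline
    // is dirty AND the previous region's phase1Complete bit is still 0, i.e.,
    // the line may still carry the previous region's un-flushed data.
    bool storeMustStall(const LineState& line, bool prev_phase1_complete) {
        return line.dirty && !prev_phase1_complete;
    }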

Cache Misses Handling
When a cache miss occurs, SweepCache first checks the persist buffer before accessing NVM, as the most recent data may still be present in the buffer. To search for the requested data in the buffers, CAM (content-addressable memory) would be the fastest technology but is too expensive for energy harvesting systems. Sequential search is energy-efficient but too slow, since two buffers may need to be searched if the previous region has not completed its s-phase2. Therefore, SweepCache requires a cost-effective yet high-performance search method for handling cache misses.
Since SweepCache only needs to consult the buffers on cache misses, the cache miss rate determines the frequency of consulting the buffers. Our evaluation shows that the average cache miss rate is only 3.43% with a 4kB cache, meaning that SweepCache can afford sequential search logic rather than the expensive CAM-based associative search.
Though the sequential search is affordable thanks to the low cache miss rate, it is still slow because it may incur double NVM accesses on cache misses: whenever a cache miss happens, SweepCache searches the persist buffers first, and only if the data is not found does it access NVM.
However, in our evaluations, we found that the persist buffers contain only a few entries and are empty most of the time (we do not count the entries flushed at the region boundary, since they still remain in the cache and do not cause any cache misses). This is not difficult to understand: a cacheline must satisfy two conditions to be written back to the persist buffer: (1) it is dirty, and (2) it is evicted. For the first condition, the number of dirty cachelines in each region is limited, since SweepCache always flushes dirty cachelines at each region end, leaving a clean cache for the next region. For the second condition, given the low cache miss rate, the eviction rate is also low.
Given these observations, SweepCache leverages a simple but effective method: deploying a single bit (referred to as the empty-bit) to indicate whether the buffer is empty. Thus, on a cache miss, SweepCache performs the sequential buffer search only when the bit is 0, i.e., the buffer is not empty; otherwise, it bypasses the buffer. Thanks to the low fill rate of the buffer, the empty-bit bypasses 99% of buffer searches in our evaluation, realizing high search performance with very low hardware cost (only two bits for the two persist buffers). A minimal sketch of the miss path follows.
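The following sketch shows the resulting miss path; PBuf, its sequential search, and the readNVM stub are illustrative simplifications, and the maintenance of the empty-bit on buffer fills and drains is omitted.

    // A minimal sketch of the empty-bit fast path on a load miss.
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct Entry { uint64_t addr; uint64_t data; };  // simplified record

    struct PBuf {
        bool empty_bit = true;                       // 1 = buffer is empty
        std::vector<Entry> entries;
        std::optional<uint64_t> search(uint64_t addr) const {  // sequential
            for (const Entry& e : entries)
                if (e.addr == addr) return e.data;
            return std::nullopt;
        }
    };

    static uint64_t readNVM(uint64_t) { return 0; }  // stub for the sketch

    uint64_t handleLoadMiss(uint64_t addr, PBuf (&bufs)[2]) {
        for (PBuf& b : bufs)                         // at most two buffers
            if (!b.empty_bit)                        // consult the bit first
                if (auto hit = b.search(addr))       // slow path: search
                    return *hit;
        return readNVM(addr);                        // common case: bypass
    }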

The Size of the Persist Buffer
The size of the buffer determines the number of entries it can hold, which in turn sets the store threshold used by the compiler. This threshold then affects the region size. Longer regions are more likely to be interrupted during their s-phase1, which can slow forward progress due to our roll-back recovery; moreover, more stores may generate more evictions, increasing the frequency of buffer searches on cache misses. On the other hand, longer regions tend to hide more persistence delay, leading to higher region-level parallelism. Therefore, determining the size of the persist buffer is a trade-off for achieving optimal performance.
We experimentally found that setting the size (threshold) to 64 yields relatively small average store counts and high parallelism. Each buffer entry consists of an address and data, where the data has the same granularity as a cacheline (64B); the resulting footprint is sketched below.
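As a back-of-the-envelope check under these defaults, where the 32-bit address width is an assumption for illustration, each buffer then occupies roughly 4.3kB of NVM:

    // A minimal sizing sketch of one persist buffer entry.
    #include <cstdint>

    struct PersistEntry {
        uint32_t addr;     // cacheline-aligned NVM address (assumed 32-bit)
        uint8_t  data[64]; // one full cacheline of data
    };

    // 64 entries * (4 + 64) bytes = 4352 bytes of NVM per buffer; two
    // buffers in total for dual-buffering.
    static_assert(sizeof(PersistEntry) == 68, "4B address + 64B cacheline");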

Write-Back-Instructive Table
SweepCache is required to flush all dirty cachelines at the end of each region (Section 3.4) to guarantee correct recovery and region persistence. Thus, SweepCache would need to scan all cachelines at each region boundary to determine which are dirty. However, such a scan not only lengthens s-phase1 but also impairs the accuracy of dirty cacheline identification: before the scan finishes, the next region's stores may create new dirty cachelines, and flushing the next region's dirty cachelines may cause incorrect power failure recovery.
To eliminate this overhead and guarantee correct recovery, we leverage a small SRAM bit-table (referred to as the write-back-instructive table) to indicate which cachelines are dirty at each region boundary. The table is updated during region execution, allowing all dirty cachelines to be identified at each region boundary by reading the table instead of scanning the entire cache. As with the persist buffer design, SweepCache employs two tables to prevent structural hazards between regions. The table size equals the number of cachelines (one bit per cacheline); e.g., for a 4kB cache with 64B blocks, a 64-bit table is enough. A minimal sketch is shown below.
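The following sketch models one table for a 64-line cache; in the real design markDirty would be driven by the cache's write path, and flushDirty visits exactly the dirty lines at the region boundary.

    // A minimal sketch of the write-back-instructive table (64 lines).
    #include <bit>
    #include <cstdint>

    struct WBITable {
        uint64_t bits = 0;                            // bit i <=> line i dirty

        void markDirty(unsigned line)     { bits |= (1ULL << line); }
        bool isDirty(unsigned line) const { return (bits >> line) & 1ULL; }

        // At the region boundary: visit exactly the dirty lines, then clear.
        template <typename FlushFn>
        void flushDirty(FlushFn flush) {
            for (uint64_t b = bits; b != 0; b &= b - 1)  // iterate set bits
                flush(static_cast<unsigned>(std::countr_zero(b)));
            bits = 0;
        }
    };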

DISCUSSION
In general, frequent power failure is the norm for energy harvesting systems, in which case SweepCache performs the best regardless of the energy source (RF, solar, thermal), as will be shown in Figure 10. However, for a system backed by a bulky supercapacitor that can sustain operation for a while, SweepCache might waste harvested energy on region-level persistence (Section 3.2), i.e., saving the data stored in regions to NVM in case they are power-interrupted. That is because the regions rarely encounter power failure owing to the abundant energy piled up in the supercapacitor, though such a supercapacitor causes its own issues, i.e., slow reboot time, bulky area cost, and energy inefficiency due to leakage proportional to the size of the capacitor.
While SweepCache mainly targets tiny energy harvesting systems (e.g., wearables) backed by energy-efficient small capacitors of a few hundred nF, it is possible to mitigate the energy-wasting problem for supercapacitor-equipped systems. To achieve this, we aim to enlarge our region size so that region-level persistence is conducted less frequently, thereby reducing the overhead. There are a couple of ways to do that: (1) small-function inlining [70] and (2) aggressive loop unrolling including its speculative optimization [35]. We leave applying them as future work.

Multi-core: To the best of our knowledge, there are no commodity multi-core energy harvesting systems. In the literature, a single in-order core is predominantly used due to power efficiency issues; e.g., RF energy harvesting cannot afford to power even dual-core systems. For this reason, we deliberately do not delve into the topic of multi-core systems.

EVALUATION
We implement the compiler techniques described in Section 4.1 on top of LLVM 13.0.1 [42]. All evaluated programs are compiled with the default O3 flag, on top of which our compiler optimizations are applied. To measure the impact of runtime libraries as well, we instruct the linker to link the evaluated programs against the MUSL C library, which is also compiled by SweepCache's compiler with our compilation optimizations enabled.
We conduct our experiments atop gem5 [5] with the ARM ISA to simulate a single-core in-order processor, as in the original NVP simulator [23]. As in prior work [57,97], SweepCache only uses the L1 D-cache as a volatile cache while keeping the L1 I-cache as an NVM cache. In all cache-enabled designs, the default cache is configured as 4kB with 2-way associativity. The capacitor size is set to 470nF, consistent with prior real fabricated chips [60,87,93]. The propagation delay of JIT-checkpoint designs is configured in line with prior work [23,58,87,97]. In particular, since SweepCache has no backup stage, it only incurs the restore propagation delay; to set this delay precisely, we deliberately selected a technology [28] built in the same year as the JIT-checkpoint designs' voltage detector. By default, SweepCache's persist buffer size is set to 64 entries. Other configurations [16,23,97] can be seen in Table 1. We evaluated applications with two real power traces (RFHome and RFOffice) collected from real RF energy harvesting systems [23].
The rest of this section compares two variants of SweepCache (with NVM Search or Empty-Bit Search, see Section 4.4) against ReplayCache and NVSRAM (which backs up only dirty cachelines), in terms of their speedups over the cache-free baseline NVP.

Performance without Power Outage
To analyze the performance of SweepCache, we first evaluate it without power outages. Figure 5 shows these outage-free performance results. On average, NVM Search and Empty-Bit Search achieve 8.80x and 8.91x speedups, respectively, while ReplayCache exhibits a speedup of 5.10x. We find that ReplayCache's speedup over NVP is lower than what the original paper reports; that is because we compile all the libraries, in addition to the application code, for both SweepCache and ReplayCache, which would otherwise lead to similar speedups. NVSRAM performs the best with a speedup of 11.53x. There is a two-fold reason for the performance gap between NVSRAM and SweepCache: (1) NVSRAM executes fewer instructions since it does not generate checkpoint stores; and (2) the persistence latency of SweepCache may not be fully hidden by its region-level parallelism, as shown in Section 6.3. For most of the applications, SweepCache performs better than ReplayCache. On average, NVM Search and Empty-Bit Search achieve speedups of 1.73x and 1.75x over ReplayCache, respectively. The performance gain mainly comes from three factors: (1) SweepCache generates fewer instructions than ReplayCache (Section 6.5), since ReplayCache's compiler has to generate store fence instructions and clwb instructions for every store to guarantee persistence, while SweepCache only generates checkpoint stores for live-out registers; (2) SweepCache has high parallelism efficiency, as demonstrated in Section 6.3, which overlaps most of the persistence latency; and (3) SweepCache pays low latency for searching the persist buffers, since the cache miss rate is low and the average number of filled entries in the persist buffer is very small (0.00012 per region).

Note 1: Owing to our simpler logic described in Section 2.2, SweepCache can actually afford a lower V_on; prior work [16] uses a 1.8V V_on, and according to our evaluation, SweepCache obtains an extra 10% to 15% performance gain with a 1.8V V_on. Using the same V_on (i.e., 2.8V) as the JIT-checkpoint designs thus serves as SweepCache's lower-bound performance, with which it still outperforms them significantly. Note 2: Propagation delay refers to the backup/restore voltage detection delay.
We notice two exceptions, rijndaeldec and rijndaelenc, where SweepCache does not beat ReplayCache. These two programs are small, while SweepCache generates around 2x more regions than ReplayCache. For each region, SweepCache needs to complete the two-phase persistence. Most of the time, the persistence latency can be hidden by SweepCache's region-level parallelism, but the non-overlapped part still plays a non-trivial role in such small programs' execution time.
Compared with NVM Search, Empty-Bit Search obtains a speedup of 1.18%. As mentioned before, the cache miss rate is very low, leading to only 0.00012 persist buffer accesses per region on average. Thus, even though Empty-Bit Search can bypass over 99% of buffer searches, given the rare accesses to the persist buffers, the performance gain is limited.

Performance with Power Outages
Figure 6 and Figure 7 show the performance results with power outages, for the RFHome and RFOffice power traces. For the RFHome trace, NVM Search and Empty-Bit Search achieve average speedups of 14.60x and 14.86x, while ReplayCache and NVSRAM exhibit speedups of 4.26x and 7.37x. Compared with ReplayCache, NVM Search and Empty-Bit Search attain average speedups of 3.43x and 3.49x, respectively; relative to NVM Search, Empty-Bit Search obtains a speedup of 1.75% on average. For the RFOffice trace, NVM Search and Empty-Bit Search achieve average speedups of 14.31x and 14.60x, while ReplayCache and NVSRAM reach 4.20x and 7.32x. Compared with ReplayCache, NVM Search and Empty-Bit Search deliver average speedups of 3.41x and 3.47x, respectively, while Empty-Bit Search yields a speedup of 1.96% over NVM Search.
For the power outage cases, it is impressive that SweepCache achieves much better performance than the JIT-checkpoint designs, i.e., ReplayCache and NVSRAM, which highlights the huge benefit of our JIT-checkpoint-free property. Compared with JIT-checkpoint designs, SweepCache maintains higher energy efficiency, as shown in Section 6.6. That is because SweepCache does not need to spend any hard-won energy on the JIT backup or other necessary logic such as the NVFFs and the backup/restore controller. Therefore, SweepCache allows a larger portion of harvested energy to be used for computation. In this way, SweepCache experiences fewer power failures (Table 2), which saves a significant amount of charging time for rebooting the system. Moreover, SweepCache's region-level parallelism greatly reduces the persistence delay, which JIT-checkpoint designs cannot do. Besides, the low-cost design of SweepCache also yields a much shorter propagation delay, leading to faster restoration. In addition to the above reasons, the three factors mentioned in Section 6.1 still contribute to SweepCache's significant speedup over ReplayCache for the RFHome and RFOffice traces.
In particular, the speedup of Empty-Bit Search over NVM Search under the power traces is slightly higher than in the power-outage-free case. That is because the higher cache miss rates caused by frequent outages lead to more persist buffer accesses, which highlights the role of the empty-bit. Given the superior performance of Empty-Bit Search, we choose Empty-Bit Search as the default design of SweepCache for the remaining evaluation.

Region-Level Parallelism Efficiency
SweepCache exploits region-level parallelism to hide the persistence latency, which is one of the reasons for its outstanding performance. Thus, we evaluate the parallelism efficiency of SweepCache in both the power-outage-free and power-outage cases, using the following formula:

    Efficiency = (L_persist - T_wait) / L_persist

where L_persist is the persistence latency (without parallelism) and T_wait is the actual waiting time. Higher efficiency means more of the persistence latency is hidden. Overall, we achieve an average parallelism efficiency of 91.70% for power-outage-free scenarios and 91.95% for power-outage cases.

Sensitivity Study
Cache size: Figure 8 shows the speedups of SweepCache and the competing schemes across different cache sizes, from 512B to 16KB, with the RFOffice power trace. Performance is basically proportional to the cache size: compared with NVP, the larger the cache, the higher the speedup SweepCache achieves.

Capacitor size: We also explore the impact of the capacitor size on performance. Figure 9 shows two kinds of speedups with different baselines for the RFOffice power trace: (1) the bars represent speedups over an NVP whose capacitor size varies from 100nF to 1mF along with the other schemes; (2) the line shows the speedup over an NVP whose capacitor size is fixed at 100nF, and how it varies as the capacitors of the other schemes get bigger.
In addition, Table 2 shows the average number of power outages across different capacitor sizes, excluding the initial power-off state. Overall, increasing the capacitor size leads to better performance. However, performance gains become limited once the capacitor size reaches 10uF, as all designs experience fewer power outages with a capacitor of 10uF or greater. Even then, both SweepCache variants remain ahead of NVSRAM, in contrast to the outage-free results in Figure 5; that is because NVSRAM must pay a longer propagation delay than ours when transitioning from the initial power-off state to the power-on state.
Our evaluation yields three key insights. First, for a given capacitor size, SweepCache always experiences fewer power outages than JIT-checkpoint designs due to its superior energy efficiency. Second, for a given capacitor size with power failure, SweepCache consistently outperforms JIT-checkpoint designs. Finally, a larger capacitor does not necessarily translate to better performance; for example, a 1uF SweepCache can deliver performance comparable to a 10uF NVSRAM, while the latter incurs 1.43x the area cost with only marginal performance improvements.
Power Traces: Figure 10 shows the performance with different power traces. In general, ReplayCache and NVSRAM exhibit higher speedups over NVP when operating with more stable power traces such as solar and thermal, as opposed to RF traces. Conversely, SweepCache tends to demonstrate higher speedups with RF traces. That is because JIT-checkpoint designs generate fewer checkpoints when exposed to more stable power traces, whereas SweepCache performs checkpoints at each region regardless of the power failure frequency. Overall, SweepCache still delivers the best performance across all the traces owing to its higher energy efficiency.

Propagation delay: We also study sensitivity to the propagation delay with two settings: (1) extending the delay in SweepCache and (2) reducing the delay in JIT-checkpoint designs (setting their t_lh to 3.0us and their t_hl to 0.5us) [87]. Figure 11 shows the results. We find that the propagation delay plays a non-trivial role in execution time, since it determines the speed of backup and restoration. Both settings result in an earlier occurrence of the performance turning point, at which NVSRAM starts to outperform SweepCache, compared to our default settings. This is because the two settings either slow down our restoration or expedite the backup and restoration of JIT-checkpoint designs. However, it is hard to shrink the JIT-checkpoint designs' delay that far, since doing so requires huge power consumption [87], and we believe SweepCache's propagation delay could likewise be shrunk much further with the same power consumption.
Store threshold: Recall that the threshold is the maximum number of stores in a region; it does not mean that each region has as many stores as the threshold. We evaluated the average store counts of regions at run time with different thresholds: 32, 64, 128, and 256. The resulting counts are relatively small, showing insignificant differences across the thresholds. This phenomenon results from the initial region boundaries inserted at the entry and exit points of each function call and at each loop header, in that they cannot be optimized away to extend the region size (Section 4.1). For example, region combining cannot merge such callsite boundaries, while loop unrolling is not feasible for loops whose iteration counts are unknown at compile time. Nonetheless, SweepCache can address these problems by leveraging the techniques mentioned in Section 5, i.e., aggressive function inlining [70] and speculative loop unrolling [35]. Figure 12 shows CDF results of (1) region size and (2) store count per region for all benchmarks tested with the default threshold of 64; the average store count is 3.92 (while the average region size is 19.47), which to some extent accounts for why the persist buffers are empty most of the time.

Instruction Counts
Overall, ReplayCache generates 1.64x as many instructions as SweepCache, mainly from its extra clwb instructions and store fence instructions. This ratio is much greater than when only the application programs are compiled (without the libraries), in which case it is only 1.03x. Compared with NVSRAM, SweepCache generates 15.04% more instructions.

Energy Consumption
To gauge the energy efficiency of SweepCache, we evaluate the total energy consumption under the default setting by using the power model provided by NVPSim [23] with the RFOffice trace. Compared with NVP, the normalized total energy consumption of ReplayCache, NVSRAM, and SweepCache is 20.86%, 12.37%, and 10.21%, respectively. Figure 13 also shows their backup/restore energy consumption breakdowns, normalized to NVP's, which are 23.74%, 15.42%, and 0.28%, respectively. SweepCache turns out to be the most energy-efficient.

Comparison with NvMR

This section compares SweepCache with NvMR, the state-of-the-art work that performs memory renaming to eliminate the write-after-read (WAR) dependences [4] that cause idempotence violations [26]. Once they are detected, NvMR renames the memory locations to be written. We implemented NvMR with its parameters kept the same as SweepCache's memory hierarchy shown in Table 1. Figure 14 shows the results of SweepCache and NvMR when the RFOffice trace is used (a bar corresponds to their speedup over NVP while the curve shows SweepCache's energy reduction compared to NvMR). SweepCache is significantly faster than NvMR for all capacitor settings except 1mF, which is a lot bigger than our target capacitor size, i.e., a few hundred nF (Section 5). Overall, across 7 different capacitor sizes, SweepCache achieves an average 1.71x speedup over NvMR (up to 6.04x with the 470nF capacitor), mainly due to its superior energy efficiency resulting from its JIT-checkpoint-free nature and lightweight hardware design. On average, SweepCache saves 19.94% of the energy consumed by NvMR (up to 82.3% with the 470nF capacitor).

Cache Miss Rate and Write Amplification
Compared to NVSRAM, SweepCache may encounter more cold misses since it does not save any cachelines before power failure. To evaluate the cache miss rate, we consider NVSRAM, NVSRAM-E (which backs up the entire cache), SweepCache, and ReplayCache. The results are shown in Figure 15. Overall, the cache miss rate of all designs (excluding NVSRAM-E) decreases as the power traces become more stable. We notice that ReplayCache has a higher miss rate than SweepCache even though neither design saves cachelines before a power outage; this is because ReplayCache experiences more power failures with the same capacitor size. Compared to NVSRAM, SweepCache incurs only a 7.50% increase in average cache miss rate, as it experiences fewer power outages due to its higher energy efficiency.

Furthermore, SweepCache suffers from write amplification due to the persist buffers, which double the NVM writes for every write-back. The situation is worse for ReplayCache, which generates an NVM write (clwb) for every store. However, NVSRAM also incurs additional NVM writes for backing up the registers to NVFFs and dirty cachelines to their NVM counterpart. Therefore, we count the number of NVM writes for the four designs mentioned above, as shown in Figure 16. The NVM writes of SweepCache mainly come from the write-backs at each region end, while those of ReplayCache mainly come from its clwb for every store. The NVM writes of NVSRAM and NVSRAM-E primarily stem from the backups preceding power failures and are therefore heavily influenced by the power outage frequency, resulting in fewer NVM writes under more stable power traces, i.e., solar and thermal; in contrast, the NVM writes of SweepCache and ReplayCache mainly come from regular persistence operations and do not differ significantly across power traces. On average, SweepCache incurs 4.62x as many NVM writes as NVSRAM. However, NVM writes consume only about 0.01% and 0.23% of the total energy for NVSRAM and SweepCache, respectively. Therefore, thanks to SweepCache's superior energy efficiency and its substantial region-level parallelism, which hides the majority of the NVM write latency, SweepCache still delivers better performance.

Hardware Costs

The hardware costs of SweepCache are not significant owing to its compiler-architecture co-design. Apart from the two persist buffers, SweepCache only needs a total of 134 bits for a 4kB cache, i.e., two empty-bits (one per persist buffer), four phaseComplete bits, and two 64-bit SRAM tables. This is rather minimal compared to the hardware costs of prior JIT-checkpoint designs.

RELATED WORK
There is a large body of prior research on energy harvesting systems. Apart from the NVP architecture, QuickRecall [31] is another way to deal with frequent power failure in a crash-consistent manner. Although QuickRecall obviates the need for nonvolatile flip-flops by saving registers to NVM before impending power failure, it still relies on JIT checkpointing to ensure the failure atomicity of the register saving. Since QuickRecall must save the registers by executing a series of store instructions without special hardware support, the resulting performance overhead is significant compared to that of NVP approaches. Unlike QuickRecall, some prior work attempts to equip energy harvesting systems with a data cache. NVCache explores using NVM to implement a persistent cache [1,36,67,68,76,92,95]. However, NVCache incurs both longer latency and higher energy consumption than a traditional SRAM cache. Thus, other researchers integrate SRAM on top of NVM, i.e., they leverage the NVM as JIT-checkpoint storage to save the SRAM cache contents and restore them across power failure [12,25,43,48,49,65,66,69,83,86,94,96]. To enhance the performance of the NVM backup and restoration, researchers have proposed various NVM technologies. For example, STT-RAM provides faster access time and higher energy efficiency than alternative NVM technologies at the cost of higher sensitivity to process, voltage, and temperature (PVT) variations [20], which makes errors more likely. Furthermore, the speeds of NVM backup and restoration remain a daunting challenge, as no current NVM technology can match the performance of SRAM [16,41,97].
One might ask if such a cache-enabled energy harvesting system can benefit from existing crash consistency mechanisms designed for high-performance computing systems with deep cache hierarchies and persistent memory. For instance, prior software-based recovery schemes based on undo/redo logging or idempotent processing [9,32,33,35,40,50,51,53-56,90,99,100] may seem like a potential avenue. However, they tend to incur significant performance degradation; e.g., iDO [51] for failure-atomic sections (FASEs) and Mnemosyne [90] for transactions cause up to 2-3x slowdown because of so-called persist barriers, which significantly serialize out-of-order pipeline execution.
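To see why such persist barriers hurt, consider a generic undo-logging store; this is our sketch using x86 persistence intrinsics, not iDO's or Mnemosyne's actual code. Every logged store must drain the log entry to NVM and fence before the in-place update may proceed.

```c
#include <immintrin.h>  /* _mm_clwb, _mm_sfence (compile with -mclwb) */
#include <stdint.h>

/* Generic undo-logging sketch, not taken from iDO or Mnemosyne.
 * The log entry must be durable before the in-place update, so each
 * store pays a clwb + sfence "persist barrier" that serializes the
 * out-of-order pipeline. */
typedef struct { uint64_t *addr; uint64_t old_val; } log_entry_t;
extern log_entry_t *log_tail;   /* hypothetical undo-log cursor in NVM */

void logged_store(uint64_t *addr, uint64_t new_val) {
    log_tail->addr    = addr;      /* record the location ...          */
    log_tail->old_val = *addr;     /* ... and its pre-store value      */
    _mm_clwb(log_tail);            /* write the log entry back to NVM  */
    _mm_sfence();                  /* persist barrier: log before data */
    *addr = new_val;               /* only now update in place         */
    log_tail++;
}
```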
Taking that into consideration, other recovery schemes adopt hardware-based logging [19,22,34,38,73,84] to lower the performance overhead. However, they still suffer significant pipeline stalls waiting at the end of each atomic region, e.g., a transaction or a FASE, to ensure the persistence of the region's stores. Furthermore, neither software nor hardware logging is devised for whole-system persistence [41,72,98], which is essential for energy harvesting systems. That is, these prior recovery schemes offer crash consistency only to code within transactions or failure-atomic sections, leaving code outside those regions inconsistent across power failure. Consequently, unlike SweepCache, the prior schemes are not appropriate for an energy harvesting system, i.e., they cannot enable it to take advantage of a data cache in a performant and low-cost manner.

CONCLUSION
This paper presents SweepCache, a novel compiler and architecture co-design approach that enables energy harvesting systems to exploit a volatile cache in a performant and lightweight way. To ensure correct power failure recovery, the compiler generates recoverable regions while the architecture runs them in a failure-atomic way. Thanks to SweepCache's region-level persistence, which cleans up the cache across each region boundary, energy harvesting systems do not have to rely on expensive just-in-time (JIT) checkpointing and can thus fully utilize harvested energy for computation. As a result, SweepCache achieves 3.47x and 3.49x speedups over the state-of-the-art work for two representative energy harvesting traces, respectively.

Figure 1: Architecture of energy harvesting systems; green corresponds to non-volatile parts, yellow to volatile parts.

Figure 2: The high-level view of the SweepCache compiler and architecture

Figure 3: Hiding region-level store persistence latency; (a) shows the no-parallelism case, (b) the dual-buffered case.

Figure 3(b) shows how SweepCache handles three consecutive regions with region-level parallelism. Since the two persist buffers are assigned to the first two regions respectively, Region 2 can start immediately after Region 1 ends, as there is no structural hazard. As shown in the figure, SweepCache effectively hides the persistence latency of Region 1 by overlapping it with the execution of Region 2. Nonetheless, since SweepCache has only two persist buffers, Region 3 must not start executing until Region 1 completes its second phase (s-phase2), to avoid the structural hazard; the wait interval marked in the figure indicates the actual time Region 3 must stall. According to experimental results, the efficiency of SweepCache's parallelism is over 91%, i.e., this wait is insignificant most of the time. Such effective region-level parallelism is the basis for SweepCache to checkpoint the register values into NVM in a performant way, even though it lacks expensive NVFF/NVSRAM. In contrast, JIT-checkpoint designs cannot hide the latency of persisting both the register file and the cache in their backup stage.
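As a hedged software rendering of Figure 3's discipline (in reality this is enforced by SweepCache's hardware, and all names below are hypothetical): region i uses buffer i mod 2, and a region may begin only once its buffer has finished draining for the region two earlier.

```c
/* Hypothetical sketch of dual-buffered region-level parallelism. */
typedef struct persist_buf persist_buf_t;   /* opaque buffer type     */

extern persist_buf_t *pb[2];                /* the two persist buffers */
extern volatile int   buf_empty[2];         /* per-buffer empty bits   */
extern void execute_region(int id, persist_buf_t *b);   /* s-phase1   */
extern void drain_async(persist_buf_t *b, volatile int *empty_bit);
                                            /* s-phase2, overlapped;
                                               sets *empty_bit when done */

void run_regions(int num_regions) {
    for (int i = 0; i < num_regions; i++) {
        int b = i & 1;                     /* region i uses buffer i%2 */
        while (!buf_empty[b]) { }          /* structural-hazard wait:
                                              buffer still draining for
                                              region i-2 (Figure 3)    */
        buf_empty[b] = 0;
        execute_region(i, pb[b]);          /* writebacks go to pb[b]   */
        drain_async(pb[b], &buf_empty[b]); /* drain overlaps region i+1 */
    }
}
```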
Figure 4: Region formation for loops; in (b), the loop body is unrolled two times, making the region size 2x bigger.

Figure 8: Speedups over NVP across different cache sizes

Figure 9: Speedups over NVP across different capacitors with and without fixing the size of NVP's capacitor

Figure 10: Speedups over NVP across different power traces

Figure 12: Analysis on (a) region size and (b) number of stores per region

Figure 13: Breakdown of backup and restore energy consumption normalized to that of NVP

Figure 14: Performance gain and energy saving over NvMR

Figure 15: Cache miss rate for different traces

Figure 16: Analysis on the number of NVM writes normalized to those of NVSRAM when the RFOffice trace is used

SweepCache leverages the size of the persist buffer to guide region partitioning, with the store threshold equal to the buffer size, conservatively assuming that every store leads to a cacheline writeback. During the partitioning process, SweepCache's compiler counts the number of stores while traversing the program's control flow graph (CFG). Once this count reaches the predefined threshold (i.e., the buffer size), a region boundary is introduced to start a new region thereafter. The compiler then analyzes the live-out registers of the region and inserts checkpoint stores to save them into a designated register checkpoint storage in NVM. Additionally, the program counter (PC) is saved at the end of the region, which serves as a recovery point in case the next region is power-interrupted. By reading the value of the PC, SweepCache can roll back to the corresponding recovery point to re-execute the interrupted region.
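A compiler-side sketch of this partitioning pass might look as follows; the IR types and helpers are hypothetical stand-ins, since the real pass operates on the compiler's own CFG representation.

```c
/* Hypothetical sketch of store-threshold region partitioning; all
 * types and helpers are stand-ins for the compiler's real IR/CFG API. */
typedef struct inst inst_t;                 /* opaque IR instruction   */

extern int     is_store(const inst_t *i);
extern inst_t *next_in_cfg(inst_t *i);      /* CFG traversal order     */
extern void    insert_region_boundary_after(inst_t *i);
extern void    insert_liveout_checkpoints(inst_t *i); /* live-out regs
                                               -> NVM checkpoint slots */
extern void    insert_pc_save(inst_t *i);   /* recovery point for the
                                               next region             */

/* Conservatively assume every store may cause a cacheline writeback,
 * so the store threshold equals the persist-buffer capacity. */
void partition_regions(inst_t *entry, int buf_capacity) {
    int stores = 0;
    for (inst_t *i = entry; i != 0; i = next_in_cfg(i)) {
        if (is_store(i) && ++stores == buf_capacity) {
            insert_region_boundary_after(i);
            insert_liveout_checkpoints(i);  /* checkpoint stores       */
            insert_pc_save(i);              /* PC saved at region end  */
            stores = 0;
        }
    }
}
```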

Table 2: Average Number of Power Outages

If renaming is not possible due to structural hazards in NvMR's architectural components, NvMR triggers a backup to persist the registers, dirty cachelines, and other volatile state necessary for the memory renaming. Also, NvMR builds on JIT checkpointing techniques that, in their original form, do not restart a power-interrupted program until the restoration voltage is reached (see Section 2.2), since restarting earlier could encounter WAR dependences. The beauty of NvMR is that it enables the program to keep running even after the JIT backup, without waiting for the capacitor to be charged up to the restoration voltage, because the memory renaming can resolve the WAR dependences. In particular, if power failure happens while the backup voltage is not secured, NvMR must roll back to the latest JIT backup point.