How to Kill the Second Bird with One ECC: The Pursuit of Row Hammer Resilient DRAM

Error-correcting code (ECC) has been widely used in DRAM-based memory systems to address the random errors that worsen as the fabrication process scales. However, ECCs, including the strong form of Chipkill, have not been effective against Row Hammer (RH), which incurs bursts of errors scattered across the whole row, beyond the ECC correction capability. We propose Cube, a novel chip-wise physical-to-DRAM address randomization scheme that leverages the abundant detection capability of on-die ECC (OECC) and the correction capability of Chipkill against RH. Cube allows for synergistic cooperation between ECC, probabilistic RH-protection schemes, and the system with minimal to no modification for each. First, Cube scrambles the rows of each chip using a boot-time key in a way that distributes RH victims to multiple Chipkill codewords. Second, Cube utilizes the newly observed distinct RH error characteristics from real DRAM chips to swiftly diagnose the RH victim rows using the error profile from OECC scrubbing, and even correct them leveraging Chipkill. When combined, Cube decreases the failure probability of PARA and a state-of-the-art RH protection scheme, SRS, by a factor of up to 10^-25. At a target failure probability of 10^-10 per year on a DDR5 rank under an RH threshold of 2K, Cube reduces the performance and table size overhead of SRS by up to 24.3% and 39.9%, respectively.

CCS CONCEPTS
• Security and privacy → Security in hardware.

RH is a type of DRAM fault where repeatedly activating a certain row (aggressor) beyond the RH threshold causes bitflips in its neighboring rows (victims). Because RH causes bursty errors across the whole row, ECC has not been successful against it. For example, even a strong Chipkill is limited to increasing the RH threshold by 2-3× or decreasing the BER (bit error rate) [49]. This is ironic, considering that Chipkill rarely uses its correction capability [71] (e.g., 10^-5 [47,69] chip failure rate) while enduring higher bandwidth and capacity overheads, just for higher reliability. A body of work [22,23,41,63] ingeniously modifies the existing ECC to enhance its detection capability using a MAC (Message Authentication Code). By doing so, they thwart the attacker from exploiting the bitflips for sophisticated uses, such as privilege escalation [94,105] or leaking protected data [55,85]. However, they do not further improve the correction capability and often require higher correction latency.

Meanwhile, orthogonal to ECC, various architectural RH protection solutions [10,42,46,50,52,56,64,74,81,91,92,95,96,98,107,108,113] have been proposed and even adopted in the industry [32,51]. However, RH has been a moving target with a decreasing RH threshold, which defeats the existing RH protections or incurs impractical overheads for safety. For example, probabilistic RH protection schemes such as PARA [52] or SRS [108] provide RH protection with a configurable probability of an RH attack success (e.g., 1% per year). Although they are efficient at a high RH threshold, their performance overhead can grow up to 43% when the target RH threshold is as low as 2K.
To this end, we propose Cube, a novel scheme with chip-wise physical-to-DRAM address-mapping randomization, RH victim diagnosis using OECC, and correction leveraging Chipkill. Cube co-running with a probabilistic RH protection scheme can dramatically decrease the RH success probability, or decrease the overhead for a fixed target security level (failure rate). Cube enables synergistic cooperation between the ECC, any probabilistic RH protection scheme, and the system with negligible modification and overhead.
First, by randomizing the DRAM row addresses of each chip, all RH victims are distributed to multiple Chipkill codewords instead of one. Cube has a chip-wise modular multiplication unit that ensures no two physical address rows are adjacent in two or more chips. Also, to protect the scramble function from reverse engineering, a global Feistel cipher [65,75] is optionally adopted. The address randomization of Cube degenerates a single RH success (S_RH) into a single-chip error per RH victim row, which is correctable by SCC Chipkill. Furthermore, due to its random nature, the attacker is forced to succeed in multiple S_RH instead of two contiguous S_RH to blindly cause a double-chip error (D_RH) in an RH victim row. This enforcement is highly effective against the critical targeted attack that requires memory templating [111] and the denial-of-service (DoS) blinded attacks against an arbitrary row. Nevertheless, as there exists a possibility of two S_RH colliding to cause D_RH, Cube is co-run with other probabilistic RH protection schemes. When combined, each S_RH occurs with a low probability, granting up to a 10^25× improvement in the final security level.
Second, because the RH victim row is exposed until two S_RH collide to cause D_RH, Cube has a chance to detect the RH victims and correct them. In particular, we leverage the newly observed characteristic of RH-induced errors that multiple OECC codewords are bound to have an error once an RH-induced Chipkill uncorrectable error (UE) has occurred. Therefore, Cube analyzes the error profile gathered from the commonly employed OECC scrubbing [17,29,38,93] to diagnose RH victims and correct them before they degenerate into D_RH. Thus, Cube can only be bypassed if the attacker successfully incurs D_RH before its initial S_RH is scrubbed. This additionally boosts the security level of Cube.
§3 first demonstrates the RH-induced error characteristics and distribution, based on the raw RH error data collected from a SoftMC-based [30] FPGA environment with a temperature controller. We also show that the conventional ECCs can only achieve a limited effect against RH. §4 introduces Cube and defines the methodology of chip-wise physical-to-DRAM address randomization, as well as RH diagnosis using the OECC error profile. We mathematically deduce how much security improvement Cube provides in §5, which leads to a 10^25× smaller failure probability over the default PARA, SRS, and SHADOW [107]. We introduce a new filtering optimization for PARA that minimizes the performance overhead. In §6, Cube improves the table size and performance overhead of SRS by up to 39.9% and 24.3% at an RH threshold of 2K with a target failure probability of 10^-10 per year.
The key contributions of this paper are as follows:
• We newly observe the distribution characteristic of RH errors and identify that the conventional ECC is not highly effective against RH.
• We propose Cube, a novel chip-wise physical-to-DRAM address randomization method with an RH diagnosis based on the OECC error profile.
• Cube enables synergistic cooperation with probabilistic RH protection schemes and significantly improves their security level.
[Table 2 (partial): DoS blinded attacks: SGX-Bomb [34], Nethammer [58]. Reliability issue: MOESI-prime [60].]

A refresh (REF) command is issued at every tREFI. It ensures that a row is refreshed once every tREFW. Continuously activating (ACT) a single row (aggressor) may cause bitflips on its neighboring rows (victims), which is called Row Hammer (RH) [52]. When the activation count reaches a certain value (the RH threshold), a large number of errors are incurred, causing an uncorrectable error (UE) even with the ECC.
Once an MC acquires a physical address (PA), it is first divided into a tuple of channel, rank, bank group, bank, row, and column in a processor-specific way [1,2]. The PA goes through another layer of translation at the DRAM-side decoder, as some rows can be shuffled for optimization or spare row techniques [45,99]. We can spatially number each row from top to bottom and call it a DRAM address (DA). Due to these layers of translation, contiguous PAs may not be contiguous in DA. Still, the conventional PA-to-DA mapping is static except for a few special cases, such as sPPR [36,37,89], thus allowing one to reverse engineer it [78]. Table 1 summarizes the terminologies.

Row Hammer Induced Vulnerabilities
We categorize the RH-induced vulnerabilities into three types: i) templated targeted attacks, ii) denial-of-service blinded attacks, and iii) reliability issues that lead to silent data corruption (see Table 2). First, the majority of the critical RH attacks are targeted attacks [20,25,28,35,53,55,86,103,104,105]. Based on memory templating and massaging techniques, an attacker locates or allocates the desired data block and hammers its adjacent aggressors. A key requirement of a targeted attack is that the attacker possesses the DRAM mapping information, either via reverse engineering [78,109] or by referring to documentation [1,2].
Second, the blinded attack aims to hammer and flip arbitrary DRAM locations, incurring a system crash or lockdown [34,58]. However, this form of attack has limited capability compared to the targeted attack.
The last category is the reliability issue, where benign programs cause RH-induced bitflips. Loughlin et al. [60] demonstrated that ordinary workloads can cause RH in a datacenter. Such a case causes silent data corruption even in the absence of a malicious attacker. Although it is unlikely that they follow the worst-case adversarial pattern, we treat them as blinded attacks hereafter.

Prior Row Hammer Mitigation Schemes
There are two types of RH protection guarantees: probabilistic and deterministic. The probabilistic schemes [52,91,107,108] tolerate an RH-induced UE, yet at an extremely low, configurable probability (e.g., 1% probability per year for an aggressor to reach the RH threshold). In contrast, the deterministic schemes [57,74,81,113] guarantee that no aggressor can reach the RH threshold, often embracing a relatively higher overhead. Nevertheless, both types of solutions commonly incur higher area, energy, and/or performance overhead when the target RH threshold is smaller, which has been leading to a scalability issue under the recent trend of a decreasing RH threshold [48].

DRAM Error Correcting Codes
Layers of error correcting codes (ECC) are implemented to address the increasing frequency of DRAM failures from the continuous DRAM technology scaling [93]. k data symbols (e.g., bits) are augmented with (n − k) parity symbols to constitute a codeword with n symbols, referred to as (n, k). Each ECC scheme has a limited correction capability. When the number of errors exceeds it, they are considered uncorrectable errors (UEs).
On-die ECC: On-die ECC (OECC) resides inside each DRAM chip and typically employs a single-bit error correction (SEC) (136, 128) code [16,37,77]. The ECC engine that encodes and decodes the codeword exists at every bank or bank group [16,37]. Figure 1 [47,69]. When a chip failure from a hard fault actually happens, the higher-level reliability, availability, and serviceability (RAS) feature [14,17,38,41,54,106] is initiated, and the device/row is soon replaced before additional errors occur, or unmapped by software. Henceforth, we assume a single-chip-correction (SCC) Chipkill, as the detection capability of Chipkill is not exploited and the Chipkill implementation details are orthogonal to our work. We will later discuss the practicality of adopting our work to different types of ECC in §7.
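As a quick sanity check on the (136, 128) SEC code mentioned above, the following sketch computes the smallest number of parity bits that satisfies the classic Hamming bound for single-error correction. The function name is hypothetical and the check only covers the parity count, not the actual OECC code construction used by any vendor.

```python
def min_sec_parity_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming bound for SEC).

    Each of the (data_bits + r) single-bit error positions, plus the
    error-free case, needs its own syndrome value.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


if __name__ == "__main__":
    k = 128
    r = min_sec_parity_bits(k)
    # For 128 data bits this yields 8 parity bits, i.e. a (136, 128) code.
    print(f"({k + r}, {k})")
```

The result agrees with the (136, 128) code size quoted in the text.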

PROTECTION AGAINST RH ATTACK WITH CONVENTIONAL ECC
This section demonstrates how currently employed ECCs interact with RH-induced errors, and that they only achieve limited success in improving the RH protection guarantee.

Methodology
We use a SoftMC-based [30] FPGA environment with a temperature controller to conduct the RH experiments in a controlled environment (see Figure 2(a)). We modified SoftMC to port it to Alveo U280 [110], a Xilinx FPGA board, to test DDR4 RDIMMs. We use DDR4 DRAM chips from two DRAM vendors manufactured between 2016 and 2021, for a total of 14 DIMMs with 208 chips. To acquire a raw error profile that is not masked by the ECC [76,77], we choose the ones that do not have OECC and also ignore the data from ECC chips. We turn off the REF command to bypass the known in-DRAM Target Row Refresh [26] mechanisms, but finish each iteration of RH experiments inside a tREFW window to avoid retention errors.
At every iteration, we first write the valid data to 32 rows adjacent to the specific aggressor row. Then, we execute a designated number of hammers on the aggressor with consecutive ACT-PREs. We repeat this experiment by changing the data stored in the aggressor and victim rows using all possible intra-column data patterns (8 patterns from {0, 0, 0} to {1, 1, 1} [17,39,48,52,55,112], where each triple gives the data of an aggressor row, a victim row, and a row near the victim, as shown in Figure 2(b)) and conservatively report the union of the detected errors. We did not take the cross-column patterns (e.g., checkerboard) into account. Unlike retention errors [59], it has been reported that a change in the horizontally neighboring cells has a minimal effect on the RH vulnerability [17,39]. Without such a column-oblivious characteristic, attacks like RAMBleed [55] would have been highly constrained. We tested 1 K rows for each DRAM bank, for a total of 16 K rows per DIMM, at a temperature of 50 °C. Figures 3, 4, and 5 represent the two most recent DIMMs from each vendor, but similar trends were observed in other DIMMs too.

RH-induced Error Distribution
Using the profile of raw RH-induced errors, we first observe that each cell has a different degree of RH vulnerability and the cells susceptible to RH are randomly distributed across the row. For the rest of the analysis, we consider that there are 64 OECC codewords in a single row, each 128 bits long. For each row, we define HC_first and HC_OECC as the adjacent hammer count that causes the first single correctable error (CE) and the first double error (UE), respectively, in any of the OECC codewords (codeword_OECC). As each cell in a row has a varying degree of RH vulnerability, each cell fails at a different hammer count. Figure 3 shows that as the hammer count increases, more rows start to fail for both single and double errors. Also, because RH errors are randomly distributed, it takes a higher hammer count to cause a double error in a single codeword_OECC, as multiple randomly distributed single errors have to occur before two of them fall into the same codeword by chance. There is almost a two-fold difference between HC_first and HC_OECC in Figure 3. These observations concur with the prior experimental studies [48].

Due to such characteristics, it is very likely that the RH victim row is sprayed with numerous single-error-corrupted codeword_OECC at the point of UE for OECC or Chipkill. Figure 4(a) shows a probability density function of how many codeword_OECC have at least a single error at each row's HC_OECC. While some rows have only a single codeword_OECC with an error (UE), some have up to 90 codeword_OECC with a single error. When we extend the analysis to Chipkill ECC, such a tendency intensifies. We define HC_Chipkill as the per-row adjacent hammer count that causes the first Chipkill UE. Figure 4(b) depicts that the majority of rows have dozens of codeword_OECC with an error at each row's HC_Chipkill.
Observation-1: When a row has an RH-induced UE for either OECC or Chipkill, multiple OECC codewords are likely to contain errors.
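The intuition behind Observation-1 can be illustrated with a birthday-problem calculation. The sketch below assumes, purely for illustration, that bit errors land uniformly at random among the 64 OECC codewords of one chip's row; real RH error distributions are measured in Figures 3 and 4 and need not be uniform. The function name is hypothetical.

```python
def expected_errors_until_double(num_codewords: int) -> float:
    """Expected number of uniformly placed bit errors until some codeword
    holds two of them (the classic birthday problem).

    Uses E[T] = sum over k >= 0 of P(T > k), where P(T > k) is the
    probability that the first k errors all hit distinct codewords.
    """
    expectation = 0.0
    prob_all_distinct = 1.0  # P(first k errors hit distinct codewords), k = 0
    for k in range(num_codewords + 1):
        expectation += prob_all_distinct
        prob_all_distinct *= 1.0 - k / num_codewords
    return expectation


if __name__ == "__main__":
    # With 64 codewords per row, roughly ten errors accumulate before the
    # first double error, so many codewords already hold a single error.
    print(expected_errors_until_double(64))
```

Under this simplified model, a row that has just reached its first double error already carries single errors in many other codewords, which is the signal that the diagnosis in §4 relies on.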

Conventional ECC Against RH
We discover that because RH causes a burst of errors across a whole row, even a stronger form of ECC such as Chipkill or Double-Chipkill (double-chip-correction [40]) has limited protection against RH. Figure 5 shows the number of rows with UE at each hammer count. For both DRAM vendors, even Double-Chipkill has limited success in increasing the RH threshold, the point of RH failure, by an average of 2.6×. Considering the recent trend of intensified RH vulnerability, simply increasing the RH threshold to some extent is not effective. For example, even though OECC has been adopted since LPDDR4, the RH threshold has still decreased by 35× (139K [52] to 4K [48]). Moreover, Cojocar et al. [17] demonstrated that when an attacker has full knowledge of the employed ECC scheme, it can cause tailored errors that bypass the detection capability of ECC schemes. Other optimizations such as static DRAM address remapping [49] are limited to decreasing the BER.
Observation-2: Even a stronger ECC achieves only limited success, increasing the RH threshold by less than 3×.

Threat Model
We assume the following threat model for the rest of the paper:
(1) The attacker has full knowledge of which PA rows are inside a bank.
(2) The attacker can incur a stream of activations at the maximum frequency.
(3) When an aggressor row reaches the RH threshold in activation count, its victims experience enough bitflips to cause a UE even with the OECC and a conventional 18-chip SCC Chipkill.
(4) A blinded attack succeeds when two or more symbols of an arbitrary victim experience the RH threshold number of adjacent hammers.
(5) A targeted attack succeeds when that happens to a specific victim row.
(6) The large-granularity non-RH hard errors that trigger Chipkill correction are not assumed to exist when an RH error occurs. Rare as they are, hard errors are soon removed by the RAS feature and maintenance [2,79,100].

CUBE
Cube consists of two techniques: (1) DRAM PA-to-DA mapping scramble and (2) RH diagnosis using the OECC error profile. Specifically, Cube scrambles the PA-to-DA mapping of each DRAM chip uniquely such that every RH victim row of each chip belongs to a different Chipkill codeword for a single aggressor row. This scrambling makes even a successful single RH attack (S_RH) correctable, as described in Figure 6. Cube also proactively diagnoses a row as an RH victim using the OECC error profile collected from the now JEDEC-standardized [37] OECC scrubbing and corrects it using Chipkill.
While the OECC scrubbing is only effective against soft single-cell faults and not RH burst errors, the collected error profile and the RH error characteristic (Observation-1) allow for RH victim diagnosis at a small cost. Without such an approach, an S_RH victim can only be corrected when it is accessed by chance. MC-side Chipkill scrubbing can provide a similar advantage, yet OECC scrubbing can be much more frequent, exploiting the rank and bank group parallelism.

However, when two S_RH collide by chance, two different chips can be corrupted on a single Chipkill codeword (D_RH), and Cube itself can only provide limited probabilistic safety. While such a collision has to occur before the S_RH victim is diagnosed and corrected, the frequency of scrubbing is limited. As such, we choose to use Cube together with any existing probabilistic scheme (e.g., SRS [108] and PARA [52]). Because each S_RH now occurs with a low probability, Cube leads to up to a 10^25× reduction in the combined probability of RH-induced UE (§5). Note that because deterministic schemes require each S_RH to be always prevented, Cube does not have a synergistic effect with such schemes.

Scramble Function Requirement
The PA-to-DA mapping scramble function $f_i$ of chip $i$ (out of the 18 chips constituting Chipkill) must fulfill the following three requirements. We define $f_i(\mathrm{PA}) = \mathrm{DA}$ and its inverse $f_i^{-1}(\mathrm{DA}) = \mathrm{PA}$.

R-i) No collision of Set_victim: For every PA (e.g., PA_agg), we can define its Set_victim, the set of other PAs (e.g., PA_vic) physically adjacent to it. For a PA_agg, its Set_victim can be expressed as $f_i^{-1}(f_i(\mathrm{PA}_{agg}) \pm 1)$ with $i$ ranging from 1 to 18 (see Figure 6(b)). All PA_vic in the same Set_victim must not collide with each other. Otherwise, hammering a single PA_agg can result in a two-chip failure at the collided PA_vic.

R-iii) Efficient hardware: The mapping function logic exists in each chip and is on the critical path for the DRAM row commands (e.g., ACT). Thus, some forms of permutations based on AES [19] or the Fisher-Yates shuffle [24] are too slow and/or large, respectively.

Modular multiplication of the PA by a different constant $k_i$ for each chip $i$ to get the DA fulfills all the above requirements. Let $N_{row}$ denote the number of rows in a bank. When all $k_i$ and $(N_{row} - k_i)$ do not collide with each other for all chips, R-i) Set_victim collision is always avoided. When the non-adjacent RH or a blast radius [53] of two is taken into account, we must also consider the double key ($2 \times k_i$); for a blast radius of three, the triple key, and so on. When all the $k_i$ are odd numbers, R-ii) no aliasing occurs owing to Euler's totient theorem. Because we only need constant multiplication instead of generalized multiplication, and modulo $N_{row}$ is a simple shift, R-iii) the cost of $f_i$ is low (§4.4). We call this the vanilla scramble function of Cube, expressed as follows.

$$f_i(\mathrm{PA}) = (k_i \times \mathrm{PA}) \bmod N_{row}$$
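The requirements above can be checked on a toy instance. The following is a minimal sketch, assuming a small bank of 2^10 rows, 18 hypothetical odd per-chip constants, and wrap-around adjacency at the bank edges; none of these values come from the paper, and the helper names are illustrative only.

```python
N_ROW = 1 << 10                  # toy bank size (a power of two)
KEYS = list(range(3, 39, 2))     # 18 hypothetical odd constants: 3, 5, ..., 37


def scramble(pa: int, key: int) -> int:
    """PA -> DA by modular multiplication with a per-chip constant."""
    return (key * pa) % N_ROW


def unscramble(da: int, key: int) -> int:
    """DA -> PA using the modular inverse of the key (exists for odd keys)."""
    return (pow(key, -1, N_ROW) * da) % N_ROW


def victim_pas(pa_agg: int, key: int) -> set:
    """PAs whose DA is adjacent (+1 / -1) to the aggressor's DA in one chip."""
    da = scramble(pa_agg, key)
    return {unscramble((da + 1) % N_ROW, key),
            unscramble((da - 1) % N_ROW, key)}


def is_alias_free(key: int) -> bool:
    """R-ii: the mapping is a permutation of all row addresses."""
    return len({scramble(pa, key) for pa in range(N_ROW)}) == N_ROW


def has_victim_collision(pa_agg: int) -> bool:
    """R-i violated if two chips share a victim PA for the same aggressor."""
    seen = set()
    for key in KEYS:
        victims = victim_pas(pa_agg, key)
        if seen & victims:
            return True
        seen |= victims
    return False


if __name__ == "__main__":
    print(all(is_alias_free(k) for k in KEYS))
    print(any(has_victim_collision(pa) for pa in range(N_ROW)))
```

In this toy setting, the victims of an aggressor in chip $i$ sit at PA ± $k_i^{-1}$, so choosing constants that are pairwise distinct and not each other's negatives modulo the bank size keeps the victim sets disjoint across chips.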

Diagnosing RH using OECC Error Profile
Cube diagnoses a row as an RH victim when it contains multiple OECC codewords with errors, exploiting Observation-1. The OECC error profile can be easily collected from the now JEDEC-standardized OECC scrubbing [37]. Within a scrubbing window, every OECC codeword is read, corrected, and written back. While the OECC correction itself cannot handle the multiple errors from RH in standard DRAM, Cube can exploit the information on how many OECC codewords have errors to diagnose the RH victim. When diagnosed, Cube can use Chipkill to correct the RH victim unless it is a D_RH victim; the correction is performed proactively within the window, instead of waiting until the victim addresses are accessed by applications. Because the attacker has to succeed in D_RH within a scrubbing window, Cube further improves the security level by up to 10^6× (§5).
To minimize false-positive detection of RH victims (unnecessary Chipkill correction) and avoid false negatives (missed RH victims), Cube adopts a diagnosis threshold (Diag_TH) and also merges the per-chip error profiles to get a multi-chip full image of each victim row. There is a possibility of non-RH errors fooling Cube or, conversely, of an actual RH victim with a small number of errors escaping Cube's detection. For example, one OECC codeword per row may consistently correct an error in an extreme case of single-cell faults [16]. On the other hand, when we analyze the OECC error profile at a single-chip granularity, only one OECC codeword may contain an error in an RH victim row (Figure 7(a)). Instead, Cube gathers the OECC error profile from all chips constituting a Chipkill (avoiding false negatives), and only considers a row an RH victim when the number of OECC codewords with errors exceeds a certain Diag_TH (reducing false positives). Setting Diag_TH to 6 can guarantee no false negatives, which is enough to differentiate non-RH scaling faults (see Figure 7(b)).
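The merge-then-threshold diagnosis described above can be sketched as follows. The data layout (one dictionary per chip, keyed by row address) and the function name are assumptions made for illustration, and "exceeds" is read here as a strict comparison.

```python
DIAG_TH = 6  # diagnosis threshold quoted in the text


def diagnose_rh_victims(per_chip_profiles, diag_th=DIAG_TH):
    """Merge per-chip OECC error profiles and flag likely RH victim rows.

    per_chip_profiles: list with one dict per chip, mapping a row address
    to the number of OECC codewords that reported an error during scrubbing.
    Returns the sorted row addresses whose aggregate count exceeds diag_th.
    """
    merged = {}
    for profile in per_chip_profiles:
        for row, erroneous_codewords in profile.items():
            merged[row] = merged.get(row, 0) + erroneous_codewords
    return sorted(row for row, total in merged.items() if total > diag_th)


if __name__ == "__main__":
    profiles = [
        {0x10: 1},            # chip 0: isolated single-cell fault
        {0x10: 1, 0x20: 4},   # chip 1
        {0x20: 3},            # chip 2
    ]
    # Row 0x10 totals 2 (below threshold); row 0x20 totals 7 (flagged).
    print(diagnose_rh_victims(profiles))
```

Aggregating across chips before thresholding is what lets a row with few errors per chip still be diagnosed, while an isolated faulty cell stays below the threshold.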

Confidentiality of Scramble Function
The security advantage of Cube relies on the confidentiality of the scramble function. In other words, the collision of two S_RH to cause D_RH must happen by chance. However, if the attacker knows which two aggressor PAs are adjacent to a victim PA in two different chips, a tailored attack that causes D_RH after two consecutive S_RH is possible. In fact, there exists a side channel that leaks the Set_victim of some aggressor PA. Nevertheless, we demonstrate that such a side channel is not practical against probabilistic schemes with dynamic remapping, or even static schemes when Cube is augmented with a lightweight encryption feature.

i) Static scheme: When the probabilistic scheme co-running with Cube does not dynamically alter the PA-to-DA mapping [32,52], the following side channel can be critical. As Figure 8 illustrates, the attacker can succeed in an S_RH with PA_agg1 and read the whole memory space accessible to the attacker. If the victim rows (PA_vic) happen to be accessible, the attacker can look for a high access latency, which can indicate the Chipkill correction latency [17]. Then, the attacker can know that PA_vic at some chip (e.g., chip1) can be hammered by activating PA_agg1. We refer to this procedure as a query on PA_agg1. With repeated queries on different aggressors, the attacker at some point, by chance, can hammer the same PA_vic with a different aggressor PA_agg2 at a different chip (e.g., chip2). Then, the attacker can execute a tailored attack targeting only PA_agg1 and PA_agg2. However, this scenario does not empower the attacker any further, as the probability of finding such two aggressors is the same as the probability of a D_RH collision that penetrates Cube. The mathematical derivation of such a probability is provided in §5.
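Still, with the vanilla scramble function of Cube, a single query exposes the chip-wise multiplication constant $k_i$, granting the attacker the full PA-to-DA mapping. For each query on PA_agg, subtracting the two victim PAs from the same chip gives out the inverse of $k_i$ as follows, where $f_i$ is the scramble function of chip $i$ and $N_{row}$ is the number of rows in a bank.

$$f_i^{-1}(f_i(\mathrm{PA}_{agg}) + 1) - f_i^{-1}(f_i(\mathrm{PA}_{agg}) - 1) \equiv 2 \times k_i^{-1} \pmod{N_{row}}$$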
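The leak described above can be reproduced on a toy example. This sketch assumes the vanilla mapping DA = (k × PA) mod N with a hypothetical key of 37 and a 2^16-row bank, and it assumes the attacker knows which victim is the upper neighbor; the function names are illustrative only.

```python
N_ROW = 1 << 16  # toy bank size (a power of two)


def victims_of(pa_agg: int, key: int):
    """Victim PAs (upper, lower neighbor in DA space) under the vanilla map."""
    inv = pow(key, -1, N_ROW)
    da = (key * pa_agg) % N_ROW
    return (inv * (da + 1)) % N_ROW, (inv * (da - 1)) % N_ROW


def leaked_difference(pa_agg: int, key: int) -> int:
    """Difference of the two victim PAs; independent of the aggressor."""
    upper, lower = victims_of(pa_agg, key)
    return (upper - lower) % N_ROW


def recover_key_candidates(difference: int):
    """Invert difference = 2 * key^-1 (mod N_ROW).

    Halving modulo a power of two leaves two candidates for key^-1,
    hence two candidate keys.
    """
    half = difference // 2
    inverse_candidates = [half, half + N_ROW // 2]
    return [pow(c, -1, N_ROW) for c in inverse_candidates]


if __name__ == "__main__":
    secret_key = 37  # hypothetical per-chip constant
    diff = leaked_difference(1234, secret_key)
    print(secret_key in recover_key_candidates(diff))
```

Because the difference does not depend on the aggressor address, a single successful query narrows the constant to a couple of candidates, which motivates the encryption layer discussed next.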
We can address this critical vulnerability by simply adding a lightweight encryption scheme on the vanilla Cube scramble function as follows, where $k_i$ is the chip-wise constant and $N_{row}$ is the number of rows in a bank.

$$f_i(\mathrm{PA}) = (k_i \times E(\mathrm{PA})) \bmod N_{row}$$

$E(\mathrm{PA})$ and $D(\mathrm{PA})$ denote encryption and decryption, respectively. Adding such encryption does not affect the requirements of Cube, as long as the encryption also does not cause aliasing. This is because, unlike the chip-wise constants $k_i$, the encryption/decryption is done globally (e.g., at the MC). Implementation details are described in the following §4.4.

ii) Dynamic remapping schemes: With the probabilistic schemes that rely on randomized row address remapping for RH protection [91,107,108], the PA-to-DA mapping dynamically changes. Such dynamic randomization, which changes the PA-to-DA mapping over time, can be denoted as a function of the PA and time. This can also be considered a part of the scramble function without hurting the requirements of Cube. Similar to $E(\mathrm{PA})$, this prevents the attacker from reverse engineering the PA-to-DA mapping.

Cube Implementation
Scramble function: The Cube scramble function requires a small constant multiplication logic at each chip, and also a lightweight encryption logic at the MC if it is not the vanilla version (Figure 9). As long as the encryption or dynamic remapping stays hidden from the attacker, the chip-wise constants $k_i$ can be fixed at design time, only requiring a constant modular multiplication unit instead of a generalized one. Also, the modulo by the number of rows in a bank, $N_{row}$, is a simple shift, because $N_{row}$ is always a power of two (e.g., 2^16, 2^17).
The MC-side global encryption scheme needs to be alias-free (R-ii)) and efficient in hardware (R-iii)). A Feistel cipher [11,61] can easily fit our needs. In the field of cryptography, a pseudo-random permutation in a small domain (PRP) or format-preserving encryption (FPE) [11,101] exists, where the encryption method defines a permutation of a domain onto itself without aliasing. In particular, Feistel [11,61] (or Benes [5], Thorp [68])-based ciphers are widely used due to their efficiency in hardware.
We follow the low-latency block cipher (LLBC) implementation [82] of the S-box and Feistel cipher. A single round of the Feistel cipher takes 17 bits as input, splitting it into two segments. One segment goes through the S-box and P-box (SPN). The S-box takes a 9-bit segment and a 9-bit boot-time generated round-unique key as input, and outputs 8 bits. Referring to [82], we configured our S-box to incur the latency of four consecutive XOR gates. We used a four-round LLBC as our baseline. Prior works [13,80] have proposed effective shortcut attacks that can neutralize the multi-level LLBC or even the key switching mechanism of CEASER [82] or CEASER-S [83]. However, such attacks are not effective in our situation. We provide a detailed discussion of the security of our encryption scheme in Appendix A.
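To illustrate why a Feistel structure is alias-free regardless of its round function, here is a toy balanced Feistel network over a 16-bit address. It is not the LLBC design from [82]: the 8/8 split, the arithmetic round function, and the four round keys are all hypothetical choices made only so the example is short and checkable.

```python
HALF_BITS = 8
MASK = (1 << HALF_BITS) - 1
ROUND_KEYS = [0x3A, 0xC5, 0x17, 0x9E]  # hypothetical boot-time round keys


def round_fn(half: int, key: int) -> int:
    """Arbitrary toy mixing function; it does not need to be invertible."""
    x = (half ^ key) & MASK
    x = (x * 167 + 13) & MASK
    return x ^ (x >> 3)


def feistel_encrypt(pa: int, round_keys=ROUND_KEYS) -> int:
    """Permute a 16-bit address with a balanced Feistel network."""
    left, right = pa >> HALF_BITS, pa & MASK
    for key in round_keys:
        left, right = right, left ^ round_fn(right, key)
    return (left << HALF_BITS) | right


def feistel_decrypt(ct: int, round_keys=ROUND_KEYS) -> int:
    """Undo feistel_encrypt by running the rounds in reverse order."""
    left, right = ct >> HALF_BITS, ct & MASK
    for key in reversed(round_keys):
        left, right = right ^ round_fn(left, key), left
    return (left << HALF_BITS) | right


if __name__ == "__main__":
    domain = range(1 << (2 * HALF_BITS))
    images = {feistel_encrypt(pa) for pa in domain}
    print(len(images) == len(domain))  # every address maps to a unique one
```

Each round only XORs one half with a function of the other half, so every round is invertible and the whole network is a permutation, which is the alias-free property required by R-ii).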
While the latency of the mapping function lies on the critical path of the DRAM commands requiring a row address (e.g., ACT), the total latency of the Cube scramble function is less than 0.2 ns based on our synthesis result using TSMC 40nm. A single-round LLBC results in a 70 ps latency, while a four-round LLBC incurs a 100 ps latency. Such a small gap between the two is mainly attributed to the fact that a multi-round LLBC degenerates to a linear function [13], which cancels out many of its XOR operations when optimized. The four-round version still performs better in the avalanche test of the prior study [82] because it has a longer key length. The small latency overhead of the scramble function concurs with [82]. We consider this to be hidden in the timing margin of the JEDEC commands, in a similar manner to the OECC encoding/decoding, sPPR translation, or prior works with floating point multiplication [31]. Even when we increased tRCD by one cycle (i.e., 0.4 ns in DDR5-4800), the additional overhead in all tested workloads for an RH threshold of 2K was less than 0.5% in weighted speedup.

Similar to the JEDEC-standardized OECC scrubbing, Cube assumes that a small portion of all-bank refresh commands (REFab) are repurposed for scrubbing. We assume an OECC engine exists per bank group. Within a single REFab command of 295 ns, each bank group can read and write 16 OECC codewords out of the 64 codewords per 8192-bit row, which takes 208 ns. Then, we can check 128 OECC codewords at a time. For a 32Gb DRAM of 2^28 OECC codewords, we set 10 minutes as our default scrubbing window, incurring a 1.37% reduced tREFI (i.e., 112 additional REFab per tREFW). Thus, 112 ÷ (64 ÷ 16) = 28 rows are scrubbed at every tREFW. The scrubbing result, consisting of the scrubbed address (17-bit) and the number of OECC codewords with errors per row (3-bit), is stored in a (17 + 3) × 28 = 560-bit (70-byte) buffer at each chip. The buffered results of all chips are collected by the MC at every tREFW, triggering Chipkill correction when the aggregate number of errors exceeds Diag_TH.
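The scrubbing figures quoted above can be cross-checked with a few lines of arithmetic. This sketch assumes a 32 ms tREFW and 8192 refresh commands per tREFW, which are typical DDR5 values that the text does not state explicitly; all variable names are illustrative.

```python
T_REFW_S = 0.032           # assumed refresh window (32 ms)
REFS_PER_TREFW = 8192      # assumed number of REF commands per tREFW

TOTAL_CODEWORDS = 2 ** 28        # 32Gb DRAM / 128-bit OECC codewords
CODEWORDS_PER_REFAB = 128        # checked per repurposed REFab
EXTRA_REFAB_PER_TREFW = 112      # additional REFab per tREFW
CODEWORDS_PER_ROW = 64
CODEWORDS_PER_BG_PER_REFAB = 16
ADDRESS_BITS, COUNT_BITS = 17, 3

# Time needed to scrub every codeword once.
trefw_needed = TOTAL_CODEWORDS / (EXTRA_REFAB_PER_TREFW * CODEWORDS_PER_REFAB)
window_s = trefw_needed * T_REFW_S

# Fraction of extra refresh commands (the quoted tREFI reduction).
overhead = EXTRA_REFAB_PER_TREFW / REFS_PER_TREFW

# Rows scrubbed per tREFW and the per-chip result buffer size.
refab_per_row = CODEWORDS_PER_ROW // CODEWORDS_PER_BG_PER_REFAB
rows_per_trefw = EXTRA_REFAB_PER_TREFW // refab_per_row
buffer_bits = (ADDRESS_BITS + COUNT_BITS) * rows_per_trefw
buffer_bytes = buffer_bits // 8

if __name__ == "__main__":
    print(f"scrubbing window: {window_s / 60:.2f} minutes")
    print(f"refresh overhead: {overhead:.2%}")
    print(f"rows per tREFW: {rows_per_trefw}, buffer: {buffer_bytes} bytes")
```

Under these assumptions the numbers line up with the text: the window comes out at about ten minutes, the overhead at about 1.37%, and the buffer at 70 bytes.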

COOPERATION WITH EXISTING PROBABILISTIC PROTECTION SCHEMES
In this section, we mathematically derive how Cube can cooperate with other probabilistic RH protection schemes. In particular, we demonstrate how Cube improves the security guarantee of a representative scheme, PARA [52], and a state-of-the-art scheme, SRS [108]. The RH attack success probability of blinded and targeted attacks is based on our threat model (see §3.4).

RH Attack Success Probability
Cube forces an attacker to hammer multiple rows to succeed in an RH attack. When only a single PA row is hammered, the PA of each victim in each chip is different under Cube, which is correctable. We newly define the number of PA aggressors that the attacker hammers throughout its attack. We sweep this value from 2 to the total number of rows in a bank and report the worst failure probability as a result. For a generalized probabilistic RH-protection scheme, we define the per-window probability as the probability that a single aggressor reaches the RH threshold in a tREFW window under the scheme. The per-window probabilities for PARA and SRS are provided in §5.2 and §5.3. We also define the number of tREFW windows in which an aggressor row is hammered within a year. From these, we calculate the probability of a single aggressor reaching the RH threshold in a year ($P_1$).

Consider a randomly chosen row, PA 0xA of the first chip. The probability of a row adjacent to PA 0xA being hammered up to the RH threshold ($P_2$) is calculated conservatively, considering the number of aggressors and two victims per aggressor. We can reflect the impact of the blast radius by modifying $P_2$ with a blast-radius parameter, which is defined as $\sum_{i=1}^{\text{blast radius}} \frac{1}{2^{i-1}}$ [113].

We define the number of chips that constitute the Chipkill (e.g., 18). As the adjacent rows of each chip are unique in Cube, we can calculate the probability of PA 0xA's aggressors being activated up to the RH threshold at multiple chips ($P_3$). The acquired $P_3$ is the attack success probability for a targeted attack. For a blinded attack, we conservatively assume the number of victim rows across all chips to be 2 × (number of aggressors) × (number of chips). Assuming that the RH success probability of each row is independent, which is also conservative, we obtain the RH success probability over all victim rows ($P_4$).

While the mathematical derivation so far has regarded only the Chipkill scramble of Cube, RH diagnosis using the OECC error profile further improves the final failure probability. For a default OECC scrubbing window of 10 minutes, the probability of a single aggressor reaching the RH threshold in a single scrubbing window ($P'_1$) is then derived; its expression involves the number of scrubbing windows in a year.
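Two generic building blocks recur in the derivation above: an event happening at least once over many independent windows, and an event coinciding in at least two of the chips. The sketch below illustrates both under a plain independence assumption. The function names and the binomial form are assumptions of this sketch and are not presented as the paper's equations.

```python
import math


def prob_at_least_once(p_window: float, n_windows: int) -> float:
    """P(an event with per-window probability p_window occurs at least once
    in n_windows independent windows) = 1 - (1 - p)^n.

    log1p/expm1 keep precision when p_window is extremely small.
    """
    if p_window >= 1.0:
        return 1.0
    return -math.expm1(n_windows * math.log1p(-p_window))


def prob_at_least_k_of_n(q: float, n: int, k: int = 2) -> float:
    """P(at least k of n independent chips are hit, each with probability q)."""
    return sum(math.comb(n, j) * q ** j * (1.0 - q) ** (n - j)
               for j in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical numbers purely for illustration.
    yearly = prob_at_least_once(1e-12, 10 ** 9)
    double_chip = prob_at_least_k_of_n(yearly, 18)
    print(yearly, double_chip)
```

For small per-chip probabilities q, the double-chip probability behaves like C(18, 2) × q², which conveys why requiring a collision in two chips lowers the failure probability so sharply.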

Cube-equipped SRS (Cube_SRS)
Probability Calculation: SRS [91] executes a row-swap operation within a single bank whenever an aggressor is activated by    , where  is a configurable parameter.Referring [108], the   for SRS is as follows.
* : maximum number of row-swap operations per tREFW window

  is applied to all rows due to the random nature of row-swap, implying that   is   , and 1)∼ (7).

Security Improvements: Cube-equipped SRS (Cube SRS ) shows multiple orders of magnitude improvements in security compared to the original SRS, regardless of the   and    . Figure 11(a) shows the RH attack success probability over time of the original SRS, Cube SRS without RH diagnosis (C SRS ), and a full Cube SRS . From a different view, Cube SRS requires a relaxed  compared to SRS when a desired attack success probability is fixed, as shown in Figure 11(b). A smaller  leads to smaller performance and area overheads due to less frequent swapping and fewer entries for the tracker and row indirection table.

Cube-equipped PARA (Cube PARA )
Probability Calculation: PARA [52] randomly samples an activation command with a probability of  and sends a victim row refresh. Following [74], we obtain the following recurrence equation for [], which denotes the probability of the aggressor reaching   activations without its victims being refreshed, in a total of  activations.
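As an illustration of this kind of recurrence, the sketch below uses one standard form: the probability that a run of consecutive unsampled activations of a given length appears somewhere within n activations. It is not necessarily the exact equation of [74]. It assumes every activation targets the same aggressor and that any sampled activation refreshes the victims; the function and argument names are hypothetical.

```python
def para_reach_probability(p: float, threshold: int, total_acts: int) -> float:
    """P[n]: probability that, within n activations of one aggressor, some
    run of `threshold` consecutive activations passes without being sampled
    (each activation is sampled independently with probability p)."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    q_run = (1.0 - p) ** threshold          # one full unsampled run
    prob = [0.0] * (total_acts + 1)         # P[n] = 0 for n < threshold
    for n in range(threshold, total_acts + 1):
        if n == threshold:
            prob[n] = q_run
        else:
            # The run first completes at activation n: a sample lands at
            # activation n - threshold, the next `threshold` activations are
            # all unsampled, and no run completed before that sample.
            prob[n] = prob[n - 1] + (1.0 - prob[n - threshold - 1]) * p * q_run
    return prob[total_acts]


# Hypothetical parameters: sampling rate, threshold, activations per tREFW.
print(para_reach_probability(p=0.01, threshold=2000, total_acts=600_000))
```

Evaluating the recurrence up to the number of activations that fit in one tREFW yields a per-window probability under these assumptions.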
Solving this recurrence equation to the end of a single tREFW gives us   of PARA. In Equations ( 1)∼( 7), the   _ is 1   ×  and we report the worst case after sweeping   .

Security Improvement: Cube-equipped PARA (Cube PARA ) also demonstrates superior security compared to the original PARA. Figure 12(a) and (b) show similar trends of a lower attack success probability and a smaller , respectively, for Cube PARA without RH diagnosis (C PARA ) and Cube PARA . A smaller  leads to a smaller performance overhead, because the victim row refresh is sent less frequently.

Projection-Based Filtering: Inspired by the prior RH protection studies that utilize projection-based filtering [81,113], we also propose a Count-sketch [7] based filtering technique that can be applied to PARA, orthogonal to Cube. We expect this optimization to filter out the majority of the benign workloads and reduce the performance overhead of PARA. We allocate a 512-entry [81] Count-sketch table (SketchTBL) per bank, with each entry responsible for (  ⁄ 512) rows (e.g., 128). At every activation, the counter selected by the hash is incremented until it reaches a predefined threshold. When it does, Cube PARA is turned on only for the 128 rows belonging to the entry. We reset the SketchTBL at every tREFW window.
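The filtering flow described above can be sketched as follows. This is a simplified, single-hash counter table and not a full Count-sketch (which uses several hash rows with signed updates); the hash constant, the default threshold, and the class and method names are assumptions made for illustration.

```python
class SketchTBL:
    """Toy per-bank activation filter: one counter per group of rows."""

    def __init__(self, num_rows: int = 65536, entries: int = 512,
                 threshold: int = 256, odd_mult: int = 0x9E3779B1):
        self.num_rows = num_rows
        self.entries = entries
        self.threshold = threshold        # hypothetical trigger value
        self.odd_mult = odd_mult          # hypothetical odd hash multiplier
        self.counters = [0] * entries

    def index(self, row: int) -> int:
        # An odd multiplier modulo a power of two permutes the residues, so
        # each entry covers exactly num_rows / entries rows (128 here).
        return (row * self.odd_mult) % self.entries

    def on_activate(self, row: int) -> bool:
        """Count one activation; return True if sampling applies to it."""
        i = self.index(row)
        if self.counters[i] < self.threshold:
            self.counters[i] += 1
        return self.counters[i] >= self.threshold

    def reset(self) -> None:
        """Clear all counters; intended to be called at every tREFW."""
        self.counters = [0] * self.entries


tbl = SketchTBL(threshold=4)
hits = [tbl.on_activate(7) for _ in range(5)]
print(hits)
```

With a threshold of 4, the first three activations of row 7 pass unfiltered and sampling applies from the fourth activation on, until `reset()` is called.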

EVALUATION
We evaluate the overhead of PARA [52], SRS [108], SHADOW [107], Cube PARA , Cube SRS , and Cube SHA (Cube-equipped SHADOW) for an RH attack success probability target of 10^-10 per DDR5 rank under a year of continuous RH attack. Although Cube is especially strong against    , we conservatively evaluate each scheme with regard to   .

Experimental Setup
Performance overhead is evaluated using the McSimA+ [4] cycle-accurate simulator. Table 4 summarizes the system configuration used for the experiments. From SPEC CPU2017, we extract 100M representative traces using SimPoint [97]. Each trace is executed in rate mode with 16 threads. Using these traces, we also construct the multi-programmed traces mix-high and mix-blend, each with 16 traces that are memory-intensive or evenly selected based on MPKI, respectively. We additionally construct 32 mix-random traces, each with 16 randomly selected traces. We execute 800M instructions in total for each experiment.
Based on the analysis in §5 and similar calculations on SHADOW, we use the configurations of PARA, SRS, and SHADOW, and the variations of Cube PARA , Cube SRS , and Cube SHA , shown in Tables 5 and 6. We assume PARA uses a nearby row refresh (NRR) command [74], where the MC sends the address of the aggressor and the DRAMs calculate their victims. We do not evaluate Scale-SRS [108], which pins aggressors to the LLC, because it is ideally a deterministic scheme. PARA, SRS, and SHADOW are assumed to be equipped with an OECC and conventional SCCDCD Chipkill, while Cube PARA , Cube SRS , and Cube SHA are equipped with an OECC and Cube. As for the SketchTBL-filtering for PARA, we set 512 entries per bank [81] and use an empirically chosen best ℎ   for each configuration. We sweep   from 1K to 16K for the majority of experiments for PARA, SRS, and SHADOW. The baseline SRS does not scale to 1K, as it already incurs up to 57% performance degradation at 2K. While some prior studies assume HC first as low as a few hundred, HC Chipkill is likely to be over 1K, which is   according to our threat model.

Overhead Analysis
Cube SRS : As Figure 13(a) suggests, Cube SRS reduces the performance overhead of SRS by up to 18.6 percentage points for rate mode when   is 2K. For the multi-programmed workloads, Cube SRS has 24.3 percentage points smaller weighted speedup overhead compared to SRS at   of 2K.
The table size of the Misra-Gries [67] tracker and row indirection table (RIT) of SRS is also reduced by up to 39.9% in Cube SRS at   of 2K. This is because the number of entries for both the tracker and the RIT is defined by the parameter . Figure 14 summarizes the reduction in the number of table entries of the Cube SRS variations compared to the baseline SRS.

Cube SHA : Figure 13(b) shows that Cube SHA improves the performance overhead of the original SHADOW by up to 3.9 percentage points at the   of 1K. This is equivalent to a ×0.41 reduction in the overhead.

Cube PARA : As Figure 13(c) suggests, Cube PARA improves the performance overhead of PARA by 1 percent at the   of 1K. Moreover, when the CST-filtering is applied, the weighted speedup overhead of Cube PARA is less than 1%. Cube PARA without filtering is not as effective as Cube SRS because the security of PARA is too sensitive to its sampling rate, . Although Cube PARA provides a security level that is dozens of orders of magnitude higher at fixed  and   configurations, this does not directly translate to a proportionally lower sampling frequency when the target security is fixed.
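For readers less familiar with the metric, the sketch below shows how a weighted speedup overhead and a percentage-point gap between two schemes can be computed. It assumes the common definition of weighted speedup (the sum of per-thread shared-to-alone IPC ratios) and a baseline without any RH mitigation; all IPC values and scheme labels are made up for illustration.

```python
from typing import Sequence


def weighted_speedup(ipc_shared: Sequence[float],
                     ipc_alone: Sequence[float]) -> float:
    """Sum over threads of IPC when co-running divided by IPC when alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("one alone-IPC value is needed per thread")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def overhead_percent(ws_scheme: float, ws_baseline: float) -> float:
    """Weighted speedup lost relative to the unprotected baseline, in %."""
    return 100.0 * (1.0 - ws_scheme / ws_baseline)


# Hypothetical four-thread mix.
ipc_alone = [2.0, 1.0, 0.5, 1.6]
ws_base = weighted_speedup([1.0, 0.5, 0.25, 0.8], ipc_alone)      # no mitigation
ws_a = weighted_speedup([0.8, 0.4, 0.2, 0.64], ipc_alone)         # scheme A
ws_b = weighted_speedup([0.9, 0.45, 0.225, 0.72], ipc_alone)      # scheme B

gap_pp = overhead_percent(ws_a, ws_base) - overhead_percent(ws_b, ws_base)
print(f"A: {overhead_percent(ws_a, ws_base):.1f}%  "
      f"B: {overhead_percent(ws_b, ws_base):.1f}%  "
      f"gap: {gap_pp:.1f} percentage points")
```

A gap expressed in percentage points is a difference of two overheads, which is distinct from the ratio between them.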

Sensitivity Study
Scrubbing Window: There is a trade-off between security and performance with regard to the scrubbing window parameter of Cube. A shorter scrubbing window limits the attacker's window further and thus improves the security, but requires additional REFab commands for scrubbing operations. Figure 15 illustrates the effect of this trade-off; we chose 10 minutes as the default for the main experiments.

SketchTBL: We empirically choose an optimal ℎ   for each Cube PARA configuration of different   . There exists a performance trade-off with regard to ℎ   . When it is low, the CST entry is easily filled and triggers the sampling, but the sampling rate is not much affected by the new effective  of (  −2×ℎ   ). When it is too high, the SketchTBL entry will rarely be filled, but the triggered sampling requires a higher sampling rate. Figure 16 illustrates the trade-off of the different ℎ   for differing   . The ℎ   values that are chosen for the main experiment are summarized in Table 6.

Blast radius: While SRS is not impacted by the blast radius [91], PARA has to be adjusted to issue additional victim row refreshes, resulting in performance degradation. However, because the change in the   ( §5) has minimal impact on the degree of Cube security, the performance improvement ratio of Cube PARA over PARA is consistent (see Figure 17).

Effect of Non-RH DRAM Faults
Although the possibility of large granularity errors coexisting in the RH victim row is extremely low ( §3.4), some forms of single-cell faults (SCF) can affect the security of Cube. While soft errors can be corrected by OECC scrubbing, birthtime scaling faults that were corrected by OECC or post-shipment hard/intermittent faults [33,87] may coexist with the RH error. The rate of SCF has increased over time, and may even be reaching 10^-4 in the pessimistic case [16,69,84]. When an SCF exists in an RH victim row, which happens with a low probability, a Cube Chipkill UE can occur even with less hammering. We can fully reflect this in the security analysis by stochastically reducing the effective HC Chipkill . Figure 18 shows the security level of Cube with varying rates of SCF.

S RH and D RH Probability
Table 7 summarizes the probability of failure ( §5) and lifetime metric [91], or the expected time of failure, of Cube PARA and Cube SRS .
Although the lifetime of S RH is several orders of magnitude smaller than that of D RH , it still corresponds to only one occurrence every few years.

DISCUSSION

Cube for Other ECC and DRAM Devices
Cube can be employed for different DRAM ECC schemes as long as the correction granularity matches the scramble granularity. This trait allows Cube to be adopted for wider DRAM devices, the more recent DDR5 devices, and different Chipkill ECCs, although its practicality varies.

1) Wide DRAM device: A row scramble method can theoretically be applied at the cell level instead of the chip level for a SEC (or SECDED) on-die ECC. However, it would be impractical because it requires massive modification of the DRAM internals for a limited effect. Therefore, although possible, it would be unwise to apply the scramble technique of Cube to wide DRAM devices such as LPDDR, GDDR, and HBM, which do not employ rank-level ECC (i.e., side-band ECC). Although it is possible to exploit data compression and use the extra bits to construct a rank-level ECC for such wide devices [73], it is beyond the scope of this work.
2) DDR5: Cube can also be implemented on DDR5, which does not yet have a widely accepted Chipkill-level ECC proposition. DDR5 differs from DDR4 in that it adopts narrower sub-channels with longer bursts, requiring a smaller number of chips per sub-channel [37]. Constructing Chipkill on DDR5 has to consider multiple trade-offs among the correction granularity (e.g., device, half-device, and pins), the capacity overhead of parity bits, the number of chips/ranks/sub-channels involved per memory access, and the effective bandwidth. For example, we can improve the power consumption and effective bandwidth by decreasing the number of chips that constitute the Chipkill, at the cost of a higher parity bit capacity overhead [3]. Conversely, decreasing the correction granularity to half-device at a fixed parity bit capacity overhead lowers the reliability guarantee [18].
Cube can be natively employed for DDR5 with device-granularity SCC Chipkill. For a half-device-granularity Chipkill, it would require two parallel scramble functions per chip and control paths reaching DRAM subarrays or MATs. A MAT is a 2D array of cells with rows and columns (e.g., 512×512), which has its own sub-wordline driver and bitline sense amplifier. An array of MATs that share a main wordline constitutes a subarray. The sub-wordline driver can be adapted to enable distinct row activation for each MAT [88].
3) Sub-chip Granularity ECC: There exist multiple variations of ECC applied to DRAM. A prominent example, Bamboo ECC [47], adopts a vertically aligned layout of the codeword. In particular, it provides a better protection guarantee at pin granularity within a fixed redundancy, which better matches the error characteristics of DRAM. Cube can also be applied to such ECCs, as long as the correction granularity matches the scramble granularity. In fact, if data from one MAT only passes through one I/O pin, the implementation cost can be rather small, with one parallel scramble function per pin and control paths reaching the DRAM subarray or MAT.

Tailored Data Pattern Attacks
Some prior studies [17,39] proposed a tailored data pattern RH attack that only triggers bitflips on target cells instead of causing bursty bitflips across the unwanted cells. For example, the pinpoint attack [39] exploits an interleaving data pattern that is oblivious to the victim cell data to suppress the unwanted bitflips. If such an attack is effective against Cube, it can undermine the RH diagnosis using the OECC error profile, as bursty bitflips become less likely. However, such tailored attacks are cell-dependent and require a scanning phase in advance. During the scanning phase, the attacker identifies the location of the RH-vulnerable cells and their effective/non-effective data patterns for a specific victim DRAM row, exploiting a huge page and an RH attack. Because Cube prevents the RH attack itself and also hides the PA-to-DA mapping, such a scanning phase is prevented, disabling the tailored attack from its beginning.

Passing Gate Effect
Recent studies [32,62,70] have demonstrated a DRAM row disturbance different from RH, which is called the passing gate effect. When an adjacent aggressor row is kept open (activated) for a long time, its victim row can experience a bitflip even with a much smaller number of aggressor activations compared to RH. Existing RH protection mechanisms can be augmented with row-open-time-aware counters or an adjusted target   [32,62]. For example, PARA [52] can be adapted with a timeout register that limits the maximum row open time, with an adjusted target   considering the worst-case row open times. However, the passing gate effect does not affect Cube, as the errors from the passing gate effect are similar to those from RH: they are i) localized around the aggressor row and ii) occur in a bursty way [70]. As long as these two traits are preserved, Cube's i) DRAM PA-to-DA scramble and ii) RH (passing gate effect) diagnosis using the OECC error profile remain effective.

RELATED WORK
DRAM ECC Mechanism: Multiple variations of Chipkill [2,6,27,47,63,69,79] can be orthogonally applied together with Cube, as it only scrambles the adjacency relation between the rows. A set of works modify the currently used ECC to boost their detection capability [22,23,41], often using the message authentication code (MAC). While this allows fast detection of RH-induced bitflips with a low probability of failure, degenerating the    to DoS   , it does not further boost the correction capability.

RH Protection Schemes: While multiple architectural and software-based RH protection schemes have been proposed [8,9,14,38,42,46,50,52,54,56,74,81,91,92,95,96,98,106,113,114], the majority of them can be orthogonally applied with Cube, although not synergistically for deterministic schemes. As for the victim row refresh approaches, Cube can cooperate assuming the interface specifies the PA of the aggressor and the DRAM device locates its victims [74]. Throttling [113] and quarantine [92] are two other prominent mechanisms, which can be orthogonally applied with Cube because they do not require the   information.

Cache Randomization: There exist multiple cache side-channel prevention studies [82,83,90] that encrypt the address that indexes the cache, preventing or slowing down the attacker from acquiring a cache eviction set. However, the main difference in Cube is that the expected time for each query by the attacker and the final success time are prohibitively long. While both can be broken by a brute force attack that exploits the possibility of collision, the former can be broken in dozens of seconds [82] unless some dynamic remapping scheme is adopted. Under Cube, it takes years just for a single query, let alone the final collision and attack success.

CONCLUSION
We have proposed Cube, a novel chip-wise DRAM address randomization scheme that leverages the abundant correction capability of Chipkill and detection capability of OECC to mitigate RH. Cube randomizes the DRAM row address using a boot-time key, a Feistel cipher for static randomization, and a modular multiplication unit, distributing the RH victims to multiple Chipkill codewords. Also, Cube quickly diagnoses the RH victims using the OECC scrubbing data and the newly observed characteristic of the RH errors. When combined, Cube improves the security of any probabilistic scheme to a large extent, for example by up to 10^25 at   of 4K on PARA. Cube also decreases the performance and table size overhead of the state-of-the-art SRS by 24.3% and 39.9% when   is 2K and the target RH success probability per year is 10^-10 on a DDR5 rank.

of aggressor PAs (e.g., PA agg1 and PA agg2 ) that hammer the same victim PA (e.g., PA vic ) on different chips (e.g., chip 1 and 2). While an RH attack is prevented by Cube in a server environment, the FPGA environment can enable an easier success of an RH attack. Such a set of aggressor PAs satisfies the following example for chip 1 and 2. We assume PA vic1 as an upper victim for a more concise notation.

1 }, ] for  1 on chip 1 =  [ −1 2 • {

Then, ② the attacker can try to reuse the discovered set of aggressor PAs on a server environment with a new LLBC key (e.g.,  ′ ). If such reuse is possible, it will only require two consecutive S RH for D RH instead of multiple S RH for D RH , degrading the security of Cube.
However, the prior method for the shortcut attack is not directly applicable to the Cube scramble function due to the constant multiplication and constant addition. The constant multiplication is a part of the scramble function. The constant addition of 1 (or −1) is a fundamental restriction on the attacker, who can only acquire the victim PA information based on the Chipkill correction timing side-channel. These two operations prevent the scramble function from becoming a mere chain of XORs, preventing the key-independent reuse of the PA attackers.

Figure 1: Exemplar DRAM organization: Two DIMMs per memory channel, one rank per DIMM assuming ×8 chips with a burst length of eight.

Figure 2: (a) FPGA-based RH test environment with a temperature controller. (b) RH data patterns.

Figure 3: Cumulative distribution function (cdf) graph of rows regarding single error (HC first ) and double error (HC OECC ) in any codeword OECC of the row.

Figure 4: Probability density function (pdf) graph of rows regarding the number of codeword OECC with error at each row's (a) HC OECC and (b) HC Chipkill .

Figure 5: The number of rows with errors after the OECC and rank-level SECDED/Chipkill/Double-Chipkill correction at a given activation count.

Figure 6: An overview of the randomized PA-to-DA mapping on Cube. (a) The RH attack situation under the conventional Chipkill. It causes multiple RH errors across the whole victim row and is thus uncorrectable. (b) Under Cube, RH victims are distributed to multiple codeword Chipkill , and thus each victim row is correctable.

Figure 7: Varying RH victim detection coverage at (a) a single-chip and (b) multi-chip granularity.

Figure 8: Reverse-engineering adversary can acquire information about the mapping function exploiting the Chipkill correction timing side-channel.

Figure 10: The error profile from OECC scrubbing is collected from all chips and evaluated at the MC.

Figure 13: The relative performance (weighted speedups) of various schemes compared to the baseline system where no RH mitigation scheme exists.

Figure 16: Sensitivity study on the ℎ   of Cube PARA-filter . There exists a performance-optimal ℎ   point for each   .

Figure 17: Sensitivity study on blast radius. The performance improvement ratio of Cube compared to the baseline PARA is consistent.

Table 1: Terminologies and abbreviations

Table 2: Types of Row Hammer attacks

Table 3: Symbols used throughout §4
S RH : Single RH success on a codeword Chipkill (CE).
D RH : Double RH success on a codeword Chipkill (UE).
() : Mapping function for chip , which takes PA as input and DA as output.
−1() : Reverse of the mapping function for chip , which takes DA as input and PA as output.
  : The set of other PAs physically adjacent to a given PA.
  : Total number of rows in a bank.
Cube requires a few registers and control paths to synchronize the OECC scrubbing of each DRAM chip, gather the error profile information, and initiate Chipkill correction when the RH victim is recognized (Figure 10). The OECC scrubbing must be done at the RH victim row granularity. Thus, each chip must be synchronized to a single candidate aggressor PA scrub , and scrub its   by computing { −1  (  (PA scrub ) ± 1)}. Each chip has a 17-bit register holding the PA scrub .
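The per-chip victim computation above (map the candidate aggressor PA to its DA, step to the neighboring DAs, and map back to PAs) can be sketched as follows. The mapping used here is a toy keyed bijection (XOR with a key followed by multiplication by an odd constant modulo 2^17), chosen only because it is easy to invert; it is not Cube's actual scramble function, and the keys, multipliers, and function names are hypothetical.

```python
ROW_BITS = 17                    # matches the 17-bit PA scrub register
NUM_ROWS = 1 << ROW_BITS
MASK = NUM_ROWS - 1


def make_chip_mapping(key: int, mult: int):
    """Return (to_da, to_pa) for one chip; a toy stand-in for its mapping."""
    if mult % 2 == 0:
        raise ValueError("multiplier must be odd to be invertible mod 2**17")
    key &= MASK
    inv = pow(mult, -1, NUM_ROWS)            # modular inverse (Python 3.8+)

    def to_da(pa: int) -> int:
        return ((pa ^ key) * mult) & MASK

    def to_pa(da: int) -> int:
        return ((da * inv) & MASK) ^ key

    return to_da, to_pa


def victim_pas(pa_scrub: int, to_da, to_pa) -> list:
    """PAs whose DAs sit directly next to the DA of pa_scrub on this chip."""
    da = to_da(pa_scrub)
    return [to_pa(n) for n in (da - 1, da + 1) if 0 <= n < NUM_ROWS]


# Hypothetical per-chip (key, multiplier) pairs; the first is the identity.
chips = [make_chip_mapping(k, m) for k, m in [(0, 1), (0x155, 3), (0x1ABC, 5)]]
for chip_id, (to_da, to_pa) in enumerate(chips):
    print(chip_id, [hex(v) for v in victim_pas(0xA, to_da, to_pa)])
```

With the identity mapping the victims of PA 0xA are its numeric neighbors, while the keyed chips report different PAs, which is the property that places the victims of one aggressor PA in different codewords.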
P′2, P′3, and P′4, which are the probabilities with RH diagnosis, are calculated in the same way as P2, P3, and P4. The final attack success probability with RH diagnosis for a year, P′5, is as follows:

*  _10 : # of tREFWs that an aggressor is hammered in 10 minutes

Figure 11: The RH attack success probabilities of SRS, C SRS , and Cube SRS .

Legend: C PARA (Atk targeted , P3), C PARA (Atk blinded , P4), Cube PARA (Atk targeted ), Cube PARA (Atk blinded , P′5)

Table 4: Parameters for architectural simulation

Table 5: Cube SRS and SRS  configurations, Cube SHA and SHADOW RAAIMT configurations (columns:   , SRS, C SRS , Cube SRS , SHADOW, C SHA , Cube SHA )

Table 6: Cube PARA and PARA  configurations (columns:   , PARA, C PARA , Cube PARA , Cube PARA-filter )

Table entry count for SRS, C SRS , and Cube SRS .

Table 7: Probability and lifetime [91] of S RH and D RH under a continuous RH attack on Cube PARA and Cube SRS .