Unity ECC: Unified Memory Protection Against Bit and Chip Errors

DRAM vendors utilize On-Die Error Correction Codes (OD-ECC) to correct random bit errors internally. Meanwhile, system companies utilize Rank-Level ECC (RL-ECC) to protect data against chip errors. Separate protection increases the redundancy ratio to 32.8% in DDR5 and incurs significant performance penalties. This paper proposes a novel RL-ECC, Unity ECC, that can correct both single-chip and double-bit error patterns. Unity ECC corrects double-bit errors using unused syndromes of single-chip correction. Our evaluation shows that Unity ECC without OD-ECC can provide the same reliability level as Chipkill RL-ECC with OD-ECC. Moreover, it can significantly improve system performance and reduce DRAM energy and area by eliminating OD-ECC.


INTRODUCTION
Dynamic Random Access Memory (DRAM) has long been employed for main computer memory due to its high capacity and low cost-per-bit. DRAM technology has consistently scaled down to accommodate the demands of large-scale applications [50,53]. Nevertheless, the shrinking of DRAM process technology presents four critical challenges when addressing application requirements: (1) high access latencies diminish system performance, (2) DRAM access energy does not scale down with process technology, (3) the hardware overhead of DRAM has emerged as a significant concern for the cost-sensitive DRAM market [7,56], and (4) increased DRAM susceptibility to faults reduces system reliability [22,51].
First, high DRAM access latency challenges system performance. Although DRAM capacity has increased significantly, access latency has improved by only 16.7% in the last two decades [4,9-11,23,39,40,42]. Processors often spend hundreds of clock cycles accessing data in DRAM, leading to performance bottlenecks that can negatively impact applications with low memory-level parallelism, high cache miss rates, and large working sets [3,5,18,24,30,31,37,44]. Second, DRAM energy consumption has become a critical concern across modern computing systems [2,15,16,29,46,60]. DRAM-based main memory makes up a substantial portion of overall energy consumption; for instance, DRAM accounts for 40% of the total power in graphics cards [57] and 40% of the total energy in servers [2,68]. Third, DRAM hardware overhead is a crucial issue for cost-sensitive DRAM vendors [7,56]. Vendors optimize DRAM cell arrays for low area-per-bit by densely packing them [41,42,61], and the difficulty of fabricating error-free dense DRAM has led to the introduction of On-Die ECC (OD-ECC) [7]. Fourth, to ensure reliability, Error-Correcting Codes (ECCs) are implemented to detect and correct errors in data [19,33,54]. While Rank-Level ECC (RL-ECC) is utilized for high reliability (e.g., Chipkill), it cannot correct randomly-scattered bit errors from multiple chips. DRAM vendors use OD-ECC to correct scattered errors [26], but it can impact system performance, energy consumption, and hardware overheads.

This paper proposes a novel unified memory protection scheme, Unity ECC, to address all four DRAM challenges. Unity ECC improves system performance while reducing DRAM energy consumption and DRAM hardware overheads, all while maintaining system reliability at an acceptable level. Unity ECC is a single-tier RL-ECC, and this paper explains its implementation for DDR5 DRAM.
Figure 1 compares Unity ECC with conventional DDR5 ECC. Conventional DDR5 utilizes both OD-ECC and RL-ECC to perform double-bit error correction (through OD-ECC) and single-chip error correction (through RL-ECC). Due to this configuration, DDR5 has a total redundancy ratio of 32.8% when combining OD-ECC and RL-ECC. In contrast, Unity ECC eliminates OD-ECC and maps additional unused syndromes in RL-ECC to double-bit correction, enabling RL-ECC to perform the role of OD-ECC as well.
Unity ECC reduces the DRAM redundancy from 32.8% to 25%, and eliminating OD-ECC also decreases the DRAM access latency. We find that Unity ECC increases system performance by 7.3% on average (geomean) for single-core memory-intensive workloads and 8.2% for high misses-per-kilo-instruction (MPKI) multi-core workload groups, while DRAM energy consumption is reduced by 8.0% for memory-intensive workloads. Moreover, the chip die area overhead in DRAM also decreases by 6.9% by eliminating the OD-ECC redundancy and decoder hardware.
The main contributions of the paper are as follows:
• We propose a novel single-tier RL-ECC called Unity ECC that can correct both single-chip errors and double-bit errors without any additional RL-ECC redundancy.
• We provide an algorithm to flexibly construct a Unity ECC code by searching the Reed-Solomon syndrome space.
• We describe an efficient decoding method that corrects single-chip and double-bit errors in parallel, resulting in negligible hardware overheads.
• We evaluate Unity ECC, showing it to have significant performance, energy, and hardware cost benefits over conventional DDR5 while still maintaining acceptable reliability.

BACKGROUND
This section reviews the terminology that is fundamental to Unity ECC. It then explains the overall structure of DDR5 DRAM and the memory subsystem, along with the Rank-Level ECC and On-Die ECC used therein.

Terminology
An error is a discrepancy between the intended and actual state of a system, a fault is a defect or physical phenomenon that can lead to an error, and a failure occurs when an erroneous system is unable to perform its intended service [1]. Transient faults are temporary defects due to environmental factors like high-energy particle strikes, while permanent faults are irreversible physical defects causing persistent errors, such as stuck-at-0 faults [62].
Error Correcting Codes (ECC) can detect and correct errors by adding redundant information in the form of check bits. ECC encoding algorithmically generates check bits from data bits. A valid pair of data and check bits is called a codeword. Errors in a codeword can cause inconsistencies between data and check bits; a non-codeword is an invalid pair due to errors. ECC decoding refers to the recovery of the original data using the check bits. Decoding outcomes can be classified into four categories: No Error (NE), Correctable Error (CE), Detectable but Uncorrectable Error (DUE), and Undetectable Error (UE). UEs can result in a Silent Data Corruption (SDC), potentially compromising the final computation output. Reliability indicates the continuity of service without failure [1], often measured in Failures In Time (FIT). FIT denotes the expected number of failures during a billion hours of operation.
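As a quick illustration of the FIT metric defined above, a minimal sketch (the observation numbers are made up for the example, not field data):

```python
# Sketch: converting an observed failure count to FIT (Failures In Time).
# FIT = expected failures per 10^9 device-hours of operation.

def failures_to_fit(failures: int, devices: int, hours: float) -> float:
    """FIT = failures per billion device-hours."""
    return failures / (devices * hours) * 1e9

# Illustrative only: 5 failures across 10,000 DIMMs over one year (8,760 h).
fit = failures_to_fit(failures=5, devices=10_000, hours=8760)
print(round(fit, 2))  # 57.08
```

One failure per billion device-hours is, by definition, exactly 1 FIT.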

DRAM Organization
DRAM is widely used due to its high density. DRAM chips have multiple data pins (DQs) to transfer data in parallel. A DRAM chip with n DQs is referred to as a ×n chip (e.g., a ×4 chip has four DQs). A Dual In-line Memory Module (DIMM) mounts several DRAM chips in parallel to provide a standardized data width (e.g., 64 pins).
A rank is a group of DRAM chips accessed in parallel, frequently composed of a single DIMM. Ranks within the same channel share the processor interface by time-sharing. DRAM accesses transfer data over multiple cycles to exploit locality. The burst length refers to the number of consecutive locations that can be accessed in a single burst of data transfer.
The DRAM burst length has increased over generations, reaching 16 in DDR5. In a standard 64-bit DIMM, this can lead to a 128B access granularity. To align with the 64B cache granularity found in many processors, JEDEC introduced sub-channels in DDR5. A DDR5 DIMM is made up of two sub-channels, each with a 32-pin data interface and the ability to operate independently.
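The granularity arithmetic behind sub-channels can be sanity-checked in a few lines:

```python
# Sketch: access granularity = data pins x burst length / 8 bytes.

def access_granularity_bytes(data_pins: int, burst_length: int) -> int:
    return data_pins * burst_length // 8

# A 64-bit interface at burst length 16 yields 128B -- twice the 64B
# cache-line size of many processors.
assert access_granularity_bytes(64, 16) == 128
# A DDR5 sub-channel (32 data pins, BL16) restores 64B granularity.
assert access_granularity_bytes(32, 16) == 64
```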

DRAM errors
DRAM errors are prevalent in modern computing systems, with chip errors and bit errors being the most common types. Chip errors can result from manufacturing defects or wear over time, affecting entire memory cell rows, columns, or banks [62].
As the DRAM manufacturing process has shrunk, DRAM reliability has worsened [20], and bit errors have become increasingly dominant [7]. Several factors contribute to the rise in bit errors, including fewer electrons retained in smaller memory cells, heightened susceptibility to disturbances [35], and weakened cells causing randomly-distributed single-bit errors [64].
To counter the increased bit errors in smaller DRAM process technology, memory systems employ error detection and correction techniques, such as Error-Correcting Code (ECC) [19,25,33]. Various types of ECC are tailored to different purposes, with the appropriate method selected based on the situation. For example, DDR5 DRAM chips use Single Error Correction ECC due to its low overhead and ability to correct bit errors [26].

Rank-Level ECC
System companies have long used Rank-Level ECC (RL-ECC) to protect memory against errors. The memory controller encodes rank data and stores the generated redundancy on extra chips in an ECC-DIMM. Single Error Correction-Double Error Detection (SEC-DED) on 64-bit data requires 8-bit redundancy, resulting in the standard 72-pin ECC-DIMM up until DDR4. Some companies leverage this redundancy to provide a strong correction capability known as Chipkill-correct [12,25,27,28,33]. Chipkill-correct is a highly effective error-correction technique that can correct single-chip errors. Field studies have shown that Chipkill-correct can correct about 99% of DRAM errors by correcting multi-bit errors within a chip, whereas SEC-DED can correct approximately 95%. The increased error correction capability makes Chipkill-correct a valuable tool for improving memory reliability against severe faults, such as a row decoder fault or a dead chip.
In DDR5, the ECC-DIMM configuration has been modified to support sub-channels. A DDR5 ECC-DIMM has 80 data pins, allocating 32 pins for data and 8 pins for redundancy in each sub-channel. The 8 redundant pins provide enough redundancy to correct errors in a ×4 chip. However, this change increases the redundancy ratio to 25%, leading to increased costs and power consumption.

On-Die ECC
As process technology continues to shrink, DRAM has become more vulnerable to errors [20]. The smaller feature sizes in advanced manufacturing processes lead to several challenges that can impact the reliability of DRAM, including: (1) reduced noise margins, (2) increased sensitivity to external factors, (3) higher cell-to-cell interference, (4) increased variability, and (5) higher leakage currents. [7] estimated that the fault rate can reach 10^-4 in the 1Y nm process. To counteract these challenges and maintain memory reliability, DRAM vendors introduced On-Die ECC (OD-ECC) in DDR5, LPDDR4, and HBM2E.
OD-ECC can correct errors inside a DRAM chip using extra cells on the DRAM die. During a write operation, an ECC encoder on the DRAM die internally generates redundancy from per-chip data and stores it on the redundant cells. When the data is read, an ECC decoder internally corrects errors using the stored redundancy, effectively making the erroneous DRAM chips appear error-free to the external components.
OD-ECC typically provides bit-level error correction, offering protection against random bit errors. For example, in a DDR5 chip, the internal ECC encoder generates 8 check bits from 128-bit data. The 8-bit redundancy is stored in the redundant cells and allows Single Error Correction (SEC) capability over the 136-bit word, ensuring that any single-bit error within the data can be corrected before being sent to the processor. By using both SEC OD-ECC and Chipkill-correct RL-ECC, the system indeed provides strong protection against both bit-level and chip-level errors. However, the reliability improvement offered by this combined approach comes at the cost of increased redundancy, higher energy consumption, and lower system performance (Section 4).
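The check-bit counts quoted here follow from the Hamming bound 2^r >= k + r + 1; a quick sketch:

```python
# Sketch: minimum SEC (Hamming) redundancy r for k data bits,
# from the bound 2**r >= k + r + 1.

def sec_check_bits(k: int) -> int:
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r

assert sec_check_bits(128) == 8   # DDR5 OD-ECC: (136, 128) code, 6.25%
assert sec_check_bits(64) == 7    # SEC over 64-bit data: 7 bits, ~10.9%
```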

RELATED WORK
This section briefly reviews works related to Unity ECC, including bit-level and chip-level ECCs. Bit-level ECCs protect against random bit errors, but they may be unable to correct severe multi-bit errors caused by high-energy particle strikes or peripheral circuitry faults. Chip-level ECCs, such as Chipkill-correct, are employed to detect and correct errors affecting an entire chip. Current DDR5 ECC combines bit-level and chip-level ECCs (e.g., SEC OD-ECC + Chipkill-correct RL-ECC), providing robust protection at the cost of high redundancy. Unity ECC surpasses this combined approach by effectively guarding against both bit and chip errors with a single unified RL-ECC. This results in less redundancy, enhanced storage efficiency, and potentially reduced energy consumption.

Single Error Correction
In 1950, Richard Hamming introduced Single Error Correcting (SEC) codes, also known as Hamming codes [21]. These codes use r-bit redundancy to correct a single error in a (2^r − 1)-bit word and are based on linear block codes. They are widely used for error detection and correction in digital communication and memory systems.
Hamming codes rely on the construction of an H-matrix (a.k.a. parity check matrix), which is an (r × n) matrix with r redundant bits and n total bits in the codeword. The H-matrix must have unique non-zero values in each column to efficiently identify and correct single-bit errors. This property ensures that each error pattern produces a distinct syndrome, allowing the decoder to locate and correct the single-bit error.

Double Error Correction
Double Error Correcting (DEC) codes extend correction beyond the capability of SEC codes. Among various ways to create DEC codes, BCH (Bose-Chaudhuri-Hocquenghem) codes are a popular choice due to their flexibility [6]. However, there are some challenges in using BCH DEC codes for OD-ECC. First, the required redundancy is larger than that of SEC codes. For example, for 128-bit data, an SEC code requires 8-bit redundancy, while a DEC code requires 16-bit redundancy. This increased redundancy can result in higher overhead in terms of storage and power consumption. Additionally, the decoding process of BCH DEC codes is more complex compared to SEC codes. The complexity of the decoding process may lead to increased latency and higher power consumption during error detection and correction, which could negatively impact system performance.
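Syndrome decoding with a distinct-column H-matrix (the SEC case) can be sketched in a few lines. This toy (7, 4) Hamming code is illustrative, not the exact code used in any DRAM:

```python
import numpy as np

# Sketch: when each H-matrix column is the distinct non-zero value 1..n,
# the syndrome of a single-bit error equals the column at the error
# position, so it directly indexes the flipped bit.

n, r = 7, 3
H = np.array([[(col >> b) & 1 for col in range(1, n + 1)]
              for b in range(r)])          # 3x7, column j = j in binary

def correct_single_error(word: np.ndarray) -> np.ndarray:
    syndrome = H @ word % 2                # r-bit syndrome
    pos = int(sum(s << b for b, s in enumerate(syndrome)))
    if pos:                                # non-zero syndrome -> flip bit
        word = word.copy()
        word[pos - 1] ^= 1
    return word

codeword = np.zeros(7, dtype=int)          # the all-zero word is a codeword
corrupted = codeword.copy()
corrupted[4] ^= 1                          # inject a single-bit error
assert np.array_equal(correct_single_error(corrupted), codeword)
```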

Single Symbol Correction
Bit-level ECCs, though effective for random bit errors, become inefficient when handling multi-bit errors due to increased redundancy requirements. In contrast, symbol-based ECCs offer efficient protection against chip-level errors by grouping affected bits into a single symbol and correcting any errors within those bits using Single Symbol Correction (SSC).
Reed-Solomon (RS) codes are one of the most popular symbol-based ECCs [59]. They are non-binary linear block codes designed to correct errors within symbols, where each symbol consists of multiple bits. They can correct t symbol errors with 2t redundant symbols if the word size is no greater than 2^m − 1 symbols, where m is the symbol size. RS codes are particularly well-suited for correcting aligned errors where errors do not cross a boundary (e.g., a chip boundary in a DRAM DIMM). AMD Chipkill is a prominent example of Chipkill-correct, which applies Reed-Solomon (RS) codes to DRAM for error detection and correction [25].
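Single-symbol correction with two redundant RS symbols can be sketched over GF(2^8). This is a minimal illustration, not AMD's actual implementation; using the all-zero codeword keeps the example self-contained (the received word then equals the error pattern, so the syndromes are computable directly):

```python
# Sketch: GF(2^8) arithmetic (primitive polynomial 0x11D) and
# single-symbol correction from two syndromes. For one symbol error e at
# position j: S0 = e and S1 = e * alpha^j, so j = log(S1) - log(S0).

EXP, LOG = [0] * 512, [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

def correct(received):
    s0, s1 = 0, 0
    for j, rj in enumerate(received):       # syndromes of the received word
        s0 ^= rj
        s1 ^= gf_mul(rj, EXP[j])
    if s0 == 0 and s1 == 0:
        return received                     # a valid codeword: no error
    # s0 == 0 with s1 != 0 would indicate an uncorrectable pattern
    # (not handled in this sketch).
    j = (LOG[s1] - LOG[s0]) % 255           # error locator: alpha^j = S1/S0
    fixed = list(received)
    fixed[j] ^= s0                          # error magnitude is S0
    return fixed

# 10-symbol word, echoing the (10, 8) layout used by AMD Chipkill.
received = [0] * 10
received[6] = 0x3C                          # corrupt symbol 6
assert correct(received) == [0] * 10
```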
Figure 3 shows an overview of AMD Chipkill as it applies to DDR5. AMD [25] builds 8-bit RS symbols by combining two consecutive 4-bit beats from a ×4 chip. The larger symbol size reduces the number of redundant symbols required for SSC down to two, allowing AMD to achieve SSC with a single ECC-DIMM. However, the codes themselves do not offer double symbol detection capability, which can compromise system safety. To compensate for this weakness, AMD utilizes a hardware-managed ECC history mechanism (referred to as conservative mode in this paper). This technique enhances error detection capabilities by recording error locations for each ECC word. When words within a memory transfer block report different correction positions, AMD's Chipkill mechanism assumes that some of these corrections are miscorrections of multi-chip errors rather than corrections on distinct chips. This assumption is based on the low likelihood of different chips exhibiting errors during the same access. In such cases, the mechanism discards the corrections and reports the event as a Detectable but Uncorrectable Error (DUE).
In the case of DDR5, the ECC-DIMM configuration has changed to accommodate the new sub-channel architecture. Instead of the traditional (64 + 8)-pin ECC-DIMM, a DDR5 ECC-DIMM has two (32 + 8)-pin sub-channels. This new configuration is designed to preserve the 64B access granularity and enhance parallelism. However, it also means that existing ECC techniques, which were developed for the older (64 + 8)-pin ECC-DIMM, may need to be redesigned to protect DDR5 memories effectively.

Double Bit Error Correcting-Single b-bit Byte Error Correcting

DEC-SbEC (Double Bit Error Correcting-Single b-bit Byte Error Correcting) codes are capable of correcting both random double-bit errors and single b-bit byte errors, although not simultaneously [66].
With 24-bit redundancy, DEC-SbEC can correct double-bit errors or single 8b-symbol errors on (64 + 24)-bit words. This means that it can address both bit-level errors and chip-level errors, offering a potential single-level unified protection for DRAM. However, the redundancy required for DEC-SbEC (37.5%) is higher than the combined redundancy of OD-ECC and RL-ECC (32.8% in DDR5). In contrast, Unity ECC provides both bit-level and chip-level protection using the same redundancy as DDR5 RL-ECC (i.e., 25%). This makes Unity ECC a more efficient solution for providing robust DRAM protection without a redundancy increase.

Multi-Tiered ECC
Prior studies proposed multi-tiered ECC schemes for memory protection [13,19,28,54,65,69]. [69] presents a virtualized and flexible ECC scheme for main memory that dynamically adjusts ECC based on memory usage, enhancing error detection and correction capabilities. This approach maximizes performance improvement and energy efficiency by efficiently allocating ECC resources according to memory demand. Udipi et al. [65] proposed LOT-ECC, which uses L1 local error detection, L2 global error correction, and parity across L2 to provide high reliability while minimizing overheads. Jian et al. [28] proposed a scheme called Multi-ECC that groups multiple memory lines together, enabling low-power, low-storage-overhead chipkill correct by distributing the correction capabilities across several memory lines. Chen et al. [13] present a rate-adaptive, two-tiered error correction code scheme that dynamically adjusts error correction strength based on the observed error rates, allowing for efficient and reliable error correction in 3D die-stacked memory systems. Nair et al. [54] introduce a method that exposes on-die error detection information to the memory controller, enabling more accurate error detection and correction decisions, which in turn significantly enhances reliability. Gong et al. [19] propose a technique that exposes on-chip redundancy to rank-level ECC, allowing for effective utilization of both on-chip redundancy and ECC, resulting in improved memory system reliability.
Overall, these studies separate error detection and correction and move the sophisticated error correction part off the latency-critical read path. Consequently, multi-tiered ECC provides a more robust and resource-efficient solution compared to traditional single-tier ECC schemes, optimizing memory system performance and reliability. The purpose of Unity ECC, in contrast, is to provide a strong single-tier ECC scheme for the memory system; it is not a multi-tiered scheme. In fact, it is possible to apply Unity ECC to these schemes, but such an evaluation is beyond the scope of this paper.

MOTIVATION
This study is motivated by the high costs of separate bit-level and chip-level protection. Combining OD-ECC and RL-ECC provides robust memory protection against both bit-level and chip-level errors. However, it increases redundancy and negatively impacts performance due to overfetching and Read-Modify-Writes (RMWs) in OD-ECC. Meanwhile, DDR5 Chipkill-correct RL-ECC has unused syndromes, which, if utilized to correct more bit errors, can eliminate OD-ECC to reduce redundancy, energy consumption, and performance overheads.

OD-ECC Overheads
DDR5 OD-ECC employs (136, 128) codes to correct single-bit errors [26]. This implementation requires an additional 6.25% of cells for redundancy, and the extra circuitry for encoding and decoding further enlarges the chip area. A DRAM vendor has reported a total chip area increase of 6.9% for OD-ECC [7], which presents a substantial challenge for cost-sensitive manufacturers. When combined with the 25% extra chips in a DDR5 ECC-DIMM, the overall cell redundancy escalates to 32.8%.
OD-ECC also degrades performance due to the disparity between access granularity (64-bit data) and ECC granularity (128-bit data). A ×4 DDR5 chip transfers 64-bit data over a 16-beat transfer. Ideally, the OD-ECC block size should correspond to the access granularity, but providing SEC over 64-bit data increases the redundancy to 10.9% (7-bit). The incongruity between access and ECC granularities leads to overfetching and RMW operations, which increases power consumption and negatively affects performance.
For every 64-bit read, a DRAM chip must internally fetch 128-bit data along with its redundancy, decode the information, and transfer only half of the fetched data. This process consumes more power and lengthens the read time (by up to 2ns in [38]). The situation becomes more problematic for writes, which require fetching the original 128-bit block, partially updating the block with new data, encoding the data, and writing the block back to cells [7,19,32,34]. DDR5 micro-architectures have maintained most timing parameters despite this change, except for one: tCCD_L_WR, the latency between two consecutive writes to the same bank group, which has doubled due to OD-ECC. Due to the increased read time and tCCD_L_WR, OD-ECC is reported to reduce the performance of memory-intensive applications by an average of 5-10% [7].

Shortened Codes in RL-ECC
Meanwhile, DDR5 RL-ECC has the potential to provide more-than-Chipkill correction. As an example, we apply AMD Chipkill to a DDR5 sub-channel and demonstrate that many syndromes are used for detection only.
On a DDR5 sub-channel with 32-pin data and 8-pin redundancy, we construct 8-bit symbols from two consecutive data beats from a ×4 chip, similar to the AMD approach (Figure 3). Consequently, an ECC word comprises 8 data symbols and 2 redundant symbols. The two redundant symbols (16 bits in total) offer 65,535 distinct non-zero syndromes, which can be used to identify any single symbol error (255 cases for 8-bit symbols) across 255 symbol positions.
However, the ECC words contain only 10 symbols (8 for data and 2 for redundancy), and the remaining 245 symbols are replaced with zeros during encoding and decoding (i.e., shortened). If a decoded syndrome corresponds to errors on one of the shortened symbols, it is considered the detection of a more severe error (e.g., a two-chip error) rather than a correction of the constant, error-free shortened symbols. As a result, only 2,550 syndromes (3.89%) out of the 65,535 syndromes are used for correction, and the remaining 96.11% of syndromes are used for detection only.
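The syndrome accounting above is easy to reproduce:

```python
# Sketch: syndrome accounting for the shortened (10, 8) RS code described
# above. Two 8-bit redundant symbols give 2**16 - 1 non-zero syndromes,
# but a shortened 10-symbol word only uses correction syndromes for its
# own positions.

total = 2 ** 16 - 1                 # 65,535 non-zero syndromes
per_symbol = 2 ** 8 - 1             # 255 correctable error patterns/symbol
correction = 10 * per_symbol        # syndromes used for correction
detection_only = total - correction

assert correction == 2550
assert round(100 * correction / total, 2) == 3.89
assert round(100 * detection_only / total, 2) == 96.11
```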
If these syndromes can be repurposed to correct multi-bit errors, we can potentially eliminate the need for OD-ECC, reducing redundancy, power consumption, and performance overheads. This change trades detection capability for correction and should be carefully controlled so as not to degrade the detection coverage level, which is important to large-scale systems and mission-critical systems.

UNITY ECC
This paper proposes a novel ECC, called Unity ECC, that is capable of correcting both bit errors and chip errors at the rank level. Featuring Single Symbol Correcting and Double Error Correcting (SSC-DEC) capabilities, Unity ECC offers robust protection against both growing scaling-induced bit errors and infrequent-but-severe chip-level errors. By integrating double-bit error correction into RL-ECC, Unity ECC eliminates the storage, power, and performance costs associated with OD-ECC. The high efficiency of this approach stems from repurposing detection-only syndromes in RL-ECC to correct multi-bit errors.
Unity ECC is a strong single-tier RL-ECC designed for correcting DRAM bit and chip errors. Similar to AMD, Unity ECC forms 8-bit symbols from two beats of data per ×4 chip, resulting in eight (10, 8) 8b-symbol codewords per memory transfer. Similar to RS codes, Unity ECC can correct a chip error using SSC (2-symbol redundancy) per codeword. However, its novel SSC-DEC capability can also correct two-bit errors by mapping double errors to detection-only syndromes in the SSC code. Unity ECC unifies the roles of both RL-ECC and OD-ECC within a single RL-ECC without additional redundancy.

Code Property
Our proposed Unity ECC codes can correct all single-symbol errors and all random double-bit errors. Linear block codes are uniquely determined by a parity-check matrix, H. The H-matrix dictates the structure of the encoder/decoder and the error correction and detection capabilities of the code. The H-matrix of Unity ECC should have the following properties:
1) All columns are non-zero.
2) DEC: The sums (XOR operation) of any two columns are unique non-zero values.
3) SSC: The sums (XOR operation) of all symbol-aligned columns are unique non-zero values.
4) DEC+SSC: All sums from properties 2 and 3 should be unique (apart from double-bit errors in the same symbol, which are considered symbol errors).
The first and second properties provide DEC capabilities: the syndrome of a double-bit error is the sum of two distinct columns, and these sums must all be non-zero and unique. The first and third properties relate to SSC, where the syndrome is the sum of columns aligned within a symbol. All syndromes derived from DEC and SSC must be non-zero and unique, with overlapping cases excluded (e.g., when a 2-bit error occurs in a single symbol).
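The four properties can be checked mechanically for a candidate matrix. A brute-force sketch, assuming each column is stored as a 16-bit integer (two 8-bit redundant symbols); it only validates a matrix, it does not construct one:

```python
from itertools import combinations

# Sketch: verify the Unity ECC H-matrix properties. Double-bit errors in
# the same symbol are treated as symbol errors, so same-symbol pairs are
# checked under the SSC subsets (property 3), not under DEC (property 2).

def satisfies_ssc_dec(cols, sym=8):
    if any(c == 0 for c in cols):                    # property 1
        return False
    seen = set()
    # Property 2 (DEC): pairwise XORs of columns in *different* symbols.
    for i, j in combinations(range(len(cols)), 2):
        if i // sym == j // sym:
            continue
        s = cols[i] ^ cols[j]
        if s == 0 or s in seen:
            return False
        seen.add(s)
    # Property 3 (SSC): XORs of every non-empty subset within one symbol.
    for base in range(0, len(cols), sym):
        group = cols[base:base + sym]
        for r in range(1, len(group) + 1):
            for sub in combinations(group, r):
                s = 0
                for c in sub:
                    s ^= c
                if s == 0 or s in seen:              # property 4: uniqueness
                    return False
                seen.add(s)
    return True

# Degenerate negative example: duplicated columns violate uniqueness.
bad = [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
assert not satisfies_ssc_dec(bad, sym=8)
# Degenerate positive example: 1-bit symbols reduce to a distinct-sum check.
assert satisfies_ssc_dec([1, 2, 4, 8], sym=1)
```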

Code Construction
Consider an 80-bit codeword with an 8-bit symbol size. The sum of any two H-matrix columns yields 3,160 (C(80,2)) cases, while the sum of any symbol-size-aligned columns produces 2,550 (C(10,1) × (2^8 − 1)) cases. Overlapping cases (280; C(8,2) × 10) should be excluded, resulting in 5,430 cases. If all cases are non-zero and unique, the code satisfies SSC-DEC requirements.
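The counting argument can be verified with `math.comb`:

```python
# Sketch: syndrome-count arithmetic for an 80-bit word with 8-bit symbols.
from math import comb

double_bit = comb(80, 2)                 # any two of 80 columns
single_symbol = 10 * (2 ** 8 - 1)        # 10 symbols x 255 error patterns
overlap = comb(8, 2) * 10                # double-bit errors inside one symbol

assert double_bit == 3160
assert single_symbol == 2550
assert overlap == 280
assert double_bit + single_symbol - overlap == 5430
assert 5430 <= 2 ** 16 - 1               # fits within 65,535 non-zero syndromes
```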
The number of possible non-zero syndromes using two 8-bit symbols of redundancy is 2^16 − 1 = 65,535. While this is higher than the 5,430 unique syndromes for single-symbol and double-bit errors, finding such an SSC-DEC code is non-trivial. As a starting point, one might adopt an approach based on RS or BCH codes: RS codes possess SSC correction capabilities, while BCH codes provide DEC correction. We construct the Unity ECC H-matrix using the unshortened extended RS code H-matrix (Figure 4), as building DEC properties on RS codes may be easier than constructing SSC properties on BCH codes. Unity ECC codes are constructed as systematic codes for convenience.
We select columns from the unshortened H-matrix (Figure 4) until matching the codeword length. A greedy search such as [17,43,49,63] is applied based on previously-selected columns. Algorithm 1 presents a Unity ECC construction algorithm using a greedy search.
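Algorithm 1 itself is not reproduced here; the sketch below captures only the greedy-selection idea, with an integer range as a stand-in candidate pool (the real algorithm draws columns from the unshortened extended RS H-matrix) and a simplified uniqueness check (it tracks single-column and cross-symbol pair syndromes only, not every multi-bit pattern within a symbol):

```python
# Sketch: greedy column selection. A column is kept only if every new
# syndrome it introduces -- the column itself plus its XOR with each
# previously-chosen column in a different symbol -- is non-zero and has
# not been produced before.

def greedy_select(candidates, needed, sym=8):
    chosen, seen = [], set()
    for col in candidates:
        if col == 0:
            continue
        new = {col}
        for i, prev in enumerate(chosen):
            if i // sym != len(chosen) // sym:   # cross-symbol DEC pair
                new.add(col ^ prev)
        if 0 in new or new & seen:
            continue                             # conflict: try next column
        chosen.append(col)
        seen |= new
        if len(chosen) == needed:
            break
    return chosen

cols = greedy_select(range(1, 2 ** 16), needed=16, sym=8)
assert len(cols) == 16                           # a full 80-bit-word matrix
assert len(set(cols)) == 16                      # all distinct, non-zero
```

The real search additionally backtracks or restarts from different seeds when the greedy walk dead-ends; this sketch simply skips conflicting candidates.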

Restrained mode
AMD Chipkill employs a conservative mode to increase its detection capability. However, conservative mode would reduce the correction capability of Unity ECC, which can correct bit errors originating from different chips. Therefore, Unity ECC utilizes a restrained mode instead.
Similar to conservative mode, restrained mode records an event as a DUE and discards the memory transfer block if any DUE occurs within an ECC word. However, it does not report a DUE merely because the correction positions within the memory transfer block differ, since a single-bit error in each of two chips is legitimately corrected through the DE Corrector. This approach preserves the robustness of Unity ECC against bit errors, which is further examined in Section 6. Because the DE Corrector and SSE Corrector operate in parallel, the impact on system performance is not significantly increased compared to AMD Chipkill.

EVALUATION
This section evaluates Unity ECC in terms of performance, energy, chip area, and reliability. The results demonstrate that Unity ECC can significantly improve performance, energy efficiency, and chip area by eliminating the need for OD-ECC while maintaining the same level of reliability across a wide range of Bit Error Ratios (BERs).
The state-of-the-art memory protection scheme used for comparison is a combination of SEC OD-ECC and Chipkill. The SEC OD-ECC employs Hamming codes to correct a single bit within each 136-bit block of memory. Chipkill constructs 8-bit symbols from two-beat per-chip data and applies (10, 8) RS codes for SSC. A memory access with a burst length of 16 has eight such ECC words, and we apply the conservative mode from [25] to enhance the detection capability. This mode discards corrections and reports a DUE if a memory access has corrections on more than one chip. Although it is not optimal against random bit errors and delays data forwarding until the last beat arrives, the conservative mode can compensate for the weak detection capability of SSC by effectively detecting all double or more chip errors [33].

System Performance and DRAM Energy
We first analyze the impact of eliminating OD-ECC on system performance and DRAM energy consumption. OD-ECC increases DRAM timing parameters with internal decoding and Read-Modify-Write operations. It also increases DRAM power consumption through overfetching and RMWs. Unity ECC, which can correct up to two bit errors, can eliminate OD-ECC and improve performance and energy efficiency.

DRAM Parameters: Table 2 compares the key DRAM parameters used in the evaluation. The baseline DRAM is a 16Gb DDR5-4800B ×4 chip [26], which has a 16.67ns read latency with OD-ECC. Without OD-ECC, we reduce the latency by 1.67ns based on estimations from [7,19,38]. We also reduce tCCD_L_WR, the delay between two writes on the same bank group. The JEDEC standard has two tCCD_L_WR values: 20ns for RMWs and 10ns for non-RMWs. Removing OD-ECC eliminates the need for RMW, so we use the non-RMW value for Unity ECC. We also decrease the write latency, tCCD_S_WTR, and tCCD_L_WTR by the same amount as the read latency. These parameters are defined to prevent data bus contention between reads and writes, and JEDEC derives their values from the read latency. With a 4-cycle reduction in read latency, we adjust the parameters accordingly to avoid bus contention.
To estimate the energy savings of eliminating OD-ECC from DDR5, we compare the power numbers of DDR4 and DDR5. Micron DDR4-3200 [47] does not have OD-ECC, and it has a ratio of 100 : 75.6 between read and write currents (i.e., IDD4R and IDD4W), whereas Micron DDR5-4800 [48] with OD-ECC has a ratio of 100 : 108.5. Assuming that the increase in write current is primarily due to RMW for OD-ECC, we multiply the DDR5 IDD4R current by the old ratio to estimate DDR5 IDD4W without OD-ECC. Using this approach, the estimated IDD4W current for DDR5 without OD-ECC is 240mA, which is significantly less than the original IDD4W current (345mA). We conservatively use the same IDD4R with and without OD-ECC.
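The current-scaling estimate is simple arithmetic. In the sketch below, the IDD4R value is an assumption back-derived from the paper's 240mA result, not a datasheet number:

```python
# Sketch of the write-current estimate: apply the DDR4 read:write current
# ratio (100 : 75.6, no OD-ECC) to the DDR5 read current to estimate the
# DDR5 write current without RMW.

ddr4_write_over_read = 75.6 / 100       # Micron DDR4-3200, no OD-ECC
idd4r_ddr5_ma = 317.5                   # assumed DDR5 IDD4R (mA), back-derived

estimated_idd4w = idd4r_ddr5_ma * ddr4_write_over_read
assert round(estimated_idd4w) == 240    # matches the paper's estimate
assert estimated_idd4w < 345            # well below the OD-ECC IDD4W
```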
Unity ECC can increase RL-ECC decoding latency with its more complex SSC-DEC. To mitigate the impact on performance, we separate the decoding process into two parts: error detection and correction. The error detection part generates syndromes and checks whether they are all zero. If all syndromes are zero, it indicates no error, and the data can be forwarded to the requester without any further correction steps. The longer correction latency occurs only in rare cases of errors. Error detection in Unity ECC and Chipkill operates on the same-sized ECC blocks, and we do not increase the memory read latency in the performance evaluation.

Methodology: To evaluate the performance of the various ECC schemes, we run 23 benchmarks from SPEC CPU 2006 [14]. We use Pin [45,58] to extract each program trace after fast-forwarding the first 100 million instructions. Then we feed the traces to an architectural simulator, Ramulator [36], with the DRAM parameters in Table 2 and the CPU configuration in Table 3. The simulator warms up the cache by running the first 100M instructions and executes up to 200M more instructions, providing the execution cycle information and a DRAM command trace. Then we feed the command trace to DRAMPower [8] to estimate the DRAM energy consumption.
We categorize the workloads as memory-intensive or non-intensive based on the last-level cache misses-per-kilo-instruction (MPKI) during single-core execution. Twelve benchmarks with ≥ 1 MPKI are considered memory-intensive, while the remaining eleven are non-intensive. For multi-core evaluation, we randomly select 4 distinct benchmarks and run them on 4 cores in parallel. The low memory intensity mix ("L") includes three or more non-intensive benchmarks, the medium memory intensity mix ("M") consists of two non-intensive and two memory-intensive benchmarks, and the high memory intensity mix ("H") contains three or more memory-intensive benchmarks.
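The classification and mix-labeling rules above can be expressed as a small sketch (the example MPKI values are made up for illustration):

```python
# Workload classification by LLC MPKI and 4-benchmark mix labeling,
# following the rules in the text.
def classify(mpki):
    return "memory-intensive" if mpki >= 1 else "non-intensive"

def mix_label(mpkis):                 # four benchmarks per mix
    n_intensive = sum(classify(m) == "memory-intensive" for m in mpkis)
    if n_intensive <= 1:
        return "L"                    # three or more non-intensive
    if n_intensive == 2:
        return "M"                    # two and two
    return "H"                        # three or more memory-intensive

print(mix_label([0.1, 0.3, 5.0, 12.0]))  # -> M
```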
Single-Core Performance: Figure 7 (top) illustrates the instructions per cycle (IPC) with Unity ECC, normalized to the baseline. Benchmarks are sorted in ascending MPKI order. Overall, memory-intensive benchmarks benefit significantly from the reduced read latency of Unity ECC (geomean: 7.3%), while non-intensive ones show marginal gains (geomean: 0.3%), as expected. The overall performance improvement is 3.6% across all benchmarks.
To understand the origins of the enhanced performance, we evaluated the average end-to-end read latency in the libquantum workload. The latency reduces from 133 DRAM cycles in the baseline to 112 DRAM cycles when using Unity ECC. This decline is significantly larger than the DRAM read latency reduction in Table 2 (a 4-cycle reduction due to the exclusion of OD-ECC). On closer examination, the residual improvement can be attributed to better bandwidth utilization from enhancements to other timing parameters (e.g., tCCD_L_WR, tCCD_L_WTR). These enhancements permit shorter intervals between DRAM commands, which in turn boosts DRAM bandwidth utilization by 6%. This increased utilization lowers the average number of requests in the memory controller queue from 7.4 to 6.9, leading to shorter queueing delays and, consequently, faster end-to-end memory latency.
Among the memory-intensive applications, performance gains are not proportional to MPKIs. To analyze this, we measure the DRAM row buffer hit ratios. Applications with high row-buffer locality show higher performance gains (e.g., milc with 17.2%), since their memory latency is dominated by DRAM read latency. Meanwhile, applications with high MPKI but low locality show modest improvements (e.g., cactusADM with 2.1%), as row-buffer miss latency includes tRP and tRCD, which are unaffected by Unity ECC.
Single-Core Energy: Figure 7 (bottom) presents the normalized DRAM energy consumption of Unity ECC. For non-intensive benchmarks, Unity ECC shows a 0.5% energy reduction owing to lower write power. Note that we conservatively do not reduce the DRAM read power (IDD4R) in this evaluation. For memory-intensive benchmarks, the geomean energy saving is 8.0%, owing to lower write power and less standby energy from faster execution. The overall energy saving is 4.2% across all benchmarks.
Multi-Core Performance: Figure 8 presents the multi-core performance results. We measure individual IPC improvements of the 4 benchmarks and use their geomean as the overall speedup. The "L" mix shows a 3.7% overall speedup, and the "H" mix shows an 8.2% speedup, both larger than their single-core counterparts (0.3% and 7.3%, respectively). We find that multi-core execution benefits more from the smaller tCCD_L_WR, as it allows faster back-to-back writes from different cores. The "M" mix shows an intermediate speedup.

Reliability Against Bit Errors
Continuous process scaling has introduced new types of faults (e.g., variable retention time), many of which manifest as random bit errors. We first demonstrate that Unity ECC can be more reliable against these growing bit errors than conventional ECC schemes.
Methodology: To assess reliability against bit errors, we run random bit-error injection simulations with varying Bit-Error-Ratios (BERs). We use a common multiple of the OD-ECC and RL-ECC blocks as the target for error injection (Figure 9). OD-ECC uses 136-bit blocks on a chip, which span two memory transfer blocks. This makes a group of two memory transfer blocks and their OD-ECC redundancy the injection target. We randomly inject errors into a target block with BERs varying from 10⁻⁶ to 10⁻². We then apply actual ECC decoding to the error-injected block to determine whether the erroneous block is correctable (CE), detectable (DUE), or undetectable (SDC). While OD-ECC can correct errors, a detected error is not reported to RL-ECC per the DDR5 standard. RL-ECC generates one output per memory transfer block, and the final output is the worse one (NE = CE > DUE > SDC). For example, if memory transfer blocks 1 and 2 report CE and DUE, respectively, the final output is DUE.
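A minimal sketch of this injection flow, with the severity ordering and the worse-of-two-blocks rule from the text (the ECC decoders themselves are omitted):

```python
# Random bit-error injection into a two-transfer-block target, plus the
# rule that the RL-ECC reports the worse of the two per-block outcomes.
import random

SEVERITY = {"NE": 0, "CE": 0, "DUE": 1, "SDC": 2}   # NE = CE > DUE > SDC

def inject(block_bits, ber, rng):
    """Flip each bit independently with probability `ber`."""
    return [b ^ (rng.random() < ber) for b in block_bits]

def final_outcome(block1_outcome, block2_outcome):
    """One RL-ECC outcome per transfer block; report the worse one."""
    return max(block1_outcome, block2_outcome, key=lambda o: SEVERITY[o])

rng = random.Random(0)
noisy = inject([0] * 272, 1e-2, rng)   # two 136-bit OD-ECC blocks
print(final_outcome("CE", "DUE"))      # -> DUE, as in the example above
```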
Result: Figure 10 presents the results. OD-ECC exhibits more robustness to bit errors than Chipkill. However, Unity ECC (restrained mode) is even more robust to bit errors than OD-ECC, since it enables double error correction for each RL-ECC block. It also shows a similar level of reliability to the baseline, and when the BER is higher than 10⁻⁴, it exhibits superior reliability to the baseline. This is because Unity ECC has an increased likelihood of correcting multi-bit errors occurring in multiple chips, whereas the baseline fails to correct such errors. Therefore, Unity ECC offers higher reliability in bit-error situations than OD-ECC and Chipkill, and even surpasses the baseline when the BER is 10⁻³ or higher.

Reliability Against Bit and Chip Errors
DRAM has long suffered from multi-bit and chip-level errors. To demonstrate that Unity ECC is more reliable against bit and chip errors, we run scenario-based reliability experiments. The experiment fixes the type and number of errors, randomly generates the positions/values of the errors, and applies actual decoding to evaluate the reliability of the ECC schemes.
Scenario-based: In this experiment, we consider three types of errors: per-chip Single Bit Error (SBE), per-chip Double Bit Error (DBE), and Single Chip Error (SCE). An error scenario specifies the number and types of errors in memory transfer blocks. Based on the scenario, we randomly generate errors. For SBE and DBE, the chip and bit positions of the error(s) are chosen randomly. For SCE, the chip position and the chip error value are randomly generated. We evaluate five ECC schemes: 1) OD-ECC only, 2) Chipkill only, 3) Baseline, 4) Unity ECC in the conservative mode, and 5) Unity ECC in the restrained mode. Once errors are generated, we apply real ECC decoding to determine whether the erroneous block is correctable (CE), detectable (DUE), or undetectable (SDC), similar to Section 6.2.
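Random error generation for the three error types might look like the following sketch, assuming 8-bit symbols and 10 chips per sub-channel as in the paper's (10, 8) code:

```python
# Per-scenario error generation: returns (chip index, error value) where the
# error value is an XOR mask over one chip's 8-bit symbol.
import random

def gen_error(kind, n_chips=10, symbol_bits=8, rng=random):
    chip = rng.randrange(n_chips)
    if kind == "SBE":                         # one random bit in one chip
        return chip, 1 << rng.randrange(symbol_bits)
    if kind == "DBE":                         # two distinct bits in one chip
        b1, b2 = rng.sample(range(symbol_bits), 2)
        return chip, (1 << b1) | (1 << b2)
    if kind == "SCE":                         # random nonzero chip error value
        return chip, rng.randrange(1, 1 << symbol_bits)
    raise ValueError(kind)
```

A scenario such as DBE + SCE is then a list of such draws applied to the same codeword, with the chips drawn independently.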
Table 4 compares the reliability of the five ECC schemes against ten error scenarios. OD-ECC can correct SBEs but cannot correct DBEs or SCEs. Chipkill can correct single-chip errors, including SBEs, DBEs, and SCEs, yet cannot correct multi-chip errors. The baseline can correct multi-chip errors as long as there are no multi-bit errors on two chips. However, this comes at high costs in area, performance, and power consumption. Moreover, it has relatively high SDC ratios for DBEs, because a miscorrection by OD-ECC increases the error severity from 2-bit to 3-bit.
The evaluation compares the two modes of Unity ECC: conservative mode and restrained mode. The results reveal that the restrained mode provides higher correction capability and comparable detection capability in all scenarios except one: SCE + SCE. In this particular scenario, the conservative mode offers slightly higher detection capability. Given these results, the restrained mode represents Unity ECC in the preceding and following sections.
Comparing Unity ECC in the restrained mode against the baseline shows that Unity ECC performs better in two scenarios: DBE + DBE and DBE + DBE + DBE. We find that miscorrections by OD-ECC can make errors more difficult for the RL-ECC to detect and correct. In contrast, Unity ECC can correct multiple DBEs as long as they belong to different RL-ECC words. The baseline outperforms in the four other scenarios. For SBE + SBE + SBE, DBE + SCE, and SCE + SCE, the baseline exhibits a marginally superior ability for error correction and detection: it improves correction probabilities by 0.39, 0.07, and 0 percentage points, respectively, and reduces SDC probabilities by 0.03, 4.16, and 0.0000004 percentage points, respectively. The only scenario where the baseline significantly excels is SBE + SCE, which it can correct 100% of the time using its two-level protection, whereas Unity ECC corrects only 3.5% of such errors. Nonetheless, the results illustrate that, depending on which error patterns are more prevalent, Unity ECC can deliver an equivalently high level of reliability while significantly enhancing system performance and reducing DRAM energy consumption.

Hardware Overheads
To estimate area and latency overheads, we implement SystemVerilog models for the encoders and decoders of the (10, 8) 8b RS codes of the baseline RL-ECC and the (10, 8) 8b codes for Unity ECC. We synthesize the models using Synopsys Design Compiler with UMC 28nm SVT/LVT cells under the worst-case condition. The virtual target clock frequency is set to 2.4GHz with a 40% margin for clock uncertainty and wire delay, leaving a 0.25ns budget for gate delays. For power estimation, we use the default switching activity factor of 10%. Table 5 presents the hardware overhead results.

Latency:
Both schemes exhibit similarly low latencies: one cycle for encoding and one cycle for error detection. However, Unity ECC shows a 0.5ns increase in correction latency compared to the baseline. Unlike the baseline, Unity ECC incorporates a DE syndrome table to facilitate the DEC process. Latency grows during the step that matches the syndrome against the 2880 double-error syndromes via a multiplexer. Nonetheless, the actual impact on system performance is minimal: in Unity ECC, error detection is executed first, and error correction is conducted only upon the detection of an error; otherwise, the data is forwarded directly.
In reality, errors are infrequent (the single-cell fault rate is lower than 10⁻⁴ in the 1Y nm process [7]), so error correction is rarely invoked. Therefore, although error correction adds 0.5ns, the resulting decrease in system performance is negligible.

Area and Power:
Unity ECC increases the encoder and decoder area by 151 µm² and 8951 µm², respectively, in the 28nm process. Most of the increase is due to the optimized look-up table in the DEC decoder. While the relative area increase is significant, the Unity ECC decoder would consume an insignificant portion of modern processors, whose areas are in the hundreds of mm² [55,67]. For instance, the area overhead is only 0.009% for a 100 mm² processor, and the ratio will continue to decrease with process scaling. On the other hand, eliminating OD-ECC can reduce the DRAM chip size by 6.9% [7]. Given that a modern system has tens of DRAM chips across ranks and DIMMs, the size reduction is amplified, leading to significant overall cost savings. Similarly, Unity ECC increases the power consumption of the RL-ECC decoder by 25.4 mW, yet the power savings from DRAM and faster execution can easily offset this cost.

CONCLUSION
This paper presents Unity ECC, a novel memory protection scheme that addresses key challenges in DRAM technology: high access latency, energy consumption, hardware overhead, and susceptibility to errors. Implemented for DDR5 DRAM as a single-tier RL-ECC, Unity ECC eliminates OD-ECC and reduces DRAM redundancy from 32.8% to 25%, leading to improved performance and reduced energy consumption. The proposed flexible construction algorithm and efficient decoding method allow Unity ECC to offer significant benefits over conventional DDR5 while maintaining acceptable levels of system reliability.

ACKNOWLEDGMENTS
Appendix: Artifact Description/Artifact Evaluation

1.1 Abstract
We evaluate system performance with and without On-Die ECC (OD-ECC) using an architectural simulator, Ramulator [1]. We extended the existing Ramulator by incorporating the DDR5 configuration while modifying only the key timing parameters affected by OD-ECC. We conduct single-core experiments with a newly added 16Gb DDR5-4800B x4 chip configuration, and we modify Ramulator to support multi-core (4-core) configurations and conduct experiments.

Artifact Identification
1) The main contribution of this simulator is its extensibility, which allows for quick performance measurements and easy modification to support current and future DRAM standards.
2) The software architecture of this simulator is decoupled and modular, providing out-of-the-box support for a wide array of DRAM standards without sacrificing simulation speed. It uses C++ as its primary programming language and supports both trace-driven and execution-driven simulation modes.
3) This simulator provides an open-source platform that facilitates reproducibility. Moreover, it implements existing DRAM standards, enabling users to understand and modify them easily. Additionally, the simulator employs cycle-accurate simulation, ensuring accurate reproduction across different platforms.

DRAM ENERGY [2]

2.1 Abstract
We employed DRAMPower [2] to evaluate DRAM energy consumption with and without OD-ECC. By extending the existing simulator, we incorporated the DDR5 configuration and modified only the key timing parameters and current values influenced by the presence of OD-ECC. We feed the command trace from Ramulator to DRAMPower and conduct experiments with a newly added DDR5-4800B DRAM (16Gb x4 chip) configuration.

Artifact Identification
1) The main contribution of the given simulator is its ability to swiftly and accurately measure the energy consumption of various DRAM memory types based on JEDEC standards.
2) This simulator offers both command-level and transaction-level approaches, with our implementation utilizing the command-level method. The command traces resulting from Ramulator are transferred to DRAMPower, which then reports the energy measurement outcomes based on these traces.
3) By providing a validated power model, the simulator facilitates reproducibility, accelerates simulation speed, and supports a wide range of DRAM memory operations (e.g., ACT, PRE, etc.).

RELIABILITY

3.1 Abstract
We evaluate reliability by injecting bit and chip errors and applying ECC schemes.We conducted experiments with a DDR5 ECC-DRAM (x4 chip) configuration.

Artifact Identification
1) The key contribution of this simulator is its ability to compare the reliability of various ECC schemes by injecting bit errors with varying BERs (Bit-Error-Ratios) and applying ECC, as well as its extendibility to DDR-DIMM-based systems (e.g., DDR3, DDR4, DDR5).
2) This simulator enables the evaluation and comparison of reliability without requiring significant simulation execution time.It primarily uses C++ as the programming language.
3) Designed for extendibility to future DRAM standards, the simulator can be easily adapted to support DDR-DIMM-based systems, facilitating reproducibility.

HARDWARE OVERHEADS

4.1 Abstract
To estimate area and latency overheads, we implement SystemVerilog models for the encoders and decoders of the (10, 8) 8b RS codes of the baseline RL-ECC (Chipkill) and the (10, 8) 8b codes for Unity ECC. We synthesize the models using Synopsys Design Compiler.

Artifact Identification
1) To help readers understand the computational artifacts, we provide a detailed description of the artifact meta information used in our approach.
2) The software architecture consists of modular components that facilitate the integration of the UMC 28nm library into the synthesis flow. The data models employed capture the essential characteristics of the library components, such as cell timing, power, and area information, allowing for accurate performance evaluation and optimization of the synthesized designs.
3) Lastly, we present a clear demonstration of the extent to which our computational artifacts contribute to the reproducibility of the experiments. We can achieve consistent results across different design instances and technology nodes, enabling the research community to compare and validate various design methodologies and optimizations effectively. Furthermore, our detailed description facilitates easy adaptation to other technology libraries, paving the way for improved reproducibility in future studies.

Experiment workflow
1) Choose a DRAM standard or configuration to simulate. Ramulator [1] supports a wide range of DRAM standards, including DDR3, DDR4, LPDDR3, and LPDDR4. We newly added DDR5.
2) Define the memory access pattern for the simulation. This can be done by creating a trace file that contains a sequence of memory requests (e.g., read or write operations) to be executed during the simulation.
3) Configure the simulator parameters such as cycle time, memory size, and number of channels. These parameters can be adjusted to match the specific DRAM system being simulated.
4) Choose a simulation mode: trace-driven or execution-driven. We choose the trace-driven mode, in which Ramulator reads memory requests from a trace file and executes them in order.
5) Run the simulation using Ramulator and collect performance metrics such as memory access latency and bandwidth utilization.
6) Analyze the results obtained from the simulation to draw conclusions about the performance of the simulated DRAM system under different conditions (read latency, tCCD_L_WR, etc.).
7) Repeat steps 1-6 for different DRAM standards or configurations to compare their performance characteristics with and without OD-ECC.

Evaluation and expected result
The expected results of the simulator can be inferred from the performance impact of OD-ECC presented in the existing paper. Due to OD-ECC, the read latency and tCCD_L_WR increase, resulting in a 5-10% average performance decrease in memory-intensive applications [4]. The Unity ECC proposed in this paper shows a performance increase of 7.3% (geomean) in single-core memory-intensive applications and 8.2% (geomean) in the multi-core high memory intensity mix, which is consistent with the reported values [4].

Experiment workflow

2) Choose the command-level integration and the DRAM configuration (timing parameters, capacity, etc.). We newly added DDR5.
3) Log the DRAM command traces from the existing memory controller setup.
4) The DRAM command scheduler assumes a closed-page policy, employs FCFS scheduling across transactions, and uses ASAP scheduling for DRAM commands.
5) Provide the DRAM command traces in one of two ways: (a) as XML files parsed by the tool, or (b) compile the tool as a library and call it directly from a simulator using the provided API.
6) DRAMPower performs DRAM command trace analysis based on memory state transitions, avoiding cycle-by-cycle evaluation and speeding up simulations.

Evaluation and expected result
In DRAM access, two distinct cases exist: read and write operations. Unity ECC demonstrates a 10% reduction in read latency and a twofold decrease in tCCD_L_WR, while read latency generally exerts a dominant influence on DRAM energy consumption. Consequently, the expected reduction in DRAM energy consumption should be less than 10%. In actuality, the DRAM energy consumption in a single-core configuration shows an 8.0% (geomean) reduction for memory-intensive benchmarks and an overall decrease of 4.2% (geomean). Thus, the expected and actual values match.

Experiment workflow

• ECC schemes: 1) OD-ECC (only OD-ECC is used for each chip), 2) Chipkill (only RL-ECC is used), 3) Baseline (both OD-ECC and Chipkill are used), 4) Unity ECC (only the RL-ECC proposed in this paper is used)
• Outputs: CE, DUE, and SDC ratios
2) Perform error correction using one of the four ECC schemes.
3) Evaluate CE, DUE, and SDC for each of the two memory transfer blocks.
4) Report the worse case of the two as the final result.
5) Repeat this experiment for each BER and ECC scheme 1 billion times.
• Error type: Error scenarios

2) Constraint Creation: Define necessary constraints for Chipkill and Unity ECC, including timing, voltage, and area requirements. Generate a Synopsys Design Constraint (SDC) file to provide these constraints to the synthesis tool.

3) Library Preparation: Acquire the UMC 28nm logic library, typically comprising standard cell libraries, I/O libraries, and memory compilers. These libraries supply the essential information for the synthesis tool to map the Chipkill and Unity ECC designs to technology-specific components.
4) Tool Setup: Configure synthesis tools for Chipkill and Unity ECC, such as Synopsys Design Compiler, Cadence Genus, or Mentor Graphics Precision RTL. Ensure proper setup with the correct technology library files, constraint files, and other required settings.
5) Synthesis Execution: Execute the synthesis tool using the SystemVerilog design, constraint files, and UMC 28nm logic library. The tool optimizes the design based on the provided constraints, generating a gate-level netlist representing the design with technology-specific gates and components.
6) Review Synthesis Results: Examine synthesis logs and report files to confirm that the design adheres to the specified constraints. Assess the design's performance, area, and power consumption to ensure alignment with target specifications.

Evaluation and expected result
In contrast to conventional Chipkill, Unity ECC requires additional XOR operations in the encoder and incorporates a DE syndrome table in the decoder, which is expected to result in increased area and power consumption for both the encoder and decoder. Moreover, the inclusion of the DE syndrome table in the decoding process is expected to generate a large multiplexer, consequently increasing decoding latency.

Figure 1 :
Figure 1: A comparison of conventional and Unity ECC.

Figure 3 :
Figure 3: Applying 8-bit Symbol AMD Chipkill to DDR5.

The mechanism assumes that some of these corrections are miscorrections of multi-chip errors rather than corrections on distinct chips. This assumption is based on the low likelihood of different chips exhibiting errors during the same access. In such cases, the mechanism discards the corrections and reports the event as a Detected Uncorrectable Error (DUE). In the case of DDR5, the ECC-DIMM configuration has changed to accommodate the new sub-channel architecture. Instead of the traditional (64 + 8)-pin ECC DIMM, a DDR5 ECC-DIMM has (32 + 8) × 2 sub-channels. This new configuration preserves the 64B access granularity and enhances parallelism. However, it also means that existing ECC techniques, developed for the older (64 + 8)-pin ECC DIMM, may need to be redesigned to protect DDR5 memories effectively.

Figure 5 :
Figure 5: H-matrix example of (10, 8) Unity ECC with generator polynomial = 0x15F.

Our Unity ECC construction algorithm is flexible, allowing adjustments to codeword and data lengths, making it applicable to various systems. We focus on DDR5 protection in this paper; Figure 5 displays a Unity ECC code example with 64-bit data and an 80-bit codeword matching DDR5's code configuration.

Figure 6
Figure 6 shows the Unity ECC decoder's block diagram. Each codeword is concurrently transmitted to a Syndrome Generator, SSE Corrector, and DE Corrector. The Syndrome Generator produces

Figure 6 :
Figure 6: Block diagram of Unity ECC decoder.

a 16-bit syndrome via XOR operations between the H-matrix and the codeword. It then forwards the syndrome to the SSE Corrector and DE Corrector, which operate in parallel. The SSE Corrector handles three cases (CE, NE, and DUE), as does the DE Corrector (through the DE Syndrome Table). The Decision block determines which data and decode result to choose by comparing the two decode results. If both decode results are 0, it indicates an NE or DUE case, and either data and decode result can be chosen. Conversely, if one decode result is 0 and the other is 1, it represents a CE case; thus, the data and decode result from the Corrector whose decode result is 1 should be selected. Unity ECC considers two-bit errors within a symbol as a symbol error; thus, in no case should both the SSE and DE correctors output 1. Since the DE Corrector and SSE Corrector operate in parallel, the impact on system performance is not significantly increased compared to AMD Chipkill. Unity ECC uses the restrained mode, as a 1-bit error on each of two chips is corrected through the DE Corrector.
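A sketch of the Decision block's selection logic described above (flag and data names are illustrative stand-ins for the hardware signals):

```python
# Decision block: each corrector outputs a 1-bit "corrected" flag plus its
# candidate data. By construction, both flags are never 1 simultaneously,
# since two-bit in-symbol errors are treated as symbol errors (SSE path).
def decide(sse_flag, sse_data, de_flag, de_data, raw_data):
    assert not (sse_flag and de_flag)   # invariant from the text
    if sse_flag:
        return sse_data                 # CE handled by the SSE Corrector
    if de_flag:
        return de_data                  # CE handled by the DE Corrector
    return raw_data                     # NE or DUE: either path's result works
```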

Figure 7 :
Figure 7: The IPC and DRAM energy of Unity ECC, normalized to the baseline. Benchmarks are sorted by the LLC MPKI.

Figure 8 :
Figure 8: The normalized multi-core speedup of Unity ECC.

Figure 9 :
Figure 9: Configuration of DDR5 reliability simulation by comparison of two memory transfer blocks.

Table 4 :
A comparison of reliability against error scenarios.

1) Error type: BER
• ECC scheme: OD-ECC
• BER: 10⁻⁶
• Expected value: Assuming there are about 1000 bits in two memory transfer blocks, running the experiment around 1000 times should result in a bit error (10⁶/10³). In this case, a DUE or SDC occurs if there are two or more errors in a single chip. Use the binomial distribution.
1-1) The total number of trials N is 136 for each chip with OD-ECC.
1-2) The probability p of a bit error occurring is 10⁻⁶.
1-3) The probability of no errors in a single chip is (1 − p)^136.
1-4) The probability of a single bit error in a chip is 136 × (1 − p)^135 × p.
1-5) The probability of 2 or more errors occurring in a single chip is p2 = 1 − (1 − p)^136 − 136 × (1 − p)^135 × p.
1-6) Therefore, the probability of 2 or more errors occurring in at least one chip among ten chips is 10 × p2 × (1 − p2)^9 + 45 × (p2)^2 × (1 − p2)^8 + ... + (p2)^10 × (1 − p2)^0.
1-7) The resulting value is approximately 10⁻⁷.
• Actual value: Running the experiment 1 billion times results in approximately 100 instances of DUE or SDC, leading to system failure. This demonstrates that the expected and actual values match.
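The derivation above can be checked numerically with the same binomial model:

```python
# Probability that at least one of ten chips sees >= 2 bit errors in its
# 136-bit OD-ECC block, following steps 1-1) through 1-7) above.
p = 1e-6          # per-bit error probability (BER)
N = 136           # bits per OD-ECC block, per chip

p_no_err = (1 - p) ** N
p_one_err = N * (1 - p) ** (N - 1) * p
p2 = 1 - p_no_err - p_one_err          # >= 2 errors in one chip's block

# 1 - (1 - p2)^10 sums the whole binomial series in step 1-6) at once:
p_fail = 1 - (1 - p2) ** 10
print(p_fail)                          # ~9.2e-08, i.e. approximately 1e-7
```

This matches the observed ~100 DUE/SDC events in 10⁹ trials (a rate of 10⁻⁷).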

Table 1 :
Comparison of prior works and Unity ECC.

Table 2 :
A summary of the DRAM parameters

Table 3 :
The simulation configuration

Table 5 :
A comparison of ECC encoder/decoder hardware overheads per DDR5 sub-channel.