Predicting Future-System Reliability with a Component-Level DRAM Fault Model

We introduce a new fault model for recent and future DRAM systems that uses empirical analysis to derive fault models at the level of internal DRAM components. This modeling level offers higher fidelity and greater predictive capability than prior models that rely on logical-address based characterization and modeling. We show how to derive the model, overcoming several challenges of using a publicly-available dataset of memory error logs. We then demonstrate the utility of our model by scaling it and analyzing the expected reliability of DDR5, HBM3, and LPDDR5 based systems. In addition to the novelty of the analysis and the model itself, we draw several insights regarding on-die ECC design and tradeoffs and the efficacy of repair/retirement mechanisms.

CCS CONCEPTS: • Hardware → Hardware reliability; • Computer systems organization → Dependable and fault-tolerant systems and networks.


1 INTRODUCTION
Large-scale field studies of DRAM faults and errors [5, 7, 8, 15, 33, 43, 44, 47, 49-51] have been invaluable in understanding memory reliability and in developing and evaluating memory fault tolerance mechanisms and techniques [16, 17, 26-28, 36]. Such field studies have been limited to CPU memory in highly-reliable error checking and correcting (ECC) DDR DRAM modules; such memories comprise the vast majority of memory in large-scale installations, and CPUs provide sophisticated error reporting for diagnosis.
Our goal is to address two important limitations of directly relying on field-study data for research and development. First, field studies are necessarily retrospective and cannot directly be used to accurately predict the efficacy of different fault tolerance approaches for future DRAM technologies. Second, the proliferation of accelerators and new DRAM interfaces is pushing a growing share of datacenter memory capacity into DRAM components outside of traditional CPU ECC modules, including HBM [6, 9, 23, 48, 54] and LPDDR [30]. While HBM and LPDDR share much of their internal design and fabrication technology with DDR DRAM, their configurations differ, as do the number and type of interface pins (e.g., 4 data pins per DRAM chip in ECC DDR modules vs. 16-64 pins for LPDDR5 [30]).
We achieve our goal by, for the first time, developing a configurable DRAM fault model based on deep analysis of a large-scale, publicly available memory error log [8] that matches error events with their root-cause physical faults at the internal DRAM component level. The dataset is large, containing over 70 million error events from over 250,000 nodes with over 3 million DDR4 ECC DIMMs, which offers enough faults to perform root-cause analysis with statistical significance. The dataset has some limitations, such as not including error information at the bit or pin levels and containing some physically-unlikely error patterns. Our analysis overcomes these limitations, and we validate the result by comparing the error and failure statistics predicted by a simulator we develop, based on our model, against failures observed in both the dataset and recent field studies.
Our model is important and timely. First, new DRAM technologies are rapidly increasing in importance and overall share of capacity. For example, the current fastest supercomputer on the Top500 list, Frontier, has half of its memory capacity in GPU-integrated HBM [3, 39]. Second, even traditional DDR ECC modules are facing challenges with future scaled DRAM technology: (1) the DRAM chips in a module now implement their own internal on-die ECC [11, 22] for tolerating scaling-related bit errors that are not present in the DRAMs of available field studies; and (2) DDR5 per-chip access granularity is coarser than that of DDR4, necessitating greater chip redundancy for the desired "chipkill"-level reliability [17, 22, 36]. Worse, it is possible that granularity will continue to increase with future DDR standards.
Third, prior models based on field data express faults at the granularity of logical DRAM address components: rank, bank, row, and column, possibly extending to the bit and data pin (DQ) level [50]. However, faults that appear coarse-grained at the logical bank level, for example, are not rare. Current "chipkill"-style ECC schemes tolerate such faults in DDR4 ECC modules, even when treated as full bank failures. However, when also accounting for the scaling errors that are already appearing in DDR5, modeling faults at the logical rather than the physical level substantially inflates the estimated probability of system failure resulting from a detected uncorrectable error or an undetected error.
In summary, we make the following contributions:
• We propose the first internal physical component based empirical fault model for modern DRAM. The model is based on a publicly available, large scale dataset of memory errors observed in the field [8] (Section 3).
• We develop a simulator based on this model that can be configured to simulate the expected faults and errors not only of the DRAM technology used in the error log dataset, but also, for the first time, of current and future DRAM technologies, including DDR5, HBM3, and LPDDR5 (Section 5).
• We use the simulator to closely reproduce the error and failure statistics observed in the dataset and in other recent field studies [5] (Section 6.1).
• We conduct a set of case studies to demonstrate important potential use cases of our model, including evaluating the impact of different combinations of ECC mechanism, DRAM technology, and technology parameters, e.g., for HBM3, LPDDR5, and DDR5, and varying rates of scaling-induced errors. We identify interesting insights, including that the expected reliability differences between vendors can be quite significant for HBM and LPDDR based systems because the fault type distribution varies between vendors (Section 6).
• We use our model to identify a new opportunity for reducing the expected rate of undetectable errors in HBM and LPDDR memories by incorporating address information in their ECC scheme; this is needed for these memories because HBM and LPDDR5 memory channels typically read from a single chip per memory access, as opposed to an entire DDR rank (Section 4).
• We demonstrate the utility of our model by showing how it can be used to better estimate the expected lifetime system costs when using integrated memories, such as HBM3 and LPDDR5. We show that designing row and column repair and address retirement mechanisms with knowledge from our model can slash the expected processor-module replacement rate by 4× compared to prior methodologies, identifying more favorable tradeoff points in memory-system design that may enable better performance and efficiency.
• We make the detailed model and simulator available as open-source software at https://github.com/lpharch/DRAM_FAULT_SIM.

2 BACKGROUND

2.1 Reliability Terminology
A fault is a physical defect that can potentially cause an error; an error is the difference between the intended and actual system state [4]. Transient faults can be caused by external triggers such as high-energy particle impacts. Permanent faults, on the other hand, are physical defects that consistently lead to errors (e.g., a stuck-at bit).
A failure occurs when the system does not deliver its service. If an error is detectable but uncorrectable, it results in a detectable but uncorrectable error (DUE) failure. If an error is miscorrected or undetected, wrong data may escape the error control system, potentially leading to a silent data corruption (SDC) failure. Reliability refers to the continuation of service without a failure and can be expressed in terms of failures in time (1 FIT = 1 failure per 1 billion hours of operation).
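To make the FIT unit concrete, here is a minimal sketch (ours, not part of the released simulator) that converts a per-device FIT rate into a lifetime failure probability, assuming a constant failure rate:

```python
import math

HOURS_PER_YEAR = 8766  # average year length in hours, including leap years

def fit_to_lifetime_failure_prob(fit: float, years: float) -> float:
    """Probability that a device fails within `years`, assuming a
    constant failure rate of `fit` failures per 10^9 device-hours."""
    rate_per_hour = fit / 1e9
    return 1.0 - math.exp(-rate_per_hour * years * HOURS_PER_YEAR)

# Example: a device with 100 FIT over a 5-year lifetime.
print(fit_to_lifetime_failure_prob(100, 5))  # ~0.0044, i.e., about 0.44%
```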

2.2 Error Checking and Correcting (ECC)
DRAM subsystems are designed with multiple reliability features to ensure the integrity of stored data. One of these features is the use of ECC, which stores redundant check bits alongside the data to detect and correct errors during read access to the DRAM.
ECC mechanisms can be categorized based on the error types that they can correct. For example, a single-bit error correction (SEC) ECC can correct only single-bit errors, while a single-symbol correction (SSC) ECC can correct multiple errors that occur within a single multi-bit symbol. In such an ECC, if a single symbol spans an entire DRAM device, the scheme provides single-device data correction (SDDC); ChipKill ECC [12] is an example. BCH codes are widely used for correcting bit errors, while Reed-Solomon (RS) codes are commonly employed for correcting symbol errors. ECC can also be categorized based on the location of the encoder/decoder. A rank-level ECC uses redundant chips in a rank and operates at the memory controller, while on-die ECC uses redundant memory cells in a bank and operates within the DRAM chip.
The notation we use for ECC codes includes the code type (e.g., RS or SEC), the symbol size if greater than 1 bit, and the number of symbols in the codeword, both in total and for data. For example, RS8(18,16) represents an 8-bit symbol RS code with 16 symbols (128 bits) of data and 2 redundant symbols (16 bits). For RS codes specifically, the number of symbols the code can correct is equal to half the number of redundant symbols. Further, the code guarantees correct decoding as long as the number of erroneous symbols in a codeword does not exceed its correction capability. In practice, RS codes can also detect errors that exceed this level of guaranteed decoding, but without guaranteed detection. In general, the longer the symbol and the codeword, the higher the coverage of error detection beyond the theoretical guarantees. However, longer symbols and codewords increase decoder complexity.
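To make the notation concrete, a small sketch (illustrative, not from the released simulator) that derives the basic parameters implied by this notation:

```python
def rs_params(symbol_bits: int, n: int, k: int) -> dict:
    """Derive basic Reed-Solomon parameters from the notation
    RS<symbol_bits>(n, k): n total symbols, k data symbols."""
    redundant = n - k
    t = redundant // 2  # guaranteed correctable symbol errors
    return {
        "data_bits": k * symbol_bits,
        "redundant_bits": redundant * symbol_bits,
        "correctable_symbols": t,
    }

# RS8(18,16): 128 data bits, 16 redundant bits, corrects 1 symbol error.
print(rs_params(8, 18, 16))
```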
DRAM subsystems periodically read every memory location to detect and correct errors from transient faults before they accumulate into uncorrectable errors. This process is known as scrubbing.

2.3 DRAM Structure
We provide a brief summary of the main structure of a DRAM device and the internal components on which our fault model relies. We refer the reader to other publications for detailed descriptions and explanations of these structures (e.g., [37]).
Memory systems comprise multiple DRAM devices, where a DRAM device refers to one manufactured DRAM die. Depending on the DRAM packaging and interconnect technology (or standard) used, a device may be individually connected to, and controlled by, a memory controller via a channel (e.g., HBM), or multiple devices may be organized into modules and ranks that share a channel (e.g., a DDR dual in-line memory module, or DIMM). The devices that form a rank are all controlled as one wider device.
A DRAM device is structured as a hierarchy that optimizes area and performance. At the lowest level of the hierarchy are DRAM cells. Cells are grouped into mats, the fundamental array component of DRAM. Cells in a mat are arranged into columns and rows. The cells in a column are all connected via bitlines to a single bitline sense amplifier (BLSA) per column. One cell per column may be electrically connected to a BLSA at any time, and a group of such cells forms a row in the mat. One row is selected by operating the sub-wordline driver (SWD) that is connected to the cells in the row via a sub-wordline. To keep mat row and column pitch to a minimum, BLSAs and SWDs are shared between adjacent mats. Figure 1 shows the structure of a mat with its cells and peripheral SWD and BLSA circuits.
Mats are kept relatively small to balance BLSA and SWD area, delay, and power, with each mat holding on the order of 1Mb of data. Mats are therefore themselves arranged into coarser-grained arrays within a device. A group of mats forms a subarray, and a group of subarrays forms a subbank. Multiple subbanks make up a bank (Figure 2), and there are multiple banks in a device. Multiple banks share the off-device interconnect (the channel), but each bank is independently controlled to enable concurrent DRAM commands that hide the long latency of each command. Thus, each bank has its own independent set of decoders and interconnect. The per-bank row decoder is hierarchical, first choosing a subbank, then a subarray, and finally the SWD. The SWD is selected via a main wordline that runs across the subarray and additional pre-decoded control signals (FX0 and FX1 in Figure 1). Data from the BLSAs of a mat must eventually be communicated to the DRAM device data pins, commonly referred to by their pin-out designation as clocked data pins (DQs). At the edge of each mat, a column mux/selector connects a small subset of (consecutive) BLSAs at a time to (possibly segmented) global bitlines (or data lines). This subset is selected by a column decoder via column select lines (CSLs). The global bitlines are eventually connected to the bank edge and then to the DQs. The column decoder may remap columns to remove faults detected post-manufacturing to improve yield; this is done independently per subbank [34].
Global bitlines are shared between a vertical set of mats in the coarser-grained array, but mats that are horizontally adjacent within a subarray do not share global bitlines. As a result, the number of global bitlines per mat remains small to keep interconnect area small. All the mats in a subarray are accessed together and together supply data across all DQs in the channel.
One implication is that, for a particular access, there is a fixed connection between specific DQs and mats. The ratio of DQs to mats depends on the total number of DQs in the channel, the number of mats in a subarray, and the width of the column mux of each mat. This differs from design to design, but recent publications and DRAM standards indicate 1 DQ/mat for DDR5 [22] and LPDDR5 [30], and 2 DQs/mat for HBM3 [18, 23, 42].
Note that the number of mats in a subarray of narrow ×4 DDR devices is typically only half that of ×8 devices because the two share a single combo DRAM design [29]. One bit of the row address in the ×4 configuration selects which half of the physical subarray to access. We refer to each half as a logical subarray.

3 COMPONENT-LEVEL MODELING
This section describes our empirically-derived per-component fault model for DRAM, which can be parameterized and extended to multiple current and future technologies. We first discuss the DRAM error log dataset released by Alibaba [14] and then explain our model derivation.

3.1 Alibaba DRAM Error Dataset
The publicly-available Alibaba dataset includes error logs collected from more than 3 million DIMMs on 250,000 servers over an eight-month period in Alibaba's production data centers. It includes DRAM error logs where each entry includes a timestamp, row, column, bank, rank, memoryID, and serverID for correctable errors, as well as issue tickets related to server failures [8]. In total, the DRAM error log includes 75.1 million correctable errors from 30,496 servers (about 12% of servers report memory errors).
This public dataset is very useful, but it has a few limitations that we overcome through our analyses and the assumption that any observed sequence of error log entries is caused by the smallest number of plausible physical faults. Specifically, we face the following five challenges: coarse, cache-block reporting granularity; lack of details on the specific ECC mechanisms used; data representing multiple DRAM vendors; the presence of a memory page retirement policy that changes the expected errors generated by a fault; and potentially erroneous log entries.
Reporting granularity. The cache-block reporting granularity limits our analysis of the number of DQs (pins) affected by a particular fault and the errors it generates. The mapping of errors to the DQs they affect is considered proprietary, but which DQs are affected by an error is important for reliability analysis (it interacts with the ECC codes and their layout). We do our best to attribute component-level faults to the potential DQs they affect by analyzing the failure statistics of each fault type we categorize below.

ECC mechanisms. The dataset does not include information about the specific (Intel proprietary) ECC used. Knowing which errors are correctable and which are not is important for interpreting the error logs and deriving the component-level fault model. We overcome this challenge by making an assumption about the ECC correction capability and later validate this assumption in Section 6.1. Specifically, we assume the ECC can correct all errors that are confined to a single pair of DQs. We base this on recent publications [14] and the specific inclusion of 2-DQ errors in recent DRAM standards [22].

DIMM vendor information. The dataset associates each DRAM error with one of three anonymized DRAM vendors, but does not specify the total number of DIMMs from each vendor. This prevents us from calculating absolute FIT rates per vendor and hence per-component FIT rates. We address this by averaging across vendors for our experiments.

Page retirement. The systems covered by the dataset use an active page retirement policy. Page retirement improves system reliability by retiring physical memory pages that exhibit frequent errors (or that pass an error count threshold) [10, 13, 14, 21, 33, 52]. We observe potential page retirement occurring once per day, after which a DRAM that exhibited numerous errors earlier in that day no longer exhibits errors. This indicates that a retirement policy is in place, though we do not know its details. While not all faults are removed with this policy, retirement affects the classification of transient vs. permanent faults. Following prior work [5, 49-51], we classify faults as transient if they do not persist for more than one day (see the sketch below). We also model such faults as lasting one day in our simulations. Retirement also impacts the failure-rate estimations of our model because certain faults are removed before they lead to a failure.

Erroneous log entries. We identified several cases of likely erroneous log entries. These entries suggest permanent faults that impact only a small number of addresses, but with the addresses showing a very strong correlation (i.e., identical behavior) across banks, ranks, and even modules, despite not sharing a common physical component that could lead to such patterns. We use statistical analysis to show that these faults can be mapped to faults in a single internal DRAM component under the hypothesis that the bank, rank, or module address reported in those cases is erroneous.
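As an illustration of the one-day transient/permanent rule above, a minimal sketch; the function and field names are hypothetical, not the dataset's schema:

```python
from datetime import timedelta

def classify_fault(timestamps) -> str:
    """Classify a fault as transient or permanent from the timestamps of
    its error-log entries, following the one-day persistence rule."""
    span = max(timestamps) - min(timestamps)
    return "transient" if span <= timedelta(days=1) else "permanent"
```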

3.2 Mapping Errors to Component Faults
We derive our component-level DRAM fault model by analyzing the error logs with the goal of mapping the error events associated with each DRAM module to a root-cause, component-level fault. We do this by identifying an error pattern from the collection of all impacted addresses reported in the log for each DDR4 rank. We then classify that pattern based on which component fault likely generated it. For example, a single-bit fault would impact a single address in the DIMM in the dataset, while a BLSA fault would affect multiple addresses (cache blocks) that all map to the same DRAM column address across all rows of a subarray; Figure 3b shows an example of this BLSA fault pattern from all log entries associated with DIMM ID 4, rank 1, bank 6 in the dataset. We follow this approach to, whenever possible, associate the pattern from each DIMM with a single faulty component; fault rates are low enough that multi-component faults are very rare. We further assume that the vast majority of faults can be associated with the internal DRAM components described in Section 2.3: cells, sub-wordlines, BLSAs, sub-wordline drivers (SWDs), column select lines (CSLs), and decoders, which we describe in detail below.
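The following is a minimal sketch of this mapping step, with hypothetical log-field names; it collects the error addresses of one bank and tests for the BLSA signature described above (a single column address confined to a row range of roughly one or two mats):

```python
from collections import defaultdict

def collect_patterns(log_entries):
    """Group reported (row, col) error addresses by (dimm, rank, bank)."""
    patterns = defaultdict(set)
    for e in log_entries:  # e: dict with dimm, rank, bank, row, col fields
        patterns[(e["dimm"], e["rank"], e["bank"])].add((e["row"], e["col"]))
    return patterns

def looks_like_blsa(addresses, mat_rows=1024):
    """BLSA signature: a single column address, many rows, and all rows
    confined to a range no larger than two mats (a BLSA is shared by
    two vertically-adjacent mats)."""
    rows = {r for r, _ in addresses}
    cols = {c for _, c in addresses}
    return (len(cols) == 1 and len(rows) > 2
            and max(rows) - min(rows) < 2 * mat_rows)
```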
We are able to classify 99% of ranks with error reports using a decision-tree classifier. For each module, we rely on the count of affected columns, rows, banks, ranks, and modules in each error pattern, the distance between affected addresses, and whether an explicit failure is reported. Some component faults induce error patterns that are similar to one another. When possible, we disambiguate between them using the expected number of affected DQs for each component: when all DQs of a chip are impacted by a fault, we expect to observe failures with high likelihood, but not to observe failures if a component impacts only 1-2 DQs. We thus proportionally attribute faults between components that affect all DQs and those that do not, based on the observed fractions of failures and tolerated faults. The tables within Figures 3-5 summarize the fraction of all faults that we attribute to each pattern for each of the three vendors.
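A minimal sketch of this proportional attribution; the counts are illustrative, and the failure fraction is the one observed for the ambiguous pattern (e.g., the ∼3% of single-row faults that lead to a failure, discussed below):

```python
def split_ambiguous_faults(n_faults: int, failure_fraction: float):
    """Attribute faults with an ambiguous error pattern: faults that
    affect all DQs should fail with high likelihood, so the observed
    failure fraction estimates the all-DQ (decoder-like) share."""
    decoder_like = round(n_faults * failure_fraction)
    wordline_like = n_faults - decoder_like
    return decoder_like, wordline_like

# E.g., ~3% of single-row faults lead to a failure:
print(split_ambiguous_faults(1000, 0.03))  # (30, 970)
```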
Cell faults. There are billions of cells in a DRAM device, making cell faults the most common fault type. Cell faults manifest as a unique address in the error pattern of a DIMM, or a small number of addresses (1-4) that are completely uncorrelated (different row and column addresses and typically different banks). Such single-bit faults account for 85%, 45%, and 47% of all faults for vendors A, B, and C, respectively. Note that this does not imply that one vendor is less reliable, because the limitations of the dataset prevent us from computing per-vendor FIT rates. A bit error will impact a single DQ.

Sub-wordline faults. A sub-wordline connects a row of cells within a single mat. A faulty sub-wordline will affect a single row in a single mat, which corresponds to error patterns with multiple columns in a single logical row address (Figure 3a). The dataset covers ×4 DDR4 devices, in which all rows in a mat map to a single DQ.

Bitline sense amplifier faults. A BLSA is shared by a column of cells in each of two vertically-adjacent mats. BLSA error patterns manifest as a single column across numerous rows within a limited range of row addresses (Figure 3b). Specifically, we find that a common error pattern in the dataset corresponds to a single column spanning ranges of 1,024-2,048 rows. This matches nicely with recently-reported mat sizes of 512 × 512 and 1024 × 1024 [37, 45, 46, 55]. BLSA errors impact a single DQ.

Sub-wordline driver faults. An SWD drives the sub-wordlines of two horizontally-adjacent mats within a subarray. An SWD fault therefore affects the same single row address in those two mats. It is likely that the DRAMs in the dataset use the combo-DRAM design approach in which one physical subarray is treated as two logical subarrays, and only one logical subarray supplies data for a specific access [19, 29, 32]. If the two mats of an SWD are mapped to the same logical subarray, an SWD fault affects a single logical row address, but impacts two DQs in that row. The error pattern would look identical to that of a sub-wordline fault because the dataset lacks DQ-level information (Figure 3a).
On the other hand, if the SWD spans two logical subarrays, the impact is on two logical row addresses at a half-page distance of 64K rows. This pattern is prevalent, with an example shown in Figure 3c. This SWD fault impacts a single DQ in each transfer (in each of the two logical subarrays). This fault represents a strong deviation from prior models, which characterize errors from two distinct rows as a logical bank error. Our component-based analysis instead shows that only a single row in each of two mats is faulty.

Row decoder faults. The row decoder is a hierarchical, multi-stage decoder, leading to different error patterns depending on which level of the decoder suffers a fault. The subarray-level ("local") decoder decodes and drives a main wordline (MWL) and forwards the pre-decoded bits to the mats. A decoder or MWL fault will thus result in errors from either a single row (Figure 3a), two rows with a half-page delta (Figure 3c), a cluster of rows (Figure 4b), or two row clusters with a half-page delta (Figure 4c). In all cases, the fault affects all the mats in a subarray and hence all 4 DQs. We expect this to lead to a failure. We use this fact to differentiate decoder faults that impact only a single row from single-row faults caused by wordline and SWD faults. Specifically, ∼3% of the single-row faults lead to a failure; we attribute those to decoder faults and the rest to wordline and SWD faults.
A related fault impacts not the MWL or decoder itself, but rather the connections between the pre-decoded bits sent to a subarray and a set of SWDs. In such cases, a cluster of SWDs of two adjacent mats in a physical subarray is faulty. This leads to an error pattern that spans either one or two clusters of rows as above, depending on whether the two affected mats are in the same or different logical subarrays. Examples are shown in Figure 4b and Figure 4c. Such row decoder faults that impact a single cluster affect 2 DQs, while those that span two clusters impact a single DQ. We observe that roughly half of the occurrences of this type of error pattern span two clusters of 1K rows and half span just one cluster. We use this observation to differentiate between single-row SWD faults and wordline faults; we attribute the same number of SWD faults to both single-row patterns (Figure 3a) and two rows separated by 64K (Figure 3c), with the rest of the single-row errors being wordline faults. We also observe that these SWD-connection faults do not lead to failures in the dataset. We thus attribute ∼12% of these 1K-cluster patterns to the MWL faults described above and the rest to SWD-connection faults. Overall, the clustered pattern accounts for 0.6%, 0.4%, and 1.2% of faults for vendors A, B, and C, respectively.

We also observe an error pattern of small, 1-4-row clusters with a fixed stride of 16K rows between the clusters (Figure 4a). We estimate the size of a subbank to be 16K rows (see the discussion of the column decoder faults below as well). These error patterns occur fairly frequently for vendors B and C, yet 85% of these faults are transient and only ∼3% lead to a failure. We speculate that this mostly-transient fault type relates to a refresh operation that occurs at multiple subbanks simultaneously as a way of increasing refresh granularity. Specifically, a transient CSL or decoder fault that occurs while refreshing would yield this strided error pattern. If the error is CSL-related, then only a single DQ would be affected and no failure would be observed, but a failure is likely with decoder faults. Luckily, this fault is both transient and rarely results in a failure, so misinterpreting its root cause has little impact on the reliability analysis.

Column decoder and CSL faults. We identify four different error patterns that we attribute to column decoder or CSL faults (Figure 5). The first pair of patterns (Figure 5 a/b) exhibits one or two clusters of ∼16K rows along a single cacheline-granularity column address. Faults yielding these error patterns rarely lead to a failure, so we attribute them to a fault that impacts either a single CSL within a subbank that incorrectly drives a column of mats, or cases where the signals for two CSLs are flipped near a column of mats (decoder). This fault type will impact a single DQ. We refer to these faults as CSL faults and do not distinguish them in our analysis, though we note that the one that yields a column pair would have been categorized as a bank-level fault using the models of prior work.
The second pattern pair is similar, but includes two clusters separated by the half-page distance of 64K. We attribute such error patterns to a fault in the column decoding logic that incorrectly maps one or two columns and affects all the mats in the subbank. These faults manifest error patterns that are similar to those of CSL faults, but we expect the decoder faults to lead to failures. We observed failures in ∼9% (Figure 5c) and ∼40% (Figure 5d) of each error pattern and attribute those fractions to decoder faults.

Column mux faults. We did not observe any error patterns that we can confidently attribute to a column mux fault. A column mux (and its circuits) is shared by all rows and all column groups in two vertically-adjacent mats. We expect such faults to yield error patterns that impact all columns and all rows at subarray granularity, yet not lead to a failure. We did not observe any such clear patterns.

Overlapped faults. By manual inspection, we identified error patterns that were a clear overlap of two of the patterns above. We classify these as overlapped faults and observe that they occur in 0.7%, 0.9%, and 1.6% of faults for vendors A, B, and C, respectively. We did not develop a classification algorithm for these faults because their number was small enough for manual inspection and their patterns too complicated to capture in the decision tree.

Rank and bank faults. We observe that 0.7%, 1.6%, and 2.5% of the faults for vendors A, B, and C, respectively, span numerous addresses across multiple ranks and lead to failures. Such faults are rooted in components that are shared across a channel, which are pins or other transmission-related components. We categorize such faults as rank-level faults and conservatively assume they impact all 4 DQs of a device. In practice, they may impact only a single DQ, but the dataset lacks DQ information.
We also observe that 0.3%, 0.7%, and 0.8% of faults for vendors A, B, and C, respectively, span numerous addresses across banks of a single rank. Similarly to rank-level faults, we classify these as multi-bank faults and conservatively assume they impact all DQs and all addresses.
Of all faults, only 0.7%, 1.5%, and 2.9% for vendors A, B, and C, respectively, do not fit any of the categories described above, i.e., they are not attributed to any of the internal bank components or clear overlaps, yet are contained within a single bank. We classify these as single-bank faults. Due to their small number, we inspect them manually and conclude that they are primarily faults that impact multiple subbanks.

Erroneous log entries. We observed a substantial fraction of error patterns (∼5%) that we strongly suspect are the result of erroneous log entries. Specifically, these error patterns are clear single-component patterns that match those described above (primarily in a single row or column address), yet span multiple banks and even modules. Such address correlation is implausible because any faulty component that is shared across banks, ranks, and modules would impact numerous rows and columns.
We perform a statistical analysis showing that, with high confidence, these patterns can be attributed to a single bank, with the anomalous spread attributed to a reporting bug. We compute the total variational distance between the statistics associated with the suspected true root-cause fault and each of the other categories. We then plot the distribution of that distance along with the distance that corresponds to the root cause (Figure 6). The figure clearly shows that the distribution distance (total variational distance) is smallest when computed against the suspected root-cause fault and always more than one standard deviation away from the distance mean. We therefore count these anomalous faults as belonging to their most-likely basic category.
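A minimal sketch of the distance computation, assuming the per-category statistics are expressed as discrete distributions over pattern features:

```python
def total_variational_distance(p: dict, q: dict) -> float:
    """TVD between two discrete distributions given as dicts mapping
    outcome -> probability (missing keys count as probability 0)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def most_likely_category(anomalous: dict, category_dists: dict):
    """Assign the anomalous pattern to the category with minimal TVD."""
    return min(category_dists,
               key=lambda c: total_variational_distance(anomalous,
                                                        category_dists[c]))
```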

4 ON-DIE ECC ADDRESS PROTECTION
Our analysis of errors and faults exposes multiple new insights. Among them is the relative prevalence of errors arising from decoder faults. Prior work characterizes faults at logical granularity and thus simply attributes these decoder faults to bank faults. However, the implications of decoder faults are more nuanced and are particularly meaningful for systems that rely primarily, or even entirely, on on-die ECC.
Prior characterization focuses on DDR DRAM channels where accesses span multiple devices across a rank. A decoder fault affects a single device, and therefore only the data read from that device is erroneous. A strong (e.g., RS) ECC code easily detects such errors.
When ECC is on-die, all the data and its redundant information are sourced from a single subarray. A decoder error therefore results in the entire codeword (data and redundancy) being read from an incorrect location. The on-die ECC still receives a valid codeword, but the data is erroneous from the software perspective: an SDC failure. In contrast, if we consider these decoder faults to be bank-level, the most likely ECC outcome is a DUE failure.
To improve reliability, we propose to avoid such decoder-error SDCs by incorporating the row and column addresses into the ECC codeword, thus detecting decoder errors as DUEs. We adapt the extended-data ECC approach of All-Inclusive (AI) ECC [28] for on-die ECC. We XOR the row and column addresses together to reduce decoder complexity and to also support bit-level ECCs. We illustrate our approach for a possible RS(18,16) on-die ECC in Figure 7. During encoding, the ECC encoder implicitly includes the row and column addresses with the data as part of the encoded message. Both data and address are used to generate the on-die redundant bits, but only the data and redundancy are stored in memory. On a read, the address from the read command is again added to the codeword that is retrieved from memory, and the ECC decoder attempts to correct any errors. If an error is detected, or a correction is attempted on the implicitly-attached address information, a DUE is reported.
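A minimal sketch of this flow, using a simple XOR checksum as a stand-in for the real RS encoder of Figure 7 (a real implementation folds the address into the RS redundancy and can still correct data errors; everything below is illustrative):

```python
def addr_tag(row: int, col: int) -> int:
    """XOR the row and column addresses into one 16-bit tag; a
    simultaneous fault in both decoders is improbable."""
    return (row ^ col) & 0xFFFF

def encode(data_words, row, col):
    """Fold the address tag into the redundancy. Only data and
    redundancy are stored in the array, never the address itself."""
    check = addr_tag(row, col)
    for w in data_words:
        check ^= w  # stand-in for RS redundancy over (address || data)
    return data_words, check

def decode(stored_words, stored_check, row, col):
    """Re-attach the address from the read command. If the row decoder
    delivered a codeword from the wrong location, the stored check
    embeds a different address tag and the check fails: flag a DUE."""
    check = addr_tag(row, col)
    for w in stored_words:
        check ^= w
    if check != stored_check:
        return None, "DUE"  # mismatch on data or on the implicit address
    return stored_words, "OK"
```

With a real RS code, a correction that lands on the implicitly attached address symbols is reported as a DUE rather than applied, which is what turns would-be decoder-fault SDCs into detectable events.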
This approach detects decoder errors with extremely high probability and avoids silently corrupting data without requiring either additional redundant storage or redundant decoders, which are very expensive. Our evaluation demonstrates the effectiveness of address protection, reducing the expected SDC rate of LPDDR5, for example, by orders of magnitude (Section 6.3).

5 METHODOLOGY

5.1 Fault and Error Simulator
Our simulator accurately models component-level DRAM faults and the resulting errors to predict the impact of operational and inherent faults in recent and future DRAM systems. Prior simulators rely only on the logical categorization and miss important correlations and bounds stemming from the physical components. For example, our simulator identifies decoder-related faults and therefore does not overestimate the potential overlap between two-row or mat-level faults and other faults, because our model does not characterize those as "bank" level.
Like prior simulators [16, 26, 35], we use a Monte Carlo approach. Each trial consists of injecting faults based on the fault model into a representative channel (e.g., a DDR5 rank) until either a failure occurs or the simulated lifetime (i.e., 5 years) is reached without a failure, though possibly with faults that yield only correctable errors. Injection continues after the first fault so that we can also estimate the rate of failures from overlapped faults.
Unlike some simulators that only check whether the injected fault pattern can theoretically lead to a failure for the simulated ECC (e.g., [35]), our model refines reliability estimates by adding a nested set of Monte Carlo trials over the error patterns resulting from the injected fault patterns, and then actually executes the error correcting code to estimate reliability. This helps with simulating the impact of scaling faults and with estimating the true SDC rate.
We report reliability as the expected probability that a representative memory channel experiences a DUE, and separately an SDC, over its lifetime of five years. Because faults across channels are i.i.d., estimating the reliability of large systems with multiple servers/sockets and multiple channels requires simply scaling the single-channel results. Scaling lifetime is also roughly linear because the fault rate is low and the rate at which two independent operational faults overlap is quite low (that is not the case for scaling faults, but those occur at a steady rate).
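A minimal sketch of the Monte Carlo structure described above; the component names, rates, and ECC-outcome callback are placeholders, and the full model is implemented in the released simulator at https://github.com/lpharch/DRAM_FAULT_SIM:

```python
import random

LIFETIME_HOURS = 5 * 8766  # 5-year channel lifetime

def run_trial(component_fits: dict, ecc_outcome) -> str:
    """One trial: inject component faults with exponential inter-arrival
    times over the lifetime; after each fault, ask the ECC model for the
    outcome of the accumulated (possibly overlapping) fault set."""
    total_rate = sum(component_fits.values()) / 1e9  # faults per hour
    t, active_faults = 0.0, []
    while True:
        t += random.expovariate(total_rate)
        if t > LIFETIME_HOURS:
            return "OK"
        # Pick which component failed, weighted by its FIT share.
        comp = random.choices(list(component_fits),
                              weights=list(component_fits.values()))[0]
        active_faults.append(comp)
        outcome = ecc_outcome(active_faults)  # "OK", "DUE", or "SDC"
        if outcome != "OK":
            return outcome

def estimate(component_fits: dict, ecc_outcome, trials: int = 100_000):
    results = [run_trial(component_fits, ecc_outcome) for _ in range(trials)]
    return {o: results.count(o) / trials for o in ("DUE", "SDC")}
```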

5.2 Fault Model
We use a model that reflects the per-component empirical model we derive in Section 3, scaled to the different memory technologies we evaluate. We first develop a model for a ×4 DDR4 device as a baseline. Because the dataset prevents us from computing per-vendor FIT rates, we use a representative DRAM device model, which has the per-component rates that correspond to a system where all vendor devices have the average FIT rate observed in the data. This representative ×4 DDR4 device fault model is summarized in Table 1.
We incorporate the scaled-DRAM inherent fault model of Gong et al. [16, 17]. We sweep the ratio of weak cells that are susceptible to scaling errors between 10^-5 and 10^-8, with a constant 10^-6 activation probability for each weak cell (the probability that a weak cell generates an error at any particular time). The 10^-7 point corresponds to the error rate used in prior work [40], and we set that as our nominal scaling error rate. However, Gong et al.'s model conservatively assumes that column-level and bank-level faults impact all rows in a logical column or an entire bank, leading to a very high overlap probability between inherent scaling errors and such broad faults (effectively guaranteeing overlap with bank-level faults). Our proposed model yields far lower overlap probabilities because operational faults are modeled at the fine granularity of individual components.

We use two approaches to scale our representative DDR4 device to 32Gb ×4 DDR5, ×16 LPDDR5, and HBM3 device fault models. The first approach keeps the per-device overall FIT rate the same as that of DDR4 and scales the relative frequency of the different component-fault rates based on the ratio between those components in each newer technology compared to DDR4 (e.g., the number of decoders scales with bank count, SWDs and BLSAs with capacity, etc.). We estimate the component count ratios from the relevant specifications and related publications [1, 2, 18, 22, 23, 37], as shown in Table 2. The motivation for this approach is that prior field studies conclude that the overall FIT rate scales with device count rather than capacity and is roughly constant across DRAM generations [49-51].
The second approach derives a per-component FIT rate from the representative DDR4 device and then uses the same component-scaling approach to derive DDR5, LPDDR5, and HBM3 models, which have higher overall FIT rates because of their larger number of internal components. We report a range of reliability based on these two models in most experiments.
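A minimal sketch of these two scaling approaches; the component names and ratios are placeholders for the values in Table 2:

```python
def scale_fault_model(ddr4_component_fits: dict, count_ratios: dict,
                      keep_device_fit: bool = True) -> dict:
    """Scale per-component FIT rates from the DDR4 baseline to a new
    technology using component-count ratios (new count / DDR4 count).
    If keep_device_fit, renormalize so that the overall per-device FIT
    is unchanged (first approach); otherwise let the device FIT grow
    with the component counts (second approach)."""
    scaled = {c: fit * count_ratios.get(c, 1.0)
              for c, fit in ddr4_component_fits.items()}
    if keep_device_fit:
        norm = sum(ddr4_component_fits.values()) / sum(scaled.values())
        scaled = {c: f * norm for c, f in scaled.items()}
    return scaled
```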

5.3 ECC Schemes
We evaluate several ECC schemes for the different memory technologies and compare their reliability. For DDR5, we use a SEC(136,128) on-die ECC, which can correct one bit out of 128 data bits [22], and two rank-level ECC options: a 10-device ×4 rank with the strong Chipkill-level protection expected for servers, and a weaker 9-device rank that can only correct up to 2-DQ errors.
The HBM3 specification [23] and a recent publication [18] suggest that the on-die HBM3 ECC is an RS16(19,17) code. We evaluate this ECC, as well as a simplified version of this code that interleaves two RS8(19,17) codewords to form the 256b HBM3 access [42]. We also evaluate an RS8(38,34) code that has a single long codeword of 8b symbols and can tolerate a larger number of scaling-induced inherent bit errors. We evaluate all three schemes with and without our proposed address protection extension (Section 4).
There is no public information about the on-die ECC of LPDDR5. LPDDR5 may use the 6.25% redundancy of the DDR5 on-die ECC, in which case each LPDDR5 access may comprise two SEC(136,128) codewords or, for example, an RS8(34,32) codeword. However, the access granularity of LPDDR5 matches that of HBM3, and we therefore also evaluate the RS8(36,32) code, which has 12.5% redundancy. We evaluate all schemes with and without address protection.
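For reference, a small sketch comparing the redundancy and guaranteed correction capability of the candidate codes discussed in this section:

```python
codes = {  # name: (symbol_bits, total_symbols, data_symbols)
    "RS8(34,32)":  (8, 34, 32),   # DDR5-like 6.25% redundancy
    "RS8(36,32)":  (8, 36, 32),   # HBM3-like 12.5% redundancy
    "RS16(19,17)": (16, 19, 17),  # suggested HBM3 on-die code
    "RS8(38,34)":  (8, 38, 34),   # single long codeword
}
for name, (bits, n, k) in codes.items():
    # Redundancy relative to data; t = guaranteed correctable symbols.
    print(name, f"redundancy={100 * (n - k) / k:.2f}%", f"t={(n - k) // 2}")
```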

6 EXPERIMENTS
We conduct three sets of experiments. The first validates the simulator and model; the second estimates the expected reliability of DDR5, HBM3, and LPDDR5 memories with different ECC schemes. The third set of experiments highlights the benefits of using our fine-grained component-level model to understand the lifetime and module-replacement implications of tightly integrating HBM3 or LPDDR5 memories within processor sockets or modules.

6.1 Simulator Validation Discussion
First, we validate our assumption that the primary ECC scheme used in the machines represented in the error log dataset is not Chipkill-level. We simulate a system with the same number of servers and the same DDR4 memory capacity using our model under two hypotheses: (1) that the ECC used is Chipkill-level and can correct all errors that are confined to a single device in a rank; and (2) that the ECC can only correct 2-DQ errors. The expected number of failures in an 8-month period with a Chipkill-level ECC is less than 10 (closer to 1). With 541 failures observed in the dataset, it is therefore extremely unlikely that Chipkill is used. In contrast, the model predicts 870 failures with the weaker ECC, which is close to the observed number, especially when considering that page retirement and DIMM replacement were employed to lower the failure risk in the studied system. The specific policies used and their impact are not reported with the dataset, but other recent work demonstrates that error logs can predict future system failure with up to 63% precision [31] and can be used to guide retirement and replacement.
The lack of DQ-level error reporting in the dataset, and in all other recent fault analysis publications (the last publication to include DQ information is from 2012 and is based on DDR2, making direct comparisons challenging [50]), severely limits our ability to further validate the model. We use the same experimental setup as above to also compare the weaker ECC with a fault model that follows prior methodologies of using logical-level classification into bit, column, row, bank, multi-bank, and rank faults with the most recently reported fault rates [5]. Because no DQ-level information is available, we run two such experiments. The first assumes that only coarse-grained faults are multi-DQ and that all such bank and multi-bank faults affect all 4 DQs of a chip. The second experiment uses the per-fault DQ distribution from our model when mapped to the logical-level faults reported in prior work; we treat multi-row and multi-column faults as single-bank faults following Beigi et al. [5]. The first experiment predicts over 4,000 failures, while adding the more-refined DQ distribution of our model predicts 1,800 failures. Both are far more than the 541 failures observed and the 870 failures predicted by our full model.

Another point of validation is a comparison of the expected reliability of HBM3 predicted by our model for the same system configuration as that of a recent HBM3 ECC study [18]. We configure the simulator for 40K sockets with eight 8-Hi stacks of 16Gb HBM3 devices each. Simulating this system predicts a mean time between failures of 60 hours for our representative-device HBM3 model and 92 hours when considering just one of the vendors. This is very close to the 100-hour value reported by Gurumurthi et al. [18].

6.2 Overall Reliability
Figure 8 shows the estimated system failure probability within 5 years of operation for DDR5 (for both 9- and 10-device ranks), HBM3 with RS16(19,17) ECC, and LPDDR5 with RS8(36,32) ECC, with our fault model based on each vendor individually and on our (overall) representative device. We make three important observations. Although not shown, the reliability of the 10-device DDR5 rank is comparable to that of 18-device DDR4 ranks in existing Chipkill-level enabled systems, reported as a DUE probability of 10^-3 and an SDC probability of 10^-7 in prior work [26].
First, while the models based on individual vendors are not identical, they predict similar reliability and are well represented by our representative-device model. Second, the 9-device DDR5 rank exhibits a high SDC-failure probability, which is orders of magnitude worse than the other configurations, likely relegating it to lower-reliability and smaller installations. Third, as expected, the predicted reliability of a 10-device DDR5 rank is far superior to that of HBM3 and LPDDR5. However, the SDC failure rate of those systems trails DDR5 by less than two orders of magnitude; we discuss HBM3 and LPDDR5 reliability in more detail below.

6.3 ECC and Scaling Faults Interactions
To better understand the behavior of HBM3 and LPDDR5, we evaluate these memory systems with different ECC schemes and different rates of inherent scaling-induced faults.

HBM3 reliability. We explore three different ECC schemes for HBM3 based on the HBM3 internal architecture and prior work [23, 42]: RS8(38,34), RS16(19,17), and a codeword of two interleaved RS8(19,17) codes. We evaluate each ECC with and without our address protection extension (Section 4) and for four different rates of scaling faults (10^-8 to 10^-5).
Figure 9 reveals three key observations. First, the higher scaling fault rates are catastrophic for all but the RS8(38,34) ECC. This ECC can correct up to 2 arbitrary symbol errors and can hence tolerate a range of different faults that overlap with a scaling fault (including any two scaling faults). The other ECC schemes, which are the ones mentioned for HBM3 in industry papers [18, 38, 41], can only correct a single 16b-long effective symbol. This lack of flexibility dooms them when scattered scaling faults dominate.
Second, at lower scaling-fault rates, all ECCs exhibit generally reasonable DUE-failure probabilities. However, the interleaved RS8(19,17) offers poor SDC coverage, even when address protection is used. The third observation is that our model predicts that address protection is necessary for achieving reasonable SDC coverage in these single-device channels that rely solely on on-die ECC.

LPDDR5 reliability. The access granularity of the LPDDR5 device we model matches that of HBM3, but its internal organization and component counts differ; e.g., LPDDR5 has half as many banks and uses longer bursts. Fewer details about LPDDR5 ECC have been disclosed compared to HBM3. We therefore evaluate a range of possible ECC schemes with two different redundancy levels: one matching that of DDR5 and the other that of HBM3. We highlight three interesting observations. First, the RS8(34,32) ECC simply does not have enough redundancy to tolerate the expected faults in the system that correspond to sub-wordlines, BLSAs, and SWDs. It therefore offers reliability that is not competitive with the other schemes and is simply too low.
Second, the low-redundancy SEC_SEC organization does a good job of correcting errors and avoiding DUE failures. However, this comes at the expense of an extremely high SDC failure rate, which suggests that this ECC organization is also unacceptable.
Third, the RS8(36,32) ECC, which matches one of those we also suggest for HBM3, strikes a good balance between DUE- and SDC-failure rates, especially when address protection is enabled. Address protection turns decoder errors from SDC into DUE events, and this is clearly visible in the results.

6.4 Expected Module Lifetime
Our final set of experiments highlights the benefits of using our component-level model when interpreting the reliability estimates in terms of system impact and the costs associated with maintaining overall system capability. We do this by considering the expected rate at which memory modules must be replaced. Of course, while a DIMM can be physically replaced, the loss of an HBM3 or LPDDR5 device requires replacing a processor module and is far more costly.
To avoid replacing a processor, modern memory systems support post-package repair mechanisms, such as remapping faulty rows within a device, and the OS can retire address ranges. However, these mechanisms are limited to certain fault granularities. We evaluate the expected efficacy of these replacement-avoiding approaches when the policies used are tuned based on our component-level model vs. the prior approach of coarse-grained logical-component fault modeling.
We determine the replacement rate by using the overall FIT rate and the relative contribution of each component-level fault that cannot be repaired in the field with remapping and retirement of small memory regions. These fault rates are shown in Figure 11 for different ECC configurations of HBM3, LPDDR5, and DDR5. We first consider all logical-bank level faults as not repairable and calculate that ∼2.5% of HBM3 and LPDDR5 "modules" would trigger replacement over a 5-year period. This could be as much as 10% of sockets in a system of accelerators with 128GB of HBM3, for example. With our refined model, however, we identify those logical-bank faults that result from SWD and many decoder faults as impacting a relatively small number of rows, making them amenable to repair/retirement. This slashes the replacement rate to 0.8% over 5 years. If column repair is also available (or the corresponding address ranges are retired), the rate drops further to 0.7%. Thus, a much more reasonable ∼2% of 128GB accelerators need replacement over a machine's 5-year lifetime.
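A minimal sketch of the replacement-rate arithmetic, assuming a constant rate for the non-repairable share of faults; the FIT values below are illustrative back-calculations from the percentages above, not the paper's reported rates:

```python
import math

def replacement_prob(nonrepairable_fit: float, years: float = 5.0) -> float:
    """Probability that a module triggers replacement within `years`,
    given the combined FIT of faults that cannot be repaired/retired."""
    return 1.0 - math.exp(-nonrepairable_fit / 1e9 * years * 8766)

# Treating all logical-bank faults as non-repairable vs. the refined
# component-level view (illustrative FIT values):
print(replacement_prob(578))  # ~2.5% over 5 years
print(replacement_prob(183))  # ~0.8% over 5 years
```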

7 RELATED WORK
This paper focuses on the characterization, modeling, and analysis of DRAM faults and errors. We briefly discuss work related to this focus below. Other prior work on ECC mechanisms is discussed in context earlier in the paper.
Our work builds on prior reliability analyses of large-scale systems, including systems at US national labs [5, 15, 43, 47, 49, 50], Google [44], Meta [33], and Alibaba [7]. These prior studies used error logs and system telemetry (e.g., temperature, rack position, uptime, etc.) to investigate DRAM failures and their impact on system reliability. Key findings from these studies include the characterization of DRAM fault modes, such as single-bit, single-word, single-column, single-row, single-bank, multiple-bank, and multiple-rank errors. This characterization enabled empirical error models and triggered new research on DRAM error correction and fault tolerance mechanisms.
Other work has explored the impact of DRAM scaling-induced errors, such as those from variable retention time (VRT) faults. Since these errors become more common as DRAM scales, it is possible that they overlap with errors from the operational faults discussed in the studies above. Prior work has analyzed and attempted to characterize and model scaling faults [16, 40]. We rely on the model described by Gong et al. [17], which combines scaling and operational faults.
However, these prior characterization works have an important limitation. As DRAM scales, accurately estimating the overlap between faults becomes far more important for reliability prediction, yet this prior work relied on coarse-grained logical-level fault models. A very recent paper also recognizes this important challenge [5]. The authors attempt to address this limitation by mapping errors to more fine-grained regions, but still face challenges in providing comprehensive information and in projecting to newer DRAM devices like HBM3 and LPDDR5.
Prior work also exists on modeling DRAM faults at the wafer and circuit levels. For example, at the circuit level, Kim et al. simulate VRT errors [25]. There are also several studies on identifying failures during the manufacturing and testing process [20, 24, 53]. However, this prior work focuses on increasing yield and does not model operational faults, making it unsuitable for reliability analysis.

8 CONCLUSION
We propose a new, fine-granularity error model that is applicable to a wide range of DRAM structures. By mapping error patterns to physical components, such as wordlines, BLSAs, CSLs, sub-wordline drivers, and row and column decoders, we can identify the root causes of errors. This physical-component mapping allows our model to be applied to various DRAM types, including DDR5, HBM3, LPDDR5, and future technologies, simply by adjusting the error rates of each physical component according to the memory architecture.
Using our model, we demonstrate that while Chipkill remains an effective method for ensuring reliability, relying on on-die ECC-enabled HBM and LPDDR systems results in substantially lower reliability. This is because Chipkill (including in DDR5) relies on redundant DRAM chips at the rank level, while on-die ECC is susceptible to some internal faults that affect both the data and its redundancy.
We identify and highlight a new failure mode for systems that rely on on-die ECC: an incorrect address can result in silent data corruption. We propose a low-cost method to detect such faults by protecting the address bits. While this mechanism alone cannot correct errors caused by a faulty address decoder, it can detect them. Additional error correction at the system level can eventually correct these errors, further enhancing the reliability of memory systems.
Finally, we provide better insight into future DRAM errors and reliability. To this end, we build a simulator based on the proposed error model to estimate the reliability of various memory types and ECC schemes. We expect this work to bring a better understanding of memory reliability and to improve error correction methods for future DRAM systems without relying solely on retrospective analysis of older technology.

Figure 1: Mat structure; an SWD is shared with a neighboring mat in the same subarray.

Figure 3: Example of likely mat-level error patterns. Each subfigure shows the 2D (row, col) address map of a single bank with all addresses for which an error was reported. The table within the figure summarizes the fraction of all faults attributed to a faulty component type that we attribute to each pattern for each of the three vendors.

Figure 4: Example multi-row error patterns. Each subfigure shows the 2D (row, col) address map of a single bank, with addresses in which errors were observed for a particular DIMM plotted as dots. The table within the figure summarizes the fraction of all faults attributed to a faulty component type for each pattern for each of the three vendors.

Figure 5: Example multi-column error patterns. Each subfigure shows the row/col address map of a single bank, with addresses in which errors were observed for a particular fault plotted as dots. The table within the figure summarizes the fraction of all faults attributed to a faulty component type that we attribute to each pattern for each of the three vendors.

Figure 6: Total variational distance statistics for each suspected erroneous fault vs. all possible fault categories. The bar represents the mean distribution distance, the vertical line is one standard deviation, and the red dot is the distance computed vs. the suspected root cause.

Figure 7: Address protection for on-die ECC by implicitly including the row and column address as part of the ECC computation; we propose to XOR the row and column address because a simultaneous fault in both decoders is improbable.

Figure 8: Probability of DUE and SDC within 5 years for various vendors and memory types.

Figure 11: Proportion of component failures contributing to system failure (FIT rate marked above each bar) for a 32GB system of HBM3, LPDDR5, and DDR5. *For DDR5_10chips, 100% of failures result from overlapped operational and scaling errors, leading to a very low FIT. In contrast, failures of the other configurations are all from a single fault type.

Table 1: Fraction of per-component faults for each vendor and the representative ×4 DDR4 device.

Table 2: Memory configurations for system reliability analysis.