Understanding Silent Data Corruptions in a Large Production CPU Population

Silent Data Corruption (SDC) in processors can lead to various application-level issues, such as incorrect calculations and even data loss. Since traditional techniques are not effective in detecting processor SDCs, it is very hard to address problems caused by SDCs. For the same reason, knowledge about SDCs in the wild is limited. In this paper, we conduct an extensive study on SDCs in a large production CPU population, encompassing over one million processors. In addition to collecting overall statistics, we perform a detailed study to understand 1) whether certain processor features are particularly vulnerable and their potential impacts on applications; 2) the reproducibility of SDCs and the triggering conditions (e.g., temperature) of those less reproducible SDCs; and 3) the challenges and opportunities to mitigate SDCs. Inspired by the above observations, we design an efficient SDC mitigation approach called Farron, which relies on prioritized testing to detect highly reproducible SDCs and temperature control to mitigate less reproducible SDCs. Our experimental results indicate that Farron can achieve lower overall overhead with better coverage of SDCs, compared to the baseline used in Alibaba Cloud.


Introduction
In recent years, CPU technology has developed rapidly, with higher clock frequencies and more cores per processor. A classical assumption is that processors work as designed and produce reliable computation results [38]. However, processor faults do occur in production environments [1, 46, 47, 48, 56]. Both the growing complexity of modern CPUs and the increasing scale of cloud infrastructures have increased the risk of processor faults.
These processor faults can lead to application-level errors, which fall into two categories. One class of errors causes crashes or exceptions promptly. The other class can introduce undesired data (e.g., incorrect calculation results or even data loss) without being detected immediately. We call this second class of errors "silent data corruptions" (SDCs).
CPU SDCs occur at a low but non-negligible frequency in production. For example, a few processors in Alibaba Cloud occasionally gave wrong checksum calculation results. Such incorrect information misled the cloud application to conclude that request data was corrupted and thus frequently triggered repeated requests, which impaired the overall performance. SDCs also occur in Google Cloud [48]: a small subset of their processors gave wrong results when executing some rarely-used instructions. These errors made a large-scale data-analysis application give wrong answers. Meta has also observed SDCs [46]: one machine occasionally misjudged the file size to be zero due to a wrong calculation, causing a database to lose files.
Data corruptions in storage and memory systems are well known to be dangerous and hard to detect and diagnose. CPU SDCs are even more notorious, because the basic techniques used to detect data corruptions in storage and memory systems can hardly be applied to CPU instructions (e.g., how can one know that a computational instruction gives a wrong result?). As a result, Meta reports that SDCs can require months of engineering time to debug [46], and Alibaba Cloud once took several weeks to debug an SDC.
In this paper, we investigate SDCs in a large production CPU population, encompassing over one million CPUs from hundreds of clusters in 28 data centers across 14 countries. To the best of our knowledge, this is the first work to quantitatively assess and systematically analyze CPU SDC phenomena in a large-scale production environment. The main contributions of this paper can be summarized as follows:
• Assessing SDCs in a large-scale production environment: We have conducted SDC testing on over one million processors over 32 months. In addition to providing the overall statistics of failure rate, we further analyze how micro-architecture, the timing of testing, etc., affect the failure rate and whether these failures affect a single core or all cores within the processor.
• Investigating software impacts of SDCs: Through this study, we identify vulnerable features in the processors (e.g., cache consistency, floating-point computation, and vector computation), and find that, not surprisingly, workloads that extensively engage such vulnerable features are likely to be affected. Moreover, we reveal the deficiency of existing failure models. For example, we find that for floating-point calculations, SDCs are more likely to cause bitflips in the fraction part, which only cause a small loss of accuracy due to floating-point encoding [17] and thus render existing accuracy-based detection techniques less effective [33].
• Analyzing reproducibility of SDCs: We find that, on the one hand, some SDCs are highly reproducible, which means that, without proper mitigation, they will manifest frequently in production: this is confirmed by Alibaba Cloud's investigation of production errors caused by SDCs. On the other hand, some SDCs can only be triggered under specific conditions (e.g., temperature, workload stress). Such observations motivate our exploration of new SDC mitigation strategies, as discussed next.
• Assessing current mitigation practices for SDCs: We identify a substantial design space for improving existing SDC testing by prioritizing testcases and developing testcases focused on multi-threading scenarios. Moreover, we discuss the challenges that SDCs present to fault-tolerance techniques.
• Proposing a concrete approach for SDC mitigation: Inspired by the above observations, we propose Farron, an efficient SDC mitigation approach. It relies on regular testing to identify highly reproducible SDCs, and improves the efficiency of this approach with testcase prioritization; it controls the temperature of processors to minimize the occurrences of less reproducible SDCs; it can further make a trade-off between these two approaches by assigning longer testing time if a processor has to work under a high temperature for long. Our evaluation shows that Farron can protect applications from CPU SDCs, with both a higher SDC detection rate and lower overhead compared to the baseline approach.
2 Motivation and Methodology

Target System
We study SDCs in Alibaba Cloud, which involves hundreds of clusters deployed in 28 data centers worldwide. Alibaba Cloud provides a stable working environment for hardware, with a strong focus on cooling, power distribution, cable management and environmental monitoring, and environmental variations are kept to a minimum.
Our study includes over one million processors deployed since 2017. These processors are supplied by a well-known international chip manufacturer. They cover a wide range of recent micro-architectures, use advanced lithographic technology, and widely adopt multi-core technology. We believe our processors are representative of the international mainstream.

SDC Examples in Production
Over time, Alibaba Cloud has occasionally observed servers with a higher error rate compared to others. After extensive debugging to identify the root cause, we found that the problems were due to processor defects. Here we present some examples.
In one case, a storage application frequently reported checksum mismatches of the user data. After weeks of debugging, we found that one processor in the fleet was faulty and that a checksum-calculation related instruction on the processor intermittently gave wrong results.
In the second case, we also observed checksum mismatches, but our debugging revealed a different cause: a client thread packed data and its checksum into a buffer, which was then shared with a daemon thread. Due to defective cache coherence, the daemon thread sometimes got inconsistent data, incurring checksum mismatches.
In another case, a program sometimes triggered assertion failures. We later found that this was because the application used a hash map to manage its metadata, and defective hashing calculation in a faulty processor affected its metadata service.
These cases, particularly the concern that some errors may not be detected by checksums, together with similar reports from industry [46,48], have motivated us to conduct this study.

Toolchain
We use a toolchain provided by the chip manufacturer, which is designed to detect SDCs related to cloud workloads. The toolchain includes 633 testcases and a framework. The framework drives these testcases and checks for the occurrence of SDCs. According to a user's specification, the framework selects the testcases to be performed and controls their execution order, resource allocation (such as CPU time and concurrency) during testing, etc.
Testcases are programs that simulate cloud workloads, carefully crafted with consideration of both software behaviors and hardware features. Most testcases focus on individual processor features, such as floating-point calculation, branch prediction, cache, interconnect between cores, etc. The complexity of these testcases varies significantly: 1) Some execute a specific instruction within a loop. 2) Some call functions in libraries. 3) Some invoke application logic.
We also tried other toolchains designed for SDC detection, such as OpenDCDiag [60], as a supplement, and reached the same observations in our study.
In our study, this toolchain serves two roles. First, it provides an authoritative method to test processor functions. Second, it acts as a simulator of impacted workloads when conducting in-depth studies on faulty processors. To facilitate testing and analysis, we have designed additional tools, which will be discussed in the corresponding sections.
Even with the toolchain provided by the manufacturer, it is often not easy to determine whether a failed test is due to a CPU SDC or other reasons (e.g., a memory error), especially considering that some failures are not reproducible. Therefore, if we cannot determine the root cause with a reasonable amount of effort, we send the suspected processor back to the manufacturer. All the faulty processors reported in this paper have been confirmed by the manufacturer.
On the other hand, like any testing technique, the toolchain may not cover all SDCs. After extensive testing and debugging, we did find SDCs that cannot be detected by this toolchain. Therefore, this work should be considered a best-effort approach to detect and understand SDCs: both false negatives and false positives are possible.

Study Process and Approaches
We carry out large-scale tests in order to find faulty processors in Alibaba Cloud, both before production and during production. As shown in Figure 1, pre-production testing is carried out 1) after factory delivery (after the manufactured chip is shipped to Alibaba Cloud), 2) after datacenter delivery, and 3) after system re-installation (before a machine goes into production, a new system needs to be installed for its service). Then, in production, machines are regularly tested in groups. Testing for each group lasts about 2 weeks, and testing for the whole fleet takes months.
In these large-scale tests, we execute all the testcases in the toolchain sequentially, and each testcase is allocated an equal test duration specified by the administrator. We started such tests in January 2021, and so far, we have found hundreds of faulty processors.
Among these faulty processors, we have conducted extensive experiments on 27 of them for a more detailed analysis (the others have been returned to the manufacturer). Concretely, we have run tens of millions of tests and collected more than ten thousand SDC records from these tests to understand their potential impacts on applications and their reproducibility.
3 SDCs in the Wild

3.1 Brief Overview of Test Results
Observation 1. Overall, 3.61‱ of the CPUs are identified to cause SDCs in our study.
Our results are consistent with but are more precise than those reported by Google ("the order of a few mercurial cores per several thousand machines" [48]) and Meta ("hundreds of CPUs detected for SDCs in hundreds of thousands of machines" [46]). Google's and Meta's decisions not to disclose exact numbers may be due to business considerations. This observation substantiates the notion that SDCs represent a pervasive issue rather than a "black swan" event, especially in clusters with a large number of processors.
Observation 2. The failure rates observed during the pre-production testing period and the regular testing period amount to 3.262‱ and 0.348‱, respectively.
We have found faulty processors in every micro-architecture we have, indicating that SDCs are a general problem for modern processors. We have tested hundreds of thousands of samples for each of these micro-architectures. As shown in Table 2, the failure rates of different micro-architectures range from 0.082‱ to 9.29‱.

In our tests, the failure rate does not decrease with newer chips. This phenomenon can be attributed to multiple factors. Testing ability may improve as processors evolve, but the difficulty of testing also increases, as features and circuits become more complex. In fact, because of the complexity of the micro-architecture, comprehensive testing of a chip is prohibitively expensive, which makes faulty processors escaping high-volume manufacturing testing a fact of life [58]. Moreover, different micro-architectures have different degrees of maturity, which also affects their failure rates.
Despite extensive testing and the advancements in chip development, online services continue to be exposed to potential risks stemming from SDCs. This highlights the need for SDC-tolerant systems if cloud vendors are to enhance the reliability of their services.

Zooming in on Faulty Processors
Table 3 shows the hardware details and error information of a subset of our faulty processors as examples. We make the following observation by studying these faulty processors:
Observation 4. A single processor fault may affect an individual physical core or all cores within the processor.
In about half of the faulty processors, there exists only one defective physical core. This is probably because, in these faulty processors, the defects occur in components that can only be used by a single physical core, like arithmetic units. Note that multiple hardware threads, also known as logical cores, can share a single physical core. In most cases, all the logical cores sharing the same defective physical core are affected, and they fail the same testcases with a similar frequency.
In the other half of the faulty processors, defects impact all physical cores. Some are probably due to defects occurring in components shared by all cores, like the CPU cache. However, we do observe cases in which a defect impacts the same non-shared component of every core (e.g., MIX1 and MIX2 in Table 3). These cores fail the same testcases but at different frequencies. The difference can be up to several orders of magnitude under the same test setting, making some of the defective cores difficult to detect. We presume this phenomenon may come from defects in chip design and manufacturing. The proportion of processors with multiple defective cores in our study (i.e., about half) is significantly higher than what has been reported by Google [50], where a single processor with multiple defective cores is considered a low-probability event. We hypothesize that this difference is primarily due to the fact that we use a different toolchain, and our toolchain appears to have better detection capabilities for coherency problems among cores.
Large companies decommission the whole faulty processor or isolate the whole machine no matter which of its cores are identified as faulty [48,52]. This practice is reasonable given the relatively low failure rate. However, it could be worthwhile to investigate the feasibility of continuing to utilize the unaffected cores within a faulty processor [56].
4 Software Symptoms of SDCs

4.1 Impacted Workloads
Observation 5. SDCs exhibit a substantial prevalence in particular workloads, exposing five vulnerable features, namely arithmetic logic computation, vector operations, floating-point calculation, cache coherency and transactional memory.
This observation can be explained by two competing theories. On the one hand, it is possible that, compared to other features, these features are indeed more vulnerable due to their complexity (e.g., cache is known to occupy a big portion of the chip area [57]). On the other hand, it is also possible that other features are equally or even more vulnerable, but since operating systems and applications make heavy use of those features, a fault in them will cause a crash instead of an SDC [57]. Either way, this observation suggests that a developer can focus on a limited set of features when considering SDC-related issues.
Figure 2 shows the proportion of faulty processors per feature. Note that the sum of these proportions is bigger than 1. This is because a defect can occur on shared or integrated components of multiple features, and thus some processors can encounter errors among multiple features. For example, we find that Processor MIX1 has wrong execution results in both vector operations and complicated floating-point calculation, and we attribute this problem to the combination of FPU functionalities with vector units. Another example is CNST1, which fails to guarantee consistency in both cache and transactional memory.

Table 3. Hardware details and error information of a subset of our faulty processors. #pcore denotes the number of defective physical cores on the faulty processor; #err is the number of failed testcases on the faulty processor.

CPU id arch age(Y) #
We further categorize SDCs by defective feature into two types: computation and consistency. Computation SDCs are due to defective arithmetic operations, including arithmetic logic computation, vector operations and floating-point calculation. Consistency SDCs are due to defective features related to consistency guarantees, such as cache coherency and transactional memory. We distinguish these two types for two reasons. First, they require different testing strategies, since consistency SDCs can only be detected with multi-threaded tests. Second, we observe that, if one processor has multiple defective features, they always belong to one type. Among the 27 faulty processors we have tested extensively, 19 processors have computation defects and the remaining ones have consistency defects.
Since each testcase is designed to mimic a real-world workload, we can further speculate on potential impacts on real-world workloads, as sampled in Table 3. For example, Processor FPU1 produces incorrect results on a specific floating-point calculation operation, which is used by a library widely used in HPC applications. The wide impacts of certain SDCs are due to the wide usage of these defective features.
We have tried to further pinpoint which instructions are problematic, which turns out to be a challenging task. For some of these errors, the toolchain preserves the context and points out the incorrect instructions. For example, in SIMD1, the toolchain reports that a vector instruction that performs multiplication and addition operations simultaneously gives wrong results. The others, however, need manual investigation, and here we meet the classic problem that, since these errors are often hard to reproduce, it is unclear where to modify the testcases to print more information. Therefore, we turn to a statistical approach: we instrument the toolchain via Pin [9] to count the number of times each type of instruction is executed during each testcase. This method helps us narrow down the scope of suspected instructions. Take cases in Table 3 as examples: we find that one instruction, which uses the floating-point calculation feature to calculate a complex math function (arctangent), is a suspect in FPU1 and FPU2, because all the testcases using this instruction can reproduce SDCs and all the other testcases pass. In another example, we find that the instructions responsible for managing the transactional region are suspects in CNST2.
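The narrowing step of this statistical approach can be illustrated as a set computation. The following Python fragment is only a sketch under stated assumptions, not the instrumentation or analysis code used in the study: it assumes the per-testcase instruction profile has already been reduced to sets of instruction types, and all testcase and instruction names are hypothetical. It keeps the instruction types that every failed testcase executed and that no passing testcase executed, which matches the pattern reported for FPU1 and FPU2.

```python
from typing import Dict, Set


def suspect_instructions(executed: Dict[str, Set[str]], failed: Set[str]) -> Set[str]:
    """Instruction types used by every failed testcase and by no passing testcase.

    executed: testcase name -> set of instruction types it executed (assumed input format)
    failed:   names of testcases that reproduced SDCs
    """
    failed_sets = [instrs for name, instrs in executed.items() if name in failed]
    if not failed_sets:
        return set()
    # Instruction types common to all failed testcases.
    common = set.intersection(*failed_sets)
    # Instruction types that appear in at least one passing testcase.
    used_by_passing: Set[str] = set()
    for name, instrs in executed.items():
        if name not in failed:
            used_by_passing |= instrs
    return common - used_by_passing


# Hypothetical profile, for illustration only.
profile = {
    "tc_a": {"add", "mul", "atan_op"},
    "tc_b": {"load", "atan_op"},
    "tc_c": {"add", "load", "mul"},
}
print(suspect_instructions(profile, failed={"tc_a", "tc_b"}))
```

This strict filter returns an empty set whenever a defective instruction also appears in a passing testcase, so in such cases a comparison of per-testcase execution counts may be a more useful starting point.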
However, not all errors have obvious suspected instructions. The SDCs in CNST1 cause cache coherence issues, and we fail to locate the suspected instructions. This is reasonable, since cache coherence mechanisms are mostly hidden from a program, so a program often does not invoke a specific instruction for cache coherence.
It should be noted that not all testcases executing a defective instruction will generate errors. For example, in MIX1, we find that a defective instruction is used in seven testcases, but only two of them generate errors. We study the triggering conditions in detail in Observation 10.

Error Breakdown in Mis-calculated Data
We further study the properties of computation SDCs to understand the influence of defective features on the workload results. We exclude consistency SDCs from this investigation since they do not have a deterministic pattern. Figure 3 shows the proportion of faulty processors involving each operation datatype. We find that all datatypes under test are impacted by SDCs, and that floating-point datatypes involve more faulty processors than other datatypes. Two reasons contribute to this. First, many different vulnerable features are related to floating-point calculation, including vector operations with floating-point datatypes and specific floating-point calculations. Second, some floating-point operations, such as trigonometric functions, are complex, which increases the difficulty of designing and testing the relevant processor features.
Observation 7. For floating-point numbers, bitflips predominantly occur in the fraction part, resulting in minor precision losses.
For computation SDCs, we investigate which bits differ between the expected result and the actual result, a phenomenon also known as bitflips. Figure 4(a)-4(d) shows the bitflips of different numerical data types. We find that bitflips rarely occur in the most significant bits. Note that this bitflip pattern does not apply to non-numerical data, in which all the positions have a comparable number of bitflips (Figure 5).
Furthermore, we find that about half (51.08%) of the bitflips change a bit from zero to one, which means there is no general tendency in bitflip direction. However, a tendency exists in some corner cases. For example, in the 16-bit integer data statistics of MIX1, 72.27% of bitflips are from zero to one.
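As an illustration of how bitflip direction can be tallied from SDC records, the sketch below counts zero-to-one and one-to-zero flips by comparing an expected and an actual integer bit pattern. It is a minimal example, assuming each record is available as a pair of raw integers; the function names and the sample records are hypothetical and do not come from the study's tooling.

```python
from typing import Iterable, Tuple


def flip_directions(expected: int, actual: int) -> Tuple[int, int]:
    """Return (zero_to_one, one_to_zero) counts for one SDC record."""
    mask = expected ^ actual                      # bits that differ
    zero_to_one = bin(mask & actual).count("1")   # differing bits that are 1 in the actual result
    one_to_zero = bin(mask & expected).count("1") # differing bits that were 1 in the expected result
    return zero_to_one, one_to_zero


def zero_to_one_share(records: Iterable[Tuple[int, int]]) -> float:
    """Fraction of all flipped bits that changed from zero to one."""
    up = down = 0
    for expected, actual in records:
        u, d = flip_directions(expected, actual)
        up += u
        down += d
    total = up + down
    return up / total if total else 0.0


# Hypothetical records (expected, actual), for illustration only.
sample = [(0b0000, 0b0011), (0b1111, 0b1110)]
print(zero_to_one_share(sample))
```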
The impact of an SDC on numerical data depends on the type of the data. In the floating-point encoding standard (i.e., IEEE-754 [17]), the bits are divided into three parts: sign, exponent, and fraction. A bitflip usually hits the fraction part, probably for two reasons. First, for floating-point numbers, the computation logic of the fraction part is more complex than that of the exponent part, making the fraction part more vulnerable. Second, we observe a concentration of many bitflips in the middle of the data and a gradual decrease towards the ends, which is consistent with related research on the failure distribution in registers [57]. Given the relatively long fraction part in the data, most of the bitflips tend to occur in this part. Because IEEE-754 assumes an implicit leading 1 in the fraction, the relative precision loss caused by one bitflip in the fraction depends only on the position of the bit and not on the value of the number. Other datatypes do not have this property. For example, for an integer with a small value, a bitflip in a less significant bit can still cause a significant precision loss.
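The position dependence can be checked numerically. The sketch below flips one fraction bit of a double-precision number and computes the relative loss; it is an illustrative example rather than part of the study, and the function names and sample values are hypothetical. For a normal binary64 value, flipping fraction bit k (counting from 0 at the least significant bit) changes the value by 2^(k-52) times its power-of-two scale, so the relative loss lies between 2^(k-53) and 2^(k-52) regardless of magnitude, i.e., it is fixed by the position to within a factor of two. The same bit position in a small integer gives a very different result.

```python
import struct


def flip_bit_f64(x: float, pos: int) -> float:
    """Flip bit `pos` (0 = least significant) in the IEEE-754 binary64 encoding of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << pos)))
    return flipped


def relative_loss(expected: float, actual: float) -> float:
    """Relative difference between the expected and the corrupted value."""
    return abs(actual - expected) / abs(expected)


# Floating point: the same fraction bit gives a similar relative loss at any magnitude.
for x in (3.0, 3.0e10, 3.0e-10):
    print(x, relative_loss(x, flip_bit_f64(x, 30)))

# Integer: the same bit position gives very different relative losses.
for n in (5, 5_000_000):
    print(n, abs((n ^ (1 << 3)) - n) / n)
```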
We show the relative precision losses between expected data and actual data in Figure 4(e)-4(h). Since the bitflips we observed mostly occur in the fraction bits, the precision losses of floating-point datatypes are small. For example, all of the precision losses on extended double precision (80-bit) floating-point numbers are less than 0.002%; 99.9% of the precision losses on double precision (64-bit) floating-point numbers are less than 0.02%; and 80.25% of the precision losses on single precision (32-bit) floating-point numbers are less than 5%. In contrast, 40.2% of the precision losses on 32-bit integer data are bigger than 100%.
Observation 8. For a specific failed setting (i.e., a combination of a testcase and a processor), bitflips tend to manifest at fixed position(s) within the number representations.
We define a bitflip pattern as a set of positions where bitflips occur regardless of the inputs in a given setting, i.e., a combination of a testcase and a processor. To find bitflip patterns, we use a mask, i.e., the exclusive-or of the expected result and the actual result, to represent the bitflip positions. If more than 5% of the SDC records of a setting have the same mask, we regard this mask as a bitflip pattern. A setting can have multiple bitflip patterns in our observations. We suspect this is because multiple instructions in the testcase are impacted by the defect and these instructions do not reproduce errors stably (i.e., in one run of a testcase, some of them fail but others succeed), so different combinations of erroneous instructions generate different bitflip patterns. Figure 6 shows the proportion of SDC records with bitflip patterns in some settings. One potential explanation for bitflip patterns is that the hardware defect of a specific faulty processor has a deterministic effect, and thus these bitflips tend to occur at fixed position(s).
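The mask-based definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the SDC records of one setting are available as (expected, actual) integer pairs; the function name, the record format and the sample data are hypothetical. A mask counts as a bitflip pattern when it accounts for more than 5% of the records, as in the text.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def bitflip_patterns(records: Iterable[Tuple[int, int]], threshold: float = 0.05) -> List[int]:
    """Return the masks shared by more than `threshold` of one setting's SDC records."""
    masks = [expected ^ actual for expected, actual in records]  # XOR marks the flipped positions
    total = len(masks)
    if total == 0:
        return []
    counts = Counter(masks)
    return sorted(mask for mask, count in counts.items() if count / total > threshold)


# Hypothetical records for one (testcase, processor) setting, for illustration only.
records = (
    [(0x00, 0x10)] * 60          # mask 0x10: one flipped bit
    + [(0xFF, 0xFF ^ 0x30)] * 37 # mask 0x30: two flipped bits
    + [(0, 1), (0, 2), (0, 4)]   # three masks seen once each (1% each)
)
for mask in bitflip_patterns(records):
    print(hex(mask), "flipped bits:", bin(mask).count("1"))
```

Counting the set bits of each mask, as in the final loop, gives the number of flipped bits per pattern.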
We further analyze the number of flipped bits within SDCs belonging to some bitflip pattern across different data types in Figure 7. As shown in this figure, in most cases only one bit flips, but there is also a considerable number of SDCs with two or even more flipped bits.
Deficiencies of current failure models. Current failure models, such as models based on irradiation, often assume that every bitflip on every position is independent and identically distributed (IID) [26]. Another assumption made by current SDC failure models is that multiple flipped bits are unlikely events [8]. Our observations challenge current SDC failure models, and suggest areas for improvement:
• Location preference: Our study has shown that bitflips tend not to occur in the most significant bits in some datatypes (Observation 7).
• Correlation between bitflips: Our study has revealed that there exists a correlation between bitflips, causing some SDCs to have multiple bits flipped (Observation 8).
Further research is necessary to study the application implications of this model. Although floating-point numbers seem to incur less accuracy loss, some use cases, such as finance data management and autonomous vehicles, demand more reliable data [54]. It may also be possible to promote data reliability by designing encoding standards that take these bitflip patterns into consideration.

Reproducibility of SDCs
After we identify that a CPU can fail in a testcase, we repeatedly run the failed testcase to understand the reproducibility of the problem. Since the length (i.e., number of loops) of each testcase is configurable and the chance of triggering an error depends on the length of the testcase, we use occurrence frequency, defined as the number of errors per minute, to quantitatively measure reproducibility. Since the occurrence frequency depends on both the CPU and the workload (i.e., testcase), we record the occurrence frequency per setting.
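As a small worked illustration of this metric, the sketch below aggregates repeated runs into an occurrence frequency (errors per minute) per setting. The run-record format, the function name and the sample values are hypothetical and only serve to make the definition concrete.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Setting = Tuple[str, str]  # (processor id, testcase name)


def frequency_per_setting(
    runs: Iterable[Tuple[str, str, int, float]]
) -> Dict[Setting, float]:
    """Errors per minute for each setting.

    runs: iterable of (processor id, testcase name, error count, run length in minutes)
    """
    errors: Dict[Setting, int] = defaultdict(int)
    minutes: Dict[Setting, float] = defaultdict(float)
    for cpu, testcase, error_count, run_minutes in runs:
        errors[(cpu, testcase)] += error_count
        minutes[(cpu, testcase)] += run_minutes
    # Settings with no accumulated run time are skipped to avoid dividing by zero.
    return {s: errors[s] / minutes[s] for s in errors if minutes[s] > 0}


# Hypothetical runs, for illustration only.
runs = [
    ("cpu1", "tc_a", 3, 10.0),
    ("cpu1", "tc_a", 1, 10.0),
    ("cpu1", "tc_b", 0, 5.0),
]
print(frequency_per_setting(runs))
```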
Observation 9. Some SDCs are highly reproducible, resulting in large impact on applications.
We find that SDC occurrence frequency varies significantly across different settings, from as low as 0.01 times per minute to as high as hundreds of times per minute. In 51.2% of the settings, the occurrence frequency is higher than once per minute.
The high reproducibility of certain SDCs and the fact that existing systems are not designed to tolerate SDCs mean that these SDCs can manifest quickly and repeatedly in production, which is confirmed by our case studies discussed in Section 2.2. For example, a service in Alibaba Cloud falsely reported 26 invalid-data errors in approximately 4.5 hours because of one faulty processor, which impacted the system performance. This suggests that, although the failure rate of processors is low, SDCs could potentially have a large impact, especially if they are not detected promptly.
Observation 10. Among the less reproducible SDCs, temperature serves as an important SDC triggering condition. In some settings, the occurrence frequency of SDCs grows exponentially with increasing temperature. Furthermore, the occurrence frequency is associated with the minimum triggering temperature across different faulty processors and workloads.
It is well known that temperature impacts the functioning of semiconductors [13,22]. Processors have an allowable range for their working temperature, and datacenters strive to minimize temperature influence through cooling systems. However, we observe that even when the temperature remains within the allowable range during workload execution, a rising temperature can still increase the occurrence frequency of SDCs.
We investigate the quantitative relationship between SDC occurrence frequency and temperature. We monitor the processor temperature during testcase execution by reading cooling device monitor data from a system kernel file. Some settings can naturally reach a temperature that is close to the upper bound of the processor's working temperature, which allows us to collect adequate testcase execution information at different temperatures. Some settings cannot reach a high temperature naturally. For these settings, before testing, we use stress toolchains (e.g., the Linux "stress" command-line tool) to preheat the processor to the desired temperature.
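For illustration, one way to sample temperatures from kernel files on Linux is the sysfs thermal interface, where each `/sys/class/thermal/thermal_zone*/temp` file reports a value in millidegrees Celsius. The text does not name the file that was actually read, so the path and the helper names in the following sketch are assumptions, not a description of the monitoring used in the study.

```python
from pathlib import Path
from typing import Dict


def parse_millidegrees(text: str) -> float:
    """Convert the content of a sysfs `temp` file (millidegrees Celsius) to degrees Celsius."""
    return int(text.strip()) / 1000.0


def read_all_zones(base: str = "/sys/class/thermal") -> Dict[str, float]:
    """Read every thermal zone under `base` (assumed path); unreadable zones are skipped."""
    readings: Dict[str, float] = {}
    for temp_file in sorted(Path(base).glob("thermal_zone*/temp")):
        try:
            readings[temp_file.parent.name] = parse_millidegrees(temp_file.read_text())
        except (OSError, ValueError):
            # Some zones cannot be read or return non-numeric content.
            continue
    return readings


# On machines without this interface the result is simply an empty dict.
print(read_all_zones())
```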
By taking the base-10 logarithm of the SDC occurrence frequency, we find, using the least squares method, that this value has a linear dependence on core temperature on six of our 27 processors. Figure 8(a)-8(c) displays this relation for some faulty processors; their Pearson correlation coefficients are greater than 0.75, which confirms the exponential correlation between temperature and SDC occurrence frequency.
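The fit described above can be sketched as an ordinary least-squares regression of log10(frequency) on temperature, together with the Pearson correlation coefficient. The Python code below is a minimal illustration: the function name is hypothetical, and the data points are synthetic values generated from an exact exponential, not measurements from the study.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(temps: Sequence[float], freqs: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of log10(freq) = a * temp + b.

    Returns (a, b, r), where r is the Pearson correlation coefficient
    between temperature and log10(frequency). Frequencies must be positive.
    """
    ys = [math.log10(f) for f in freqs]
    n = len(temps)
    mean_x = sum(temps) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in temps)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r


# Synthetic data following freq = 10 ** (0.1 * temp - 6), for illustration only.
temps = [50.0, 55.0, 60.0, 65.0, 70.0]
freqs = [10 ** (0.1 * t - 6.0) for t in temps]
slope, intercept, pearson_r = fit_log_linear(temps, freqs)
print(slope, intercept, pearson_r)
```

A slope of a in this fit means the occurrence frequency is multiplied by 10^a for every additional degree, which is the sense in which the growth is exponential.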
Furthermore, we observe that in some settings, SDCs only occur when the temperature exceeds some threshold.For example, we observe all the SDC records with testcase C on MIX1 are generated with their temperature above 59℃, which is much higher than its idle temperature (about 45℃), but is still within the normal range.Tests below this temperature threshold have been extensively conducted for several days, but cannot reproduce errors.
In our large-scale tests, we experience several counterintuitive cases, which we later find to be caused by temperature issues: • Other core behaviors: We observe one defective core only produces errors when other cores are busy, with its occurrence frequency increasing as the number of busy cores increases.It is surprising because the defective component is not shared between cores.Upon further investigation, we discover that the cores share cooling devices, which results in the defective core reaching a higher temperature when other cores are busy.• Remaining heat: We observe one faulty processor generates errors depending on the test order.For example, errors in testcase Y occur when testcase X is executed prior to testcase Y, and fail to occur with reversed test order.We later discover that testcase X exerts significant stress on the processor and produces considerable amount of heat, resulting in testcase Y being tested at a temperature that is difficult to attain when solely executing testcase Y. • Toolchain update.We observe that after updating to use a higher version of the detection toolchain, the occurrence frequency of some SDCs in a faulty processor decreased, which was surprising as the update did not modify the logic of the testcases and we had not changed any other test configuration.Further investigation revealed that the updated toolchain uses a more efficient framework, which reduced the heat generated.
Besides temperature, there also exist other triggering conditions. Recall that many testcases do not exhibit errors even when they use instructions identified as defective or suspected, as discussed in Section 4.1. Our run-time instrumentation study further reveals that instruction usage stress is one of the reasons behind this observation. Failed testcases use the defective instruction several orders of magnitude more frequently than other testcases, highlighting the impact of instruction usage stress on error occurrence. Since temperature is highly correlated with stress, we use the following method to separate their effects: we run a stress toolchain on some cores that are not under test while executing test workloads on the target cores. In this experiment, since the heat is mainly produced by the stress toolchain and dissipated by cooling devices, the tested workload has little impact on temperature. With this approach, we can increase CPU utilization in the faulty processor with the temperature almost unchanged, and we observe a higher occurrence frequency of SDCs under high CPU utilization.
SDC Mitigation using multiple strategies. We further explore the design space to mitigate SDCs by combining multiple strategies in a coordinated and complementary manner. Figure 9 illustrates the relationship between the minimum triggering temperature for SDCs and their occurrence frequency at that minimum triggering temperature. We perform a linear fit between the logarithmic values of occurrence frequency and the values of minimum triggering temperature, yielding a Pearson correlation coefficient of -0.8272, which indicates a relatively strong correlation.
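The log-linear fit described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names are hypothetical, and the five data points are made up (they lie exactly on a line in log space, so the coefficient comes out at -1 rather than the -0.8272 measured on real SDC settings).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def log_linear_fit(temps, freqs):
    """Least-squares fit of log10(freq) = slope * temp + intercept."""
    logs = [math.log10(f) for f in freqs]
    n = len(temps)
    mean_t = sum(temps) / n
    mean_l = sum(logs) / n
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(temps, logs))
    sxx = sum((t - mean_t) ** 2 for t in temps)
    slope = sxy / sxx
    intercept = mean_l - slope * mean_t
    return slope, intercept, pearson(temps, logs)

# Made-up SDC settings: minimum triggering temperature (degrees Celsius)
# and occurrence frequency at that temperature (arbitrary unit).
min_temps = [45.0, 50.0, 55.0, 60.0, 65.0]
frequencies = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]

slope, intercept, r = log_linear_fit(min_temps, frequencies)
print(f"slope={slope:.3f} per degree, r={r:.4f}")
```

A negative slope means that SDCs which only appear at higher temperatures also tend to occur less frequently, which is the pattern that motivates treating them differently from SDCs that appear near idle temperature.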
Motivated by Figure 9, we classify SDCs into two types based on the occurrence frequency and minimum triggering temperature: apparent and tricky. Different types of SDCs are suitable for different mitigation strategies. "Apparent" SDCs can be detected near idle temperature and exhibit high occurrence frequency, making them suitable for SDC tests.
On the other hand, "tricky" SDCs have a higher minimum triggering temperature than "apparent" SDCs and tend to have a relatively low occurrence frequency. For these SDCs, relying solely on SDC testing would require maintaining processors at high temperatures for a long time, which can be detrimental to processor health. Even worse, since we do not know whether a CPU has such tricky SDCs in the first place, we would need to apply such long high-temperature tests to all CPUs, which is inefficient. Instead of testing, we propose to control the CPU temperature at run time to mitigate this type of SDC. We can control the temperature either by controlling the cooling devices [7] or by limiting the CPU utilization of the workloads (called "workload backoff" in the rest of this paper). The former has no impact on application performance, but unfortunately it is not widely applicable in Alibaba Cloud yet, so this work explores the latter. Workload backoff can also reduce instruction usage stress, which is another known triggering condition. Section 7.1 presents how to apply this idea in detail, in particular how to adaptively adjust the temperature threshold and test duration.
Many cloud vendors, such as Alibaba Cloud, Meta [52] and Google [50], conduct SDC tests to remove faulty processors before SDC generation. However, SDC testing can be inefficient without guidance.
Observation 11. In a production environment with tens of thousands of CPUs, 560 out of the 633 testcases have not detected any errors.
Unfortunately, we only have detailed test logs for a subset of the CPUs we have tested (for others, we only know whether the CPU is identified as faulty or not), but we believe they can shed light on how to improve test efficiency.
In this production environment, although we allocate the same test resources to all testcases, 560 of the 633 testcases fail to detect any faults. Moreover, if we further look at each CPU micro-architecture, the number of ineffective testcases per micro-architecture is even higher. This is reasonable since companies usually buy a specific type of CPU in a batch, and a specific type or batch of CPUs may be vulnerable in the same way. This motivates our following proposal to prioritize tests, considering that companies like chip manufacturers and cloud vendors may have a large amount of historical data to guide testing: in pre-production tests, since test resources are adequate, every testcase can be fully tested; in regular tests, during which test resources are limited, we can give a longer duration to testcases that have found SDCs in either pre-production tests or earlier regular tests.
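As a rough illustration of this proposal, the sketch below splits a fixed regular-test budget so that testcases with a detection history receive a longer duration. The function name, the `boost` weight, and the testcase names are hypothetical; the text only states that such testcases get more time, not how much more.

```python
def allocate_durations(testcases, history_hits, total_budget, boost=4.0):
    """Split total_budget across testcases, favoring those with past detections.

    testcases:    list of testcase names to run in this round.
    history_hits: set of names that found SDCs in pre-production or earlier
                  regular tests.
    boost:        hypothetical weight multiplier for testcases with a history.
    """
    if not testcases:
        return {}
    weights = {name: (boost if name in history_hits else 1.0) for name in testcases}
    total_weight = sum(weights.values())
    return {name: total_budget * w / total_weight for name, w in weights.items()}

# Made-up example: four testcases share 70 minutes; only "C" has found SDCs before.
plan = allocate_durations(["A", "B", "C", "D"], {"C"}, total_budget=70.0)
for name, minutes in plan.items():
    print(f"{name}: {minutes:.1f} min")
```

With equal allocation each testcase would receive 17.5 minutes; under this weighting, "C" receives 40 minutes and the others 10 minutes each, so no testcase is dropped entirely.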
On the other hand, there exist cases where our toolchain fails to detect faults. We observe that these faulty processors only manifest SDCs under some complex multi-thread conflict scenarios that are difficult to cover with existing testcases. We believe these issues will be addressed in the future with more comprehensive and powerful testcases contributed by both academia and industry. Since our toolchain is not publicly available, for those who are interested in this field, we recommend OpenDCDiag [60], since we have validated that it leads to the same observations as our toolchain. Another similar tool is SiliFuzz [50], but we have not had the chance to try it.

SDC Detection and Tolerance
Unlike SDC testing, which performs proactive detection before SDC generation, there exist many approaches to detect SDCs after their generation. This section discusses how our observations challenge these approaches.
Observation 12. The effectiveness of existing fault tolerance techniques is diminished when confronted with CPU SDCs.
Checksum and parity. End-to-end checksums are widely used to detect data corruptions and verify data integrity in the datapath [20,27,47]. For example, checksum calculation algorithms, such as Cyclic Redundancy Check (CRC) and hashing, reduce the data to a smaller summary, which can be used to check the integrity of the original data. Error Correcting Code (ECC) can detect and correct errors in processor caches and registers by leveraging parity bits [14,59]; Erasure Coding (EC) techniques apply parity information to recover transferred or stored data when they are lost [2,24].
However, we observe that these techniques are often ineffective at detecting CPU SDCs for multiple reasons: 1) EC is primarily used to recover lost data, not to detect corrupted data. 2) ECC and CRC assume the data is correct when computing the parity and can afterwards detect bitflips in either the data or the parity bits, but a CPU SDC may generate a wrong result before the parity is computed, in which case these techniques generate a parity that matches the already corrupted data. 3) Even if the corruption happens after the parity is generated, standard ECC can correct only single-bitflip errors and detect double-bitflip errors, but our study shows that multiple-bitflip errors are possible (Observation 8).
Even worse, some of these checksum algorithms engage vulnerable features heavily, which means they are more vulnerable to CPU SDCs. For example, both EC and CRC heavily involve vector operations [21,61], which is one of the vulnerable features (Observation 5), to accelerate computation. For EC, this is particularly dangerous since EC itself does not have the ability to detect corruptions, and thus a corrupted data block may be used to construct a lost data block, causing the corruption to propagate.
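The second reason listed above, a checksum computed over data that is already wrong, can be illustrated with a small sketch. The `faulty_add` function is a simulated stand-in for a defective core and is purely hypothetical; `zlib.crc32` is the standard-library CRC-32.

```python
import zlib

def faulty_add(a, b):
    """Simulated defective core: flips bit 3 of the correct sum (hypothetical fault)."""
    return (a + b) ^ (1 << 3)

def store_with_crc(value):
    """Serialize value and attach a CRC-32 computed over the serialized bytes."""
    payload = value.to_bytes(8, "little", signed=False)
    return payload, zlib.crc32(payload)

def verify(payload, crc):
    """Return True if the payload still matches its stored CRC-32."""
    return zlib.crc32(payload) == crc

correct = 1000 + 24                 # 1024
corrupted = faulty_add(1000, 24)    # 1032: wrong before any checksum exists

# The checksum is derived from the corrupted value, so verification passes.
payload, crc = store_with_crc(corrupted)
sdc_passes_check = verify(payload, crc)

# In contrast, a bitflip introduced after the checksum was computed is caught.
flipped = bytes([payload[0] ^ 0x01]) + payload[1:]
later_flip_detected = not verify(flipped, crc)

print(corrupted != correct, sdc_passes_check, later_flip_detected)
```

The checksum protects the bytes from the point where it was computed onward; it says nothing about whether the computation that produced those bytes was correct.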
Redundancy. Some works apply redundancy to detect and tolerate SDCs [3,6,11,18,19,23,25,27,55]: they execute the same logic on multiple replicas and compare their results to detect and even correct errors. The redundancy can also be implemented in hardware, for example with the DCLS (dual-core lockstep) technique [39]. However, considering the low failure rate of CPUs, such techniques are too costly to be applied to every application, though they may be suitable for a small number of critical applications.
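A minimal sketch of the replica-and-compare idea is shown below, assuming results are hashable and that a strict majority vote decides the outcome. The `redundant_execute` helper and the simulated faulty replica are illustrative assumptions, not taken from any of the cited systems.

```python
from collections import Counter

def redundant_execute(replicas, *args):
    """Run the same computation on every replica and vote on the result.

    Returns (result, suspects): the strict-majority result and the indices of
    replicas that disagreed with it. Raises RuntimeError when no strict
    majority exists, i.e. an error is detected but cannot be corrected.
    """
    results = [run(*args) for run in replicas]
    winner, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("replicas disagree and no majority exists")
    suspects = [i for i, r in enumerate(results) if r != winner]
    return winner, suspects

def healthy_mul(a, b):
    return a * b

def faulty_mul(a, b):
    """Simulated defective core: flips the lowest bit of the product."""
    return (a * b) ^ 1

# Three replicas tolerate one faulty result; two replicas can only detect it.
result, suspects = redundant_execute([healthy_mul, faulty_mul, healthy_mul], 6, 7)
print(result, suspects)
```

The cost is visible in the sketch itself: every call is executed once per replica, which is why such schemes are usually reserved for a small number of critical applications.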
Prediction. Some works use machine learning models to predict the appearances of SDCs [29-31, 40-42, 49]. Part of them predict a range for the result and assert a silent error when the real result is out of this range [29-31]. However, real SDCs may have minor precision losses (Observation 7), making it challenging for these methods to determine a narrow range for detecting SDCs. On the other hand, whether such minor losses are acceptable is a topic requiring further investigation.
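The difficulty with range-based detection can be seen in a small sketch. The predicted value, the tolerance, and the helper names below are hypothetical; the bit positions refer to the IEEE 754 double-precision layout (bits 0 to 51 hold the fraction, bits 52 to 62 the exponent).

```python
import struct

def flip_bit(x, bit):
    """Return the double obtained by flipping one bit of x's IEEE 754 encoding."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

def out_of_range(value, predicted, tolerance):
    """Range-based detector: flag a silent error if value is far from the prediction."""
    return abs(value - predicted) > tolerance

expected = 3.141592653589793
predicted = 3.1416   # hypothetical model prediction
tolerance = 1e-3     # hypothetical acceptance range

minor_loss = flip_bit(expected, 20)   # low fraction bit: tiny precision loss
major_error = flip_bit(expected, 52)  # lowest exponent bit: value doubles

print(out_of_range(minor_loss, predicted, tolerance))   # not flagged
print(out_of_range(major_error, predicted, tolerance))  # flagged
```

Shrinking the tolerance enough to catch the fraction-bit flip would require a prediction accurate to better than one part in a billion, which illustrates why a narrow range is hard to determine.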
On the other hand, our study shows some new opportunities to detect and tolerate CPU SDCs: Considering only a small number of features or instructions are vulnerable, can we design techniques targeting those vulnerable features? Considering temperature is a key factor, can we control the temperature to mitigate SDCs? Considering bitflips have location preference, can we design better coding techniques? The next section explores some of these ideas.

Farron: An Efficient SDC Mitigation Tool
To illustrate how our observations assist in SDC mitigation, we propose a concrete strategy called Farron, which improves Alibaba Cloud's strategies based on the aforementioned observations. Farron is able to protect applications from CPU SDCs with low overhead and high testing efficiency.
Baseline. Existing strategies used by Alibaba Cloud mitigate SDC impacts by conducting proactive SDC testing, which identifies and removes faulty processors before they generate SDCs. In summary, SDC tests are conducted both in pre-production and every three months during production, and in every round of tests, all testcases are executed sequentially and allocated equal testing resources. If any core of a processor is detected as defective, Alibaba Cloud deprecates the entire processor.

Design
Due to the limitation of SDC testing, Farron uses temperature controls as a complement to SDC testing, based on our insight from Observation 10. To determine when to activate temperature controls and when to apply SDC testing, Farron establishes a temperature boundary, which is adaptive to actual run-time conditions of the application. Farron further performs efficiency-focused SDC tests, especially in regular SDC tests. Moreover, Farron employs fine-grained processor decommission (Observation 4) and maintains a reliable resource pool to manage unaffected cores [56].
Figure 10 illustrates the Farron workflow, which operates in three states: pre-production, online, and suspected. SDC tests with adequate resources are performed during the pre-production state. During the online state, the user application is executed on cores that have been proven reliable through SDC testing and operates under the triggering condition controlled by Farron. Regular SDC tests are conducted in this state for long-term protection. In the event that SDC tests fail, Farron performs in-depth SDC tests targeted at the suspected processor, and adjusts the reliable resources based on the analysis of test results.
Adaptive temperature boundary. As mentioned in Observation 10, different SDCs can be divided by a temperature boundary, which decides when to perform triggering condition controls and how long SDC testing needs to execute. The primary challenge Farron faces is how to determine the boundary. If the boundary is set too high, a long SDC testing duration is required to guarantee reliability under high temperature (Observation 10). Conversely, if the boundary is set too low, some triggering condition controls, such as workload backoff, will be frequently activated, leading to impacts on application performance.
Farron assigns the highest priority to application performance, and therefore avoids frequent use of workload backoff. To accomplish this, Farron uses separate temperature boundaries for cooling device operation and workload backoff, and makes the boundary for workload backoff adaptive. Farron employs a window to track recent temperature monitoring records, raising the temperature boundary for workload backoff if more than half of the temperature records within the window exceed the current boundary, which indicates that the temperature is within the normal working range for the application in the given situation. By iteratively increasing the temperature threshold, Farron autonomously learns the standard working temperature, thereby preventing the excessive use of workload backoff. If less than half of the temperature records exceed the current boundary, workload backoff is triggered until the temperature is below the boundary.
Farron further adjusts the regular test duration based on this adaptive temperature boundary, adhering to the patterns outlined in Observation 10 (i.e., a lower temperature boundary is allocated a shorter test duration).
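The window-based adjustment of the workload backoff boundary can be sketched as follows. This is one possible reading of the description, not Farron's implementation: the class name, window size, and step are hypothetical, the boundary is only raised once the window is full, backoff is assumed to apply only while the current reading exceeds the boundary, and the case where exactly half of the records exceed the boundary (which the text leaves open) is treated like the minority case.

```python
from collections import deque

class AdaptiveBoundary:
    """Sliding-window controller for the workload backoff temperature boundary."""

    def __init__(self, boundary, window_size=4, step=1.0):
        self.boundary = boundary          # current backoff boundary (degrees Celsius)
        self.step = step                  # hypothetical increment per adjustment
        self.window = deque(maxlen=window_size)

    def observe(self, temperature):
        """Record a reading; return True if workload backoff should be active."""
        self.window.append(temperature)
        window_full = len(self.window) == self.window.maxlen
        above = sum(1 for t in self.window if t > self.boundary)
        if window_full and above * 2 > len(self.window):
            # Most recent readings exceed the boundary: treat this temperature
            # as normal for the application and raise the boundary.
            self.boundary += self.step
            return False
        # Otherwise back off while the current reading is above the boundary.
        return temperature > self.boundary

controller = AdaptiveBoundary(boundary=59.0)
readings = [55.0, 56.0, 57.0, 61.0, 61.0, 61.0, 61.0]
decisions = [controller.observe(t) for t in readings]
print(decisions, controller.boundary)
```

In this made-up trace, the first readings above 59℃ trigger backoff, but once they dominate the window the boundary is raised step by step to 61℃ and backoff stops, mirroring how the boundary converges toward the application's usual working temperature.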
Efficiency-focused SDC testing. Due to the constraints of online test resources, regular tests are conducted with an emphasis on testing efficiency. However, given the limited guidance available for SDC tests, achieving efficiency in existing testing procedures proves challenging (Observation 11). Farron seeks to enhance SDC testing efficiency by drawing on insights related to testcase prioritization, targeted features and testing environments (Observations 5, 10 and 11).
We designate targeted features and priorities for testcases, establishing three distinct priority levels: basic, active, and suspected. The "basic" priority is assigned to testcases that, despite being designed for a particular feature, fail to detect faults in our large-scale tests. The "active" priority is designated for testcases with proven track records of successfully identifying defective features. Lastly, the "suspected" priority is only assigned to testcases that have detected errors on the core(s) of the current processor.
Farron mainly allocates testing resources to testcases whose targeted feature is utilized by the protected application, focusing on those marked as "suspected" (if any) and "active". Remaining testcases are tested in a best-effort mode, ensuring a comprehensive but efficient testing approach.
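A sketch of this selection step is given below, assuming each testcase is tagged with one targeted feature and one priority level. The data model, the function name, and the testcase and feature names are hypothetical and only illustrate the filtering and ordering described above.

```python
from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    BASIC = 0      # has not detected faults in large-scale tests
    ACTIVE = 1     # has a track record of identifying defective features
    SUSPECTED = 2  # has detected errors on the current processor

@dataclass(frozen=True)
class Testcase:
    name: str
    feature: str
    priority: Priority

def plan_regular_test(testcases, app_features):
    """Split testcases into a focused list and a best-effort list.

    Focused: targets a feature used by the protected application and is
    "suspected" or "active", with "suspected" first. Everything else runs
    in best-effort mode.
    """
    focused = [t for t in testcases
               if t.feature in app_features and t.priority >= Priority.ACTIVE]
    focused.sort(key=lambda t: t.priority, reverse=True)
    best_effort = [t for t in testcases if t not in focused]
    return focused, best_effort

# Made-up testcases and features.
suite = [
    Testcase("vec_fma", "vector", Priority.ACTIVE),
    Testcase("vec_add", "vector", Priority.BASIC),
    Testcase("fp_div", "float", Priority.SUSPECTED),
    Testcase("aes_round", "crypto", Priority.ACTIVE),
]
focused, best_effort = plan_regular_test(suite, app_features={"vector", "float"})
print([t.name for t in focused], [t.name for t in best_effort])
```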
Additionally, we place a strong emphasis on the testing environment. Farron initiates the testing by running burn-in workloads and tests every core in a processor simultaneously to increase core temperature while testing (Observation 10). We believe this testing method can cover the application execution temperature, since testcases in the toolchain are stressful and effectively generate heat.
Fine-grained processor decommission. Identifying all defective cores in a faulty processor can prove difficult, as some defects may be challenging to detect (Observation 4). Initially, Farron accumulates testcases with the "suspected" priority by performing adequate testing on the cores identified with defects. By conducting adequate SDC tests targeted on these "suspected" testcases, Farron can efficiently validate the function of the remaining cores (Observation 4). If more than two cores within a processor are found defective, Farron deprecates the entire processor in line with the pattern presented in Observation 4. Otherwise, Farron masks the defective core(s) and continues utilizing the other cores as normal.
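The decommission rule can be written down in a few lines. The function name and return convention below are hypothetical; only the threshold of two defective cores comes from the description above.

```python
def decommission(defective_cores, all_cores, max_defective=2):
    """Decide between masking defective cores and deprecating the processor.

    Returns ("deprecate", set()) when more than max_defective cores are
    defective, otherwise ("mask", usable_cores).
    """
    defective = set(defective_cores)
    if len(defective) > max_defective:
        return "deprecate", set()
    return "mask", set(all_cores) - defective

cores = range(8)  # made-up 8-core processor
action_one, usable_one = decommission({3}, cores)
action_many, usable_many = decommission({1, 3, 5}, cores)
print(action_one, sorted(usable_one))
print(action_many, sorted(usable_many))
```

Masking keeps the seven healthy cores of the first processor in the reliable resource pool, whereas the baseline would have removed all eight.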

Evaluation
We implement and evaluate Farron on our faulty processors, and measure Farron's efficiency and overhead.
Figure 11 shows the coverage of SDCs in one round of tests, which is defined as the ratio of detected errors to the total known errors in the faulty processor. As shown in the figure, the coverage of Farron is higher than the baseline. In terms of overhead, the average one-round regular test duration of Farron is 1.02 hours, whereas in the baseline, it is 10.55 hours. Both improvements stem from the prioritization strategy, which gives more resources to testcases that are likely to find errors.
Note that in some processors, there exist cases that are difficult to cover in one round of tests, since these errors need both high temperature and long-term testing. Farron mitigates them with temperature controls. We simulate workloads affected by these errors using our toolchain for hours and find that these workloads do not trigger SDCs with the protection of Farron. During the procedure, Farron's workload backoff was active for 0.864 seconds per hour on average, keeping the temperature under 59℃. Owing to the adaptive temperature boundary, the workload backoff strategy is triggered infrequently, resulting in minimal performance impact.
Table 4 presents the overhead of Farron and the baseline on different faulty processors. For Farron, the overhead includes the testing overhead and the temperature control overhead. The testing overhead is equal to the duration of one round of tests divided by three months, since regular tests are performed every three months. The temperature control overhead is equal to the backoff duration divided by the total duration of the simulation. The baseline only includes testing overhead. Note that for Farron, the testing overhead can vary across CPUs due to its adaptive choice of testcases to run and its adaptive balance of testing duration and temperature control threshold.

Related Work
Section 6 has discussed SDC testing, detection, and tolerance in detail, so this section discusses other related work.
SDC analysis. Prior works have studied silent errors caused mainly by radiation rays, which are transient and hard to capture [10-12, 44]. Some cloud service providers have noticed SDCs caused by faulty processors in recent years [46-48, 50-52]. For example, Meta provides a case study on an SDC debugging process in the production environment [46]. Our work is a systematic study on such processor-caused SDCs in the production environment. Besides CPUs, SDCs can be produced by other system components, such as disks, memory, and TPUs [10,33,44,54]. Silent errors can also be produced by software, like bugs in corner cases [47,53].
SDC triggering conditions and fault injection. Environmental factors, e.g., temperature and humidity, can influence the working of electronic devices [13,22,28,34,37,43], and we confirm that core temperature is one of the triggering conditions. Modern systems use fault tolerance techniques to prevent the impact of SDCs on applications [27,35,45,47,51]. To evaluate the reliability and performance of these systems, fault injection is widely used. Some injectors use neutron beams to create SDCs according to the irradiation model [5,44,54], and others use simulators or specific experimental devices to inject synthetic faults [4,10,32,36]. Our observations can help improve injector designs so as to better evaluate the solutions to SDCs in production environments.

Conclusion
In this research paper, we undertake a comprehensive investigation of CPU SDC phenomena with measurement and analysis in a large production environment. Our research involves multiple perspectives, including fleet maintenance, software symptoms, occurrence patterns and current practices on SDCs. We further present 12 observations, elucidating their implications on systems. Subsequently, we propose a concrete mitigation approach named Farron, which illustrates how to leverage our study to improve existing mitigation strategies.

Figure 2. The proportion of processors with a faulty feature.

Figure 3. Proportion of faulty processors with a certain affected operation datatype.

Figure 4. Bitflips and precision losses of data with different numerical types.

Figure 6. Proportion of SDCs with some bitflip patterns.

Figure 7. Proportion of the number of flipped bits in SDCs with bitflip patterns.

Figure 9. Relation between occurrence frequency (log scale) and minimum triggering temperature of different SDCs. Each point in the figure stands for an SDC setting.

Figure 11. Regular testing coverage for faulty processors.

Table 1. Failure rate of different test timings.

Table 2. Failure rate of different micro-architectures.

Table
pcore #err SDC type impacted workloads

Table 4. Farron overhead for different faulty processors.