A Systematic Literature Review on Hardware Reliability Assessment Methods for Deep Neural Networks

Artificial Intelligence (AI) and, in particular, Machine Learning (ML) are used in a wide range of applications because of their capability to learn how to solve complex problems. Over the last decade, rapid advances in ML have produced Deep Neural Networks (DNNs) consisting of a large number of neurons and layers. DNN Hardware Accelerators (DHAs) are leveraged to deploy DNNs in the target applications. Safety-critical applications, where hardware faults/errors would result in catastrophic consequences, also benefit from DHAs. Therefore, the reliability of DNNs is an essential subject of research. In recent years, several studies have assessed the reliability of DNNs, proposing various reliability assessment methods on a variety of platforms and applications. Hence, there is a need to summarize the state of the art and identify the gaps in the study of the reliability of DNNs. In this work, we conduct a Systematic Literature Review (SLR) on the reliability assessment methods of DNNs to collect as many relevant research works as possible, present a categorization of them, and address the open challenges. Through this SLR, three kinds of methods for reliability assessment of DNNs are identified: Fault Injection (FI), Analytical, and Hybrid methods. Since the majority of works assess DNN reliability by FI, we comprehensively characterize the different approaches and platforms of the FI method. Analytical and Hybrid methods are also presented. The identified methods are characterized by the DNN platforms they target and the reliability evaluation metrics they use. Finally, we highlight the advantages and disadvantages of the identified methods and address the open challenges in the research area.


INTRODUCTION
Deep Neural Networks (DNNs) are nowadays extensively applied to a wide variety of applications due to their impressive ability to approximate complex functions (e.g., classification and regression tasks) via learning. As powerful processing systems have evolved over the last decade, DNNs have become deeper and more efficient and are employed in an ever broader range of domains. Moreover, using DNN Hardware Accelerators (DHAs) in safety-critical applications, including autonomous driving, raises reliability concerns [1][2][3]. To comply with the ISO 26262 functional safety standard for road vehicles, the evaluated FIT (Failures In Time) rate of hardware components must be less than 10 (meaning fewer than 10 failures in 1 billion hours) to pass the highest reliability level [4], which requires diligent design.
DNNs are deployed in target applications using different DHA platforms, including Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Graphics Processing Units (GPUs), and multi-core processors [5]. Depending on the DHA platform and application environment, different fault types can pose a threat to the reliability of the component [6]. Fig. 1 illustrates the reliability threats (described in Section 2.3) in an example DHA. In this figure, different fault types originating from various sources may occur in any of the DHA's components, leading to a disastrous misclassification such as detecting a red light as a green light. Although faults are hardware-induced, they can also be modeled in software platforms for ease of study. Therefore, the reliability of DNNs is tightly coupled with the reliability of DHAs. It is worth highlighting that reliability in this paper does not encompass software engineering or security issues, such as adversarial attacks.
Fig. 1. Hardware-induced reliability threats in an example DHA and their possible impact on the output [1].
Several studies have shown that the accuracy of DNNs degrades remarkably in the presence of faults [7][8][9][10][11]. Recently, numerous research works have been published on the assessment and enhancement of DNNs' reliability. However, due to the breadth of the DNN domain, these works approach the problem of DNN reliability from various perspectives. DNNs serve many applications, and a variety of DNN algorithms exist for different tasks. This diversity leads to distinct platforms and reliability threats, which hinders unifying and generalizing the methods of reliability assessment and enhancement of DNNs.
Throughout the literature, various methods of DNN reliability assessment and enhancement are presented. Some review papers have been published on the topic of DNN reliability enhancement methods [5][6][12][13][14][15]. These works aim to formulate the reliability problem in DNNs, categorize the available reliability improvement methods in this domain, and give an overview of the fault injection methods for reliability assessment. The analysis in [15] is the first review on the subject of fault tolerance in DNNs and describes different fault models and reliability improvement methods in DNNs. However, the topic was not yet as mature as it is today, and numerous works have been published since. Subsequent works such as [5][6][12] provide extensive reviews of the reliability improvement methods for DNNs and characterize taxonomies of different methods. Nevertheless, they do not consider the methods for assessing and evaluating the reliability of DNNs. Other surveys [13][14] have reviewed fault injection methods for DNN reliability assessment; the former focuses merely on fault criticality assessment, and the latter includes only a few papers. In this paper, we present the first Systematic Literature Review (SLR) dedicated to all methods of reliability assessment of DNNs.
Manuscript submitted to ACM Journal
Reliability assessment of DNNs is the process of evaluating the reliability of a DNN that is executed either as a software model or on a hardware platform. However, the assessment method may vary depending on the platform. It is therefore necessary to comprehend and distinguish the different methods used to assess the reliability of DNNs across platforms. This paper establishes a thorough picture of the reliability assessment methods for DNNs and systematically reviews the relevant literature. To achieve this, we follow the SLR methodology [16][17]. The primary focus of this review is to investigate the methods of reliability assessment for DNNs, generalize and characterize the methods, and identify the open challenges in the domain.
To the best of our knowledge, this survey represents the first comprehensive literature review on reliability assessment methods for DNNs. We cover all published papers from 2017 to 2022 that could be found through a systematic search. The main contributions of this paper are:
• Reviewing the literature on the reliability assessment methods of DNNs systematically;
• Analyzing the trends of published papers over different years and methods;
• Characterizing and categorizing the reliability assessment methods for DNNs;
• Identifying fault injection methods based on the DNN platforms;
• Introducing analytical and hybrid reliability assessment methods along with fault injection;
• Addressing the open challenges in the research area and giving recommendations for future research directions.
The structure of the paper is as follows. Section 2 presents the background on DNNs and reliability concepts. Section 3 explains the methodology of this survey and states the research questions. Section 4 gives an overview of the study, presents the statistics of the publications, and depicts the top-level taxonomy of reliability assessment methods for DNNs. Section 5 explains the details of the reliability assessment methods. Section 6 discusses the pros and cons of the methods and the open challenges of the study domain, and Section 7 concludes this survey.

Deep Neural Networks
Deep Learning (DL) is a sub-domain of Machine Learning (ML), which is the study of making computers learn to solve problems without being directly programmed [18]. Owing to the impressive learning ability of DNNs, they are applicable in a vast variety of domains such as image and video processing, data mining, robotics, autonomous cars, and gaming.
DNNs are inspired by the human brain and have two major phases: training and inference. In the training phase, which is an iterative process performed once, the parameters of the neural network (e.g., weights and biases) are updated on a given dataset. A loss function that measures the difference between the expected and the estimated output of the DNN is minimized during training to achieve higher accuracy. Accuracy expresses the proportion of DNN outputs that coincide with the expected output. In the inference phase, on the other hand, which represents the DNN deployment, the network is run repeatedly with the parameters obtained during the training phase [18].
DNNs are constructed of units called neurons. Each neuron receives some activation inputs and multiplies them by the corresponding weights. Then, it conveys the summation of the weighted activations to its output. A set of neurons builds up a layer that may have additional functions, e.g., an activation function (ReLU, sigmoid, etc.), batch normalization, (max or average) pooling, etc. [18]. Equation (1) represents the function of the i-th neuron in layer l (denoted as $a_i^{l}$) with input activations from the previous layer l-1 with n outputs (denoted as $a^{l-1}$), where W and b represent weights and bias, respectively, and $\varphi$ is the activation function.

$$a_i^{l} = \varphi\left(\sum_{j=1}^{n} W_{ij}^{l}\, a_j^{l-1} + b_i^{l}\right) \qquad (1)$$
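The neuron function described above (a weighted sum of the input activations plus a bias, passed through an activation function) can be illustrated with a minimal Python sketch. The function names `neuron_output` and `layer_output` and the choice of ReLU as the default activation are illustrative assumptions, not taken from the reviewed works.

```python
def relu(x):
    """Rectified linear unit, one of the activation functions named in the text."""
    return max(0.0, x)


def neuron_output(weights, bias, activations, phi=relu):
    """Weighted sum of the input activations plus bias, passed through phi."""
    if len(weights) != len(activations):
        raise ValueError("one weight per input activation is required")
    weighted_sum = sum(w * a for w, a in zip(weights, activations))
    return phi(weighted_sum + bias)


def layer_output(weight_rows, biases, activations, phi=relu):
    """A layer is a set of neurons that all read the same input activations."""
    return [
        neuron_output(row, b, activations, phi)
        for row, b in zip(weight_rows, biases)
    ]
```

Calling `layer_output` once per layer, feeding each result into the next call, would give the forward pass of a small fully-connected network.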
An abstract view of a neuron and a neural network is depicted in Fig. 2. As shown, inputs are fed into the network through the input layer. The middle layers, called hidden layers, determine the depth of the network and carry out the function of the DNN. The output layer is where the network makes its decision. It produces probabilities of the possible outputs, i.e., output confidence scores, and the class with the highest value is the top-ranked output.
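As a hedged illustration of confidence scores and the top-ranked output, the sketch below turns raw output-layer values into probabilities and picks the highest one. Using softmax for this step is an assumption made for the example (the text only states that the output layer produces probabilities), and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw output-layer values into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_ranked(scores):
    """Return (class index, confidence score) of the highest-scoring class."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]
```

The confidence score returned by `top_ranked` is the quantity that some fault classification schemes discussed later compare between a faulty run and the golden run.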


Fig. 2. Abstract view of a simple neural network with the detail of a neuron
DNNs have various architectures, each suitable for specific applications. Some terms used throughout this paper are worth introducing here. Convolutional Neural Networks (CNNs) are extensively used in classification, object detection, and semantic segmentation tasks and consist of multiple convolutional (CONV) and fully-connected (FC) layers. CONV layers have a set of two-dimensional (2D) weights, called filters, that extract a specific feature from the input of the layer. A channel is a set of input feature maps (ifmap) that is convolved with filters, resulting in the output feature maps (ofmap) [18].
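To make the relation between ifmaps, filters, and ofmaps concrete, the following minimal sketch computes one output feature map from a set of input channels. It assumes stride 1 and no padding, and the function names are hypothetical; real CONV layers add bias, stride, padding, and many filters.

```python
def conv2d(ifmap, kernel):
    """Slide a 2D kernel over a 2D ifmap (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(ifmap) - kh + 1
    out_w = len(ifmap[0]) - kw + 1
    return [
        [
            sum(
                ifmap[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv_output_map(ifmaps, filter_kernels):
    """One ofmap: element-wise sum of the per-channel convolutions."""
    per_channel = [conv2d(x, k) for x, k in zip(ifmaps, filter_kernels)]
    rows, cols = len(per_channel[0]), len(per_channel[0][0])
    return [
        [sum(m[i][j] for m in per_channel) for j in range(cols)]
        for i in range(rows)
    ]
```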
In addition, given the large number of parameters and calculations in DNNs, Quantized Neural Networks (QNNs) [30] and Binarized Neural Networks (BNNs) [31] have been introduced to reduce the complexity, memory usage, and energy consumption of DNNs. These networks are quantized versions of existing DNNs that reduce the bit-width of the DNN parameters and calculations with an acceptable accuracy loss.
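The idea of reducing the bit-width of parameters can be sketched as follows. The example assumes uniform symmetric quantization with a single scale factor and sign-based binarization; these are common choices picked for illustration and are not necessarily the schemes used in [30] or [31].

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers of the given bit-width with one scale."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


def binarize(values):
    """Binarized weights: keep only the sign (+1 or -1)."""
    return [1 if v >= 0 else -1 for v in values]
```

The difference between the original values and `dequantize(...)` is the quantization error, which is the source of the accuracy loss mentioned above.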

DNN Platforms
Software Frameworks.
DNN software frameworks and libraries in high-level programming languages have been developed to ease the process of designing, training, and testing DNNs. These frameworks are widely used due to their high abstraction level of modeling and short design time. Some well-known software frameworks used for training DNNs are TensorFlow [32], Keras [33], PyTorch [34], DarkNet [35], and Tiny-DNN [36]. All these frameworks are capable of using both CPU and GPU to accelerate the training process.

DNN Hardware Accelerators (DHAs).
DHAs are used for the training as well as the inference phase of DNNs. They are called accelerators because their dedicated design employs parallelism to reduce the execution time of the DNN, either in training or inference. DHAs can generally be categorized into four classes: FPGAs, ASICs, GPUs, and multi-core processors [37][38].
According to the literature review of DHAs in [38], FPGAs are used more frequently than other DHA platforms for implementing DNNs, owing to their availability and design flexibility for different applications [39]. FPGAs are programmed via their configuration bits, which determine the functionality of the FPGA. A system with an FPGA-based DNN accelerator usually consists of a host CPU and an FPGA part with corresponding interconnections between them. In this design model, the DNN is implemented on the FPGA part and the CPU controls the accelerator with software, while each part is integrated with memories [39]. A typical structure of an FPGA-based DNN accelerator is depicted in Fig. 3. It is based on HW/SW co-design, which means splitting the implementation of DNNs between the integrated CPU (the software) and the FPGA (the hardware), which communicate with one another [40]. High-Level Synthesis (HLS) tools, which can synthesize high-level programming languages to RTL, are also used for developing FPGA-based DNN accelerators [39].
ASIC-based DNN accelerators are more efficient than FPGAs in terms of performance and power consumption but are less flexible in terms of applications and require a long design time [41]. There are two general types of architectures for ASIC-based DHA platforms: spatial and temporal [18]. Fig. 4 depicts an example of a spatial architecture model that is constructed of 2D arrays of Processing Elements (PEs), with data flowing horizontally and vertically from input/weight buffers to output buffers. PEs perform Multiply-Accumulate (MAC) operations on inputs and weights, representing a neuron operation in the DNN. Off-chip memories are required to store the parameters of DNNs and save the intermediate results from PEs. The Tensor Processing Unit (TPU) produced by Google, one of the most widely applied ASIC-based DNN accelerators, is based on this type of architecture [42].
Fig. 4. An example of spatial architecture for ASIC-based DNN accelerators [43]
GPUs are a powerful platform for training and inference of deep networks and are widely used in safety-critical applications [3]. GPUs include up to thousands of parallel cores, which makes them efficient for DNN algorithms, especially in the training phase [41]. GPUs are designed to run several threads of a program and are also exploited to accelerate running DNNs [38]. The general architecture of GPUs is depicted in Fig. 5. There are numerous Streaming Multiprocessors (SMs) in the GPU, each having several cores with a shared register file and caches, while a scheduler and dispatchers control the tasks among and within SMs and cores [44].
Multi-core processors, e.g., ARM processors, deploy DNNs mostly for edge processing and Internet of Things (IoT) applications [45][46][47]. They support DNNs with parallel computing and low power consumption and open up a wider range of applications for DNNs.

Reliability, Threats, Fault Models, and Evaluation
The terms robustness, reliability, and resilience are frequently used in the research pertaining to the reliability of DNNs. These terms are often used interchangeably and ambiguously. In the following, we present the definitions of these three terms as applied in the current literature review:
• Reliability concerns DNN accelerators' ability to perform correctly in the presence of faults, which may occur during deployment and are caused by physical effects either from the environment (e.g., soft errors, electromagnetic effects) or from within the device (e.g., manufacturing defects, aging effects, process variations).
• Robustness refers to the property of DNNs expressing that the network is able to continue functioning with high integrity despite the alteration of inputs or parameters due to noise or malicious intent.
• Resilience is the feature of a DNN to tolerate faults in terms of output accuracy.
In this work, we are concerned with the reliability of DNNs, which refers to the ability of accelerators to continue functioning correctly over a specified period of time in the presence of faults. Reliability in this paper does not relate to reliability and testing in software engineering or to security issues, e.g., adversarial attacks in which an attacker perturbs the inputs or parameters.
Faults threaten the reliability of DNN accelerators (see Fig. 1) and can have several causes, e.g., soft errors, aging, and process variation [1]. Soft errors are transient faults induced by radiation, caused by charged particles striking transistors [48]. Aging is the time-dependent increase of the threshold voltage of transistors due to physical phenomena, which leads to timing errors and permanent faults [49]. Process variations are alterations of transistors' attributes during chip fabrication that may cause faults under voltage scaling [50].
Faults as reliability threats are generally modeled as permanent and transient faults [6][12][15]. Permanent faults result from process variations, manufacturing defects, aging, etc., and they stay constant and stable during run-time. Transient faults, on the other hand, are caused by soft errors, electromagnetic effects, voltage and temperature variations, etc., and they show up for a short period of time. Once a faulty value from a component is read by another component and the propagated value does not coincide with the expected one, an error happens. Therefore, a fault is an erroneous state of hardware or software, and an error is a manifestation of it at the output. Failure or system malfunction is the corruption or abnormal operation of the system, which is caused by errors [15][51][52].
Faults may have different impacts on the output of DNNs and can be classified based on their effects. A fault may be masked, or corrected if detected, or it may result in an output that differs from the fault-free execution (golden model), in which case the fault is propagated and observed at the output. Faults observed at the output of the system can be classified into two categories: Silent Data Corruption (SDC) and Detected Unrecoverable Errors (DUE), depending on whether a fault is undetected (SDC) or detected (DUE) [12][53]. Fig. 6 illustrates this general fault classification scheme regarding the output of systems, adopted from [51].
Reliability assessment is the process in which the target system or platform is modeled or presented, and by means of simulations, experiments, or analysis, the reliability is measured and evaluated. Reliability assessment is a challenging process, and several methods can be adopted for modeling and evaluating reliability. In general, evaluating the reliability of a system can be performed by three approaches: Fault Injection (FI) methods, analytical methods, and hybrid methods [54]. FI methods inject a model of faults into the system, implemented either in software or hardware, while the system is in simulation or being executed. Analytical methods attempt to model the function of the system and its reliability with mathematical equations depending on the target architecture. In hybrid methods, an analytical model is adopted alongside FI to evaluate the reliability. Generally, FI methods are more realistic than analytical and hybrid methods; however, FI is a time-consuming process with a high computational complexity [55].
Fig. 6. The adopted fault classification based on the output point of view, as in [51]
In reliability assessment using FI, it is necessary to determine the target platform, the potential fault locations (logic or memory), and the fault type (transient or permanent). Transient faults in logic show up for one clock cycle, while in memory they flip a bit that remains flipped until the end of the execution. Permanent faults are modeled as stuck-at-0 (sa-0) or stuck-at-1 (sa-1), and they exist during the whole execution. According to the selected fault model, the model is perturbed, the system is run, and the outputs are gathered. The output of the faulty execution is then compared with that of the golden model to measure the impact of faults on the system. FI allows calculating reliability metrics, e.g., Failures-In-Time (FIT), Architectural Vulnerability Factor (AVF), SDC rate, Soft Error Rate (SER), cross-section, etc. FIT is the number of failures in 10^9 hours, AVF is the probability of fault propagation from a component to other components in a design, SDC rate refers to the ratio of the outputs affected by faults, SER refers to the rate of soft error occurrence, and cross-section is the proportion of observed errors over all collided particles. These quantitative evaluation metrics are usually tightly coupled to each other, yet each serves a different purpose in expressing the reliability of a system.
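The two fault models described above can be sketched at the bit level. The example below assumes that a value is stored as an IEEE 754 single-precision (32-bit) float and uses hypothetical helper names; actual FI tools may target other data types (e.g., fixed-point or quantized integers) and other locations.

```python
import struct


def float_to_bits(x):
    """Return the 32-bit pattern of x stored as a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits):
    """Interpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def flip_bit(x, bit):
    """Transient fault in memory: invert one bit of the stored value."""
    return bits_to_float(float_to_bits(x) ^ (1 << bit))


def stuck_at(x, bit, value):
    """Permanent fault: force one bit to 0 (sa-0) or 1 (sa-1)."""
    bits = float_to_bits(x)
    if value:
        bits |= 1 << bit
    else:
        bits &= ~(1 << bit)
    return bits_to_float(bits)
```

A bit flip always changes the stored value, whereas a stuck-at fault only has an effect when the fault-free bit differs from the stuck value. This is one reason why the same location can be masked under one fault model and visible under another.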
Exhaustive fault injection into all bits of a platform at every clock cycle requires an extensive simulation. Therefore, to determine how many faults should be injected into the system for the results to be statistically representative, a confidence level with an error margin is specified [56]. It provides a fault rate or Bit Error Rate (BER) for an FI experiment. The number of repetitions of the FI experiments needed to cover the required number of injected faults, relative to the number of possible bit and clock-cycle combinations, determines the execution space of the FI task.
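One widely used way to turn a confidence level and an error margin into a number of faults to inject is the finite-population sample-size formula n = N / (1 + e^2 (N - 1) / (t^2 p (1 - p))), where N is the number of possible faults, e the error margin, t the normal quantile for the confidence level, and p the estimated failure probability (0.5 as a worst case). Whether [56] uses exactly this form is an assumption here; the sketch is only meant to show the order of magnitude involved.

```python
import math
from statistics import NormalDist


def fi_sample_size(population, error_margin=0.01, confidence=0.95, p=0.5):
    """Number of faults to inject for a statistically representative campaign.

    population: number of possible (bit, clock cycle) fault combinations.
    error_margin: tolerated error on the estimated proportion (0.01 = 1%).
    confidence: two-sided confidence level.
    p: assumed probability that a fault causes a failure (0.5 is worst case).
    """
    t = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    denominator = 1.0 + error_margin ** 2 * (population - 1) / (t ** 2 * p * (1.0 - p))
    return math.ceil(population / denominator)
```

Under these assumptions, the required number of injections saturates as the fault space grows, which is why statistical FI remains tractable even for very large designs.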

REVIEW METHODOLOGY
Systematic Literature Review (SLR) is a standard methodology for reviewing the literature in a recursive process and minimizing bias in the study [16][17][38]. Hence, the SLR methodology is adopted in this survey. The methodology comprises the following steps:
• Specifying the Research Questions (RQs),
• Specifying the search method for finding and filtering the related papers,
• Extracting corresponding data from the found papers based on the RQs,
• Synthesizing and analyzing the extracted data.
Based on the aforementioned steps of SLR, the RQs that we attempt to answer are:
• RQ1: What is the distribution of the research works in the domain of reliability assessment? (To obtain the trend of publications in this domain).
• RQ2: What are the existing methods of reliability assessment for DNNs? (To comprehend the entire variety of methods in this domain).
• RQ3: How could the existing methods be characterized and categorized in terms of reliability assessment methods? (To categorize existing works and provide a taxonomy that serves as systematic guidance for finding a suitable method for potential applications in this domain).
• RQ4: What are the open challenges in the domain of reliability assessment methods for DNNs? (To specify the remaining areas for future research).
The motivation for this survey is the large number of recent papers published on the reliability of DNNs, which emphasizes the need for such a literature review. We have searched for the papers systematically through scientific search engines. The main databases and publishers we have used are Google Scholar, IEEE Xplore, ACM Digital Library, ScienceDirect, and Elsevier. The initial set of papers is obtained by searching for keywords in the mentioned databases, including "reliability of DNNs", "hardware reliability of DNN accelerators", "resilient DNNs", "robust DNNs", "the vulnerability of DNNs", "soft errors in DNNs", and "fault injection in DNNs" ("DNN" also replaced with "CNN").
Subsequently, we select papers based on their title and abstract. The selection criterion is whether the paper may concern the reliability of DNNs. In addition, the references and citations of the chosen papers have been checked to find more related papers. In this process, we selected 242 papers based on their title and abstract.
In the next step, we study the introduction, conclusion, and methodology sections of each paper to decide whether to include the paper in the review. The inclusion criteria are:
• The paper is published by one of the scientific publishers and has passed through a peer-review process,
• The focus of the work is DNNs, neither generic reliability assessment methods using DNNs as one of the examples nor employing DNNs for assessing the reliability of a platform,
• The work includes a reliability assessment method for DNNs,
• The method of reliability assessment is clear and well-defined,
• Terms including reliability, robustness, resilience, or vulnerability must refer clearly to reliability issues, as defined in subsection 2.3.
Papers that include similar keywords but do not match the above conditions are excluded. As a result, we have included 139 papers published from 2017 to the end of 2022 in this literature review to build up the taxonomy of the literature review and the categorization of methods.
Next, we have designed a Data Extraction Form (DEF) based on the RQs. Using this form, we review the papers to extract specific data such as:
• General method of reliability modeling (FI, analytical, or hybrid),
• The platform where DNNs are implemented,
• The fault model and fault locations in case of FI,
• Details of the reliability assessment method,
• Reliability evaluation metrics.
In the final step, after reviewing all the selected papers and filling in the DEF, we synthesized and analyzed the data obtained from the papers. Thereafter, we provided the categorization taxonomy of the reliability assessment methods for DNNs, characterized the methods in this paper, and analyzed them to find the open challenges.

STUDY OVERVIEW
This section presents an overview of the study and the analyzed statistics of the included works in different categories. As mentioned, we have included 139 papers from 2017 to 2022 for categorizing the reliability assessment methods for DNNs.

Taxonomy
Fig. 7 represents the top-level categorization overview of the study to address RQ2 and RQ3. Reliability assessment methods for DNNs are categorized into three main groups: Fault Injection, Analytical, and Hybrid.

Fault Injection (FI) Methods.
The works based on this method evaluate the reliability of DNNs by a fault injection campaign. There exist several taxonomies for the fault injection approaches in the hardware reliability domain [13][54][55][57][58]. Therefore, we adapt them for categorizing the related works on DNNs into three approaches, addressed in Fig. 7 and Table 1. FI methods are categorized into three approaches of fault injection as follows:
• Fault Simulation: DNNs are implemented either in software by high-level programming languages or in Hardware Description Languages (HDL), and faults are injected into the model of the DNN. In the former case, some works consider a DHA model in their software implementations while others do not. We divide works on this approach into hardware-independent, hardware-aware, and RTL model platforms. RTL models represent ASIC-based DHAs.
• Emulation in Hardware: Research works on this approach implement and run DNNs on a DHA (i.e., FPGA, GPU, or processor) and inject the faults into the components of the accelerator by a software function, FI framework, etc.
• Irradiation: The DNN is implemented on a DHA (i.e., FPGA, GPU, or TPU) placed under an irradiating facility that directs beams onto it.
Most of the works on DNNs' reliability assessment use FI methods. Therefore, we characterize the three approaches of FI methods in Table 1. In each approach of FI methods, the works are distinguished based on DNN platforms. Furthermore, in each category, we elaborate on how the works determine the fault types and locations and evaluate the reliability by metrics. The details will be discussed in subsection 5.1.

Analytical Methods.
Works relying on an analytical method for estimating DNNs' reliability attempt to determine how the parameters and neurons of a DNN affect the output based on the connections of neurons and layers. Therefore, they analyze the structure of DNNs and provide a model for the impact of faults on the outputs to find the more critical and sensitive components in the DNN. Hence, they can evaluate the reliability of DNNs by means of vulnerability analysis and eliminate the complexity of simulating/emulating the faults in reliability assessment.

Hybrid Methods.
Both fault injection and analytical methods are used in this category of works to take advantage of both. In this regard, analytical methods can provide mathematical models in addition to a straightforward fault injection into the system for reliability evaluation, so that the metrics of reliability evaluation can be obtained with less complexity than extensive FI experiments and more realistically than with purely analytical methods.

Research Trends
To address RQ1, we present the main statistics on the papers included in this study. Fig. 8 shows the distribution of the 139 included papers published over the years 2017-2022. The chart in Fig. 8 shows that research on the topic of DNNs' reliability started in 2017, and in the following years it drew increasingly more attention and turned into an active topic of study.
Fig. 9 illustrates the number of papers based on different reliability assessment methods among all identified works in this literature review. It can be observed that the majority of works use fault injection to assess the reliability of DNNs, while only 10% of the works consider analytical (11 works) and hybrid analytical/FI (3 works) methods. In this regard, we present Fig. 10 to illustrate the distribution of works using FI over different approaches and DNN platforms. It shows that most of the works belong to the hardware-independent platform of the fault simulation approach. Moreover, in the emulation in hardware approach, most of the works are done on the GPU platform. Hence, the figures present the trend of the research domain and the distribution of works over different methods and approaches, pointing to areas where there is still room for future research.

CHARACTERIZATION
In this section, details of reliability assessment methods for DNNs are presented based on the categorizations in Fig. 7 and Table 1. We start from FI methods, which include the majority of works. Then, analytical and hybrid methods will be discussed.

Fault Injection Methods
In FI methods of reliability assessment, once the DNN platform and fault model are determined, perturbation and system execution are performed, and the reliability is evaluated. Following the categorization in Table 1, the identified approaches of FI methods for DNN reliability assessment are presented separately in this subsection. Since FI is the most frequently used method in the reliability assessment of DNNs, a variety of evaluation metrics have been presented. To elaborate on and distinguish the different evaluation metrics, we present them separately for the different approaches and platforms.
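The general FI flow described above (fix the fault model, perturb, execute, evaluate against the golden run) can be sketched as a small campaign loop. Everything in this example is a hypothetical stand-in: the "network" is a toy linear classifier, the fault model is a single random bit flip in a 32-bit float weight per experiment, and the metric is accuracy loss with respect to the fault-free run.

```python
import random
import struct


def flip_bit(x, bit):
    """Invert one bit of x stored as an IEEE 754 single-precision float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] ^ (1 << bit)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def predict(weights, sample):
    """Toy classifier: one weight row per class, highest dot product wins."""
    scores = [sum(w * x for w, x in zip(row, sample)) for row in weights]
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(weights, dataset):
    """Fraction of (sample, label) pairs classified correctly."""
    return sum(predict(weights, x) == y for x, y in dataset) / len(dataset)


def fi_campaign(weights, dataset, n_faults, seed=0):
    """Inject one random bit flip per experiment and record accuracy loss."""
    rng = random.Random(seed)
    golden_acc = accuracy(weights, dataset)  # fault-free reference
    losses = []
    for _ in range(n_faults):
        faulty = [row[:] for row in weights]  # fresh copy per experiment
        r = rng.randrange(len(faulty))
        c = rng.randrange(len(faulty[r]))
        faulty[r][c] = flip_bit(faulty[r][c], rng.randrange(32))
        losses.append(golden_acc - accuracy(faulty, dataset))
    return golden_acc, losses
```

Each experiment starts from a fresh copy of the golden weights, so faults do not accumulate across experiments; a study of multiple simultaneous faults would need a different loop.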

Fault Simulation.
In this subsection, the works assessing the reliability of DNNs by FI with a fault simulation approach are described. This approach covers three platforms, i.e., hardware-independent, hardware-aware, and RTL models, which are explained in the corresponding subsections.
[…]), TensorFlow (used in [66][79]), Caffe (used in [77]), DarkNet (used in [73][140][142]). Implementing the DNN in software provides a flexible environment for studying the effect of various fault models. As shown in the corresponding branch of Table 1, both transient and permanent faults are studied on this platform. However, most of the works studied transient faults (soft errors, SEU, MBU, etc.).
To model faults at the software level, the fault model is determined differently depending on the fault type and the general aspects of DHAs. In this regard, modeling and injecting permanent faults is straightforward. They are active throughout the entire execution and set the value of a bit or variable (in weights or activations) to 0 or 1, as experimented in [72] […] [143] report accuracy loss under fault campaign experiments. They compare the accuracy of the faulty network with the accuracy of the fault-free network on the same test set. Some works classify the injected faults by comparing the outputs of the faulty network with the golden model output. References [140][142][143] inject one permanent fault per experiment and classify the faults into three classes:
• Masked: No difference between the outputs of the faulty network and the golden model.
• Observed-Safe: The output of the faulty network differs from that of the golden model, while the confidence score of the top-ranked element is reduced by less than 5% with respect to the golden one.
• Observed-Unsafe: The output of the faulty network differs from that of the golden model, while the confidence score of the top-ranked element is reduced by more than 5% with respect to the golden one.
Moreover, in [65][67] transient faults are injected into the encrypted weights of a network and are classified based on the effect of the faults on the execution of the program and its results, as:
• Silent or safe: Similar to "masked" mentioned above in [140][142].
• SDC: Only affects the output results of the network.
• Detected as a software exception: Affects the execution of the program and stops it.
• Detected by padding check action: Corrupts the ciphertext.
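The three-class scheme of [140][142][143] (Masked, Observed-Safe, Observed-Unsafe) can be illustrated with a minimal sketch. The function name `classify_fault`, the list-based score representation, and the reading of "reduced by 5%" as a relative reduction are assumptions made for illustration; they are not taken from the cited works.

```python
def classify_fault(golden_scores, faulty_scores, threshold=0.05):
    """Label one injected fault by comparing faulty and golden output scores.

    Assumption: the 5% threshold is applied to the relative reduction of the
    golden top-ranked confidence score; a drop of exactly 5% counts as unsafe.
    """
    if list(golden_scores) == list(faulty_scores):
        return "masked"
    # Index of the golden model's top-ranked element.
    top = max(range(len(golden_scores)), key=lambda i: golden_scores[i])
    # Relative reduction of that element's confidence in the faulty run.
    drop = (golden_scores[top] - faulty_scores[top]) / golden_scores[top]
    return "observed-safe" if drop < threshold else "observed-unsafe"


golden = [0.05, 0.90, 0.05]
print(classify_fault(golden, [0.05, 0.90, 0.05]))  # identical outputs
print(classify_fault(golden, [0.06, 0.88, 0.06]))  # small confidence drop
print(classify_fault(golden, [0.50, 0.40, 0.10]))  # large confidence drop
```

Running one such comparison per injected fault and counting the labels gives the class distribution reported by a fault campaign.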

Burel et al. [64] have adopted the fault classification scheme for semantic segmentation applications in which DNNs label each pixel of an input image according to a set of known classes. The corresponding classes are:
• Masked: Similar to "masked" mentioned above.
• No Impact SDC: No labels of pixels are modified.
• Tolerable SDC: Labels of less than 1% of pixels are modified and no class is removed/added due to the fault.
• Critical SDC: Labels of more than 1% of pixels are modified or any class is removed/added due to the fault.
A specific way of fault evaluation based on fault classification is to consider only the faults which affect the output as SDC, since they are critical. References [66][70] evaluate the network based on the proportion of faults that affect the output classification results, reported as the SDC rate. Therefore, the reliability of a network can be evaluated by classifying faults based on their effect on the outputs, whether by a change in the output results, by a threshold of accuracy loss, or by system exceptions. This way of evaluation assists in understanding how faults propagate and affect the network.
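As an illustration of the segmentation-oriented classes above and of the SDC rate, the following sketch labels a fault from flattened per-pixel label maps. The function names, the `outputs_identical` flag (standing for a bit-exact comparison of raw network outputs) and the list representation are hypothetical choices for this example.

```python
def classify_segmentation_fault(golden_labels, faulty_labels,
                                outputs_identical, pixel_threshold=0.01):
    """Classify one fault from the per-pixel labels of golden and faulty runs."""
    if outputs_identical:
        return "masked"
    changed = sum(g != f for g, f in zip(golden_labels, faulty_labels))
    if changed == 0:
        # Raw outputs differ, but no pixel label was modified.
        return "no-impact-sdc"
    # A class is added or removed if the sets of present classes differ.
    classes_changed = set(golden_labels) != set(faulty_labels)
    if changed / len(golden_labels) < pixel_threshold and not classes_changed:
        return "tolerable-sdc"
    return "critical-sdc"


def sdc_rate(outcomes):
    """Proportion of injected faults that were not masked."""
    return sum(o != "masked" for o in outcomes) / len(outcomes)


golden = [0] * 150 + [1] * 50          # 200 pixels, two classes
one_pixel = [1] + golden[1:]           # 0.5% of pixels modified
ten_pixels = [1] * 10 + golden[10:]    # 5% of pixels modified
print(classify_segmentation_fault(golden, one_pixel, False))
print(classify_segmentation_fault(golden, ten_pixels, False))
```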
Software FI Tools: Some fault injectors are presented as tools that support the reliability study of DNNs with different fault models in DNN software frameworks. PyTorchFI [164], TensorFI [165][166] [167] and its extension TensorFI+ [168][169], and Ares [170] inject faults into DNNs which are implemented in PyTorch, TensorFlow, and Keras, respectively. All of these open-source frameworks can inject both permanent and transient faults into weights as well as activations with specified error rates; hence, the accuracy loss can be evaluated. TensorFI additionally provides the SDC rate. These frameworks are used in the reliability studies of DNNs, e.g., PyTorchFI in [60][70], TensorFI in [66], and Ares in [80].
Moreover, to enhance the efficiency of the aforementioned tools, additional fault injectors have been introduced. One such injector, known as BinFI [171], is an extension of TensorFI that aims to identify critical bits in DNNs. Another fault injector, namely LLTFI [172], is proposed to inject transient faults into specific instructions of DNN models in either PyTorch or TensorFlow and has been found to be faster than TensorFI. Additionally, a checkpoint-based fault injector is proposed in [173] that enables studying the impact of SDCs independently of the DNN implementation framework.
5.1.1.2 Hardware-Aware Platform. This platform includes works that consider an abstract model of the accelerator in their software implementation of DNNs. They implement the network in DNN software frameworks as well as in high-level programming languages. Therefore, they take advantage of simulation in software fault injection while also applying the reliability assessment to the abstract model of the accelerator.
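To make the notion of a transient fault injected "with a specified error rate" concrete, the sketch below flips bits in the IEEE-754 single-precision encoding of weights. It is a framework-independent illustration using only the Python standard library; it does not reproduce the API of PyTorchFI, TensorFI, or Ares, and the function names are hypothetical.

```python
import random
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = LSB, 31 = sign) of a float32 value."""
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", word ^ (1 << bit)))
    return faulty


def inject_transient_faults(weights, bit_error_rate, rng):
    """Return a copy of weights in which each bit flips with probability BER."""
    faulty = []
    for w in weights:
        for bit in range(32):
            if rng.random() < bit_error_rate:
                w = flip_bit(w, bit)
        faulty.append(w)
    return faulty


print(flip_bit(1.0, 31))  # sign bit flipped
print(flip_bit(1.0, 30))  # most significant exponent bit flipped
print(inject_transient_faults([0.5, -0.25, 1.5], 1e-2, random.Random(0)))
```

The two single-bit examples show why bit position matters: a flip in the sign bit only negates the weight, whereas a flip in the top exponent bit of 1.0 yields infinity.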
References [83][87] implement a DNN in Tiny-DNN and map it to the RTL implementation of the accelerator. They study the effect of transient faults in memory and datapath accurately. In these studies, FI is performed in software while all of its parameters are integrated with the corresponding hardware components. Authors in [88] implement the DNN and the fault injector in software inspired by an FPGA-based DNN accelerator. Moreover, in [10][91] the DNN and FI are implemented in Keras, and the architecture of a systolic array accelerator is considered for a fault-tolerant design. Similarly, authors in [85] and [86] evaluate their proposed reliability improvement technique on memories in TensorFlow while injecting transient faults into the weights. PyTorch is used in [89] [90] to implement the DNN, and transient faults are injected into activations (datapath or MAC units) and weights (memory) according to the systolic array accelerator model. Reference [84] also uses PyTorch and a custom framework called TorchFI to inject faults into the outputs of CONV and FC layers of the network.
The effect of permanent faults at PEs' outputs is studied in [7][144], where the model of the accelerator is adopted from implementing the DNN in the N2D2 framework [174]. Furthermore, authors in [145] [149] use PyTorch and study permanent faults in MAC units of an accelerator while training to improve the reliability at inference. Authors in [148] have developed a Keras-based accelerator simulator to study the effect of permanent faults on the on-chip memory of accelerators by injecting permanent faults into fmaps and weights. A weight remapping strategy in memory to decrease the effect of permanent faults is evaluated in [146] using Ares. SCALE-Sim [175], a systolic CNN accelerator simulator, is adopted in [150] to study permanent faults in PEs and computing arrays in systolic array-based accelerators.
Similar to the Hardware-Independent platform, faults are injected based on BER, or fault rate, and experiments are repeated to reach a 95% confidence level and a 1% error margin [10][87] [91].
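The number of experiments needed for a given confidence level and error margin can be estimated with the finite-population sample-size formula that is common in statistical fault injection. The sketch below assumes this formula; the surveyed works do not necessarily use exactly this calculation, and the function name is hypothetical. Here `t` = 1.96 corresponds to a 95% confidence level and `p` = 0.5 is the conservative choice when the failure probability is unknown.

```python
import math


def fi_sample_size(population, error_margin=0.01, t=1.96, p=0.5):
    """Number of faults to inject out of `population` possible faults."""
    n = population / (
        1 + error_margin ** 2 * (population - 1) / (t ** 2 * p * (1 - p))
    )
    return math.ceil(n)


# A fault space of 10^12 (bit, cycle) pairs versus a small one of 1000 faults.
print(fi_sample_size(10 ** 12))
print(fi_sample_size(1000))
```

For very large fault spaces the required number of experiments saturates near t²·p·(1−p)/e², which is why campaigns of roughly ten thousand injections are common for a 1% margin at 95% confidence.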
Evaluation: Nearly all works in this class evaluate the DNN by accuracy loss after fault injection [7] [150]. References [83] and [85] evaluate the reliability by SDC rate, i.e., the proportion of faults that caused misclassification in comparison with the golden model. In addition, authors in [87] differentiate SDCs of injected transient faults into defined classes and calculate the FIT of the accelerator (accel) from its components (comp) with (2):

$$FIT_{accel} = \sum_{comp} FIT_{raw} \times Size_{comp} \times SDC_{comp} \qquad (2)$$

in which $FIT_{raw}$ is provided by the manufacturer, $Size_{comp}$ is the total number of the component bits, and $SDC_{comp}$ is obtained by FI.
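The component-wise FIT calculation described above (raw FIT per bit from the manufacturer, component size in bits, SDC probability from FI) can be sketched as follows. The component names and all numeric values are made up for illustration and do not come from [87].

```python
def accelerator_fit(components, fit_raw_per_bit):
    """Sum raw FIT per bit x size in bits x SDC probability over components."""
    return sum(
        fit_raw_per_bit * c["bits"] * c["sdc_probability"] for c in components
    )


# Hypothetical accelerator with two components.
components = [
    {"name": "datapath latches", "bits": 1000, "sdc_probability": 0.10},
    {"name": "global buffer", "bits": 2000, "sdc_probability": 0.05},
]
print(accelerator_fit(components, fit_raw_per_bit=0.001))
```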
In addition, in this work SDCs are classified by comparing the faulty and golden model outputs as:
• SDC-1: Fault caused a misclassification in the top-ranked output class.
• SDC-5: Fault caused the top-ranked element not to exist in the top-5 predicted output classes.
• SDC-10%: Fault caused a variation of more than 10% in the output confidence score of the top-ranked output class compared to the golden model.
• SDC-20%: Fault caused a variation of more than 20% in the output confidence score of the top-ranked output class compared to the golden model.

• RTL implementation of DNNs [94],
• Multi-Processor System-on-Chips (MPSoCs) for DNNs [58].
In the first group, a configuration of TPU is utilized in [8] [93][153] [154], and a model of a 2D systolic array is implemented in [151] [152]. Reference [8] also uses the Eyeriss [176] architecture for the accelerator. In this group, FI is performed at RTL, and all works inject random permanent faults into PEs/MACs of the arrays, except [93], which injects random transient faults into buffers, control registers, and data registers.
The second group, which includes [94], implements DNNs in RTL to enable a fault simulation study in approximated DNNs. In this work, SEUs injected into Look-Up Tables are simulated and studied.
In the third group, which exploits MPSoCs, faults are emulated in the components of the target multicore processor. Authors in [58] propose a three-level pipeline FI framework that simulates permanent faults in the hardware model of an MPSoC and evaluates the reliability at the software level. In their framework, the RTL model of the platform is provided together with the fault injector unit at the lowest level. The software implementation of the DNN resides in the middle level of the framework, which performs a pipelined inference and runs each layer of the network on a separate core. At the top level of the framework, synchronization of layers and reliability evaluation are performed.
Evaluation: Most works in this class evaluate the reliability by accuracy loss. Nonetheless, fault classification is performed in [93][94] [58]. Authors in [58] adopted the classification of [87], which was discussed previously in the Hardware-Aware platform (subsection 5.1.1). Furthermore, they added two more classes for the faults that cause Hang (the HDL simulation never finishes) and Crash (the HDL simulation immediately stops). Authors in [94] classify the faults similarly to the general fault classification scheme (Masked, SDC, crash) with different terminology.
In addition, [93] classifies SDCs by how they impact classification outputs compared with the golden model:
• Tolerable Misclassification: The input is misclassified the same as the golden model with different output confidence scores,
• No Impact Misclassification: The input is misclassified in both golden and faulty models but into different classes,
• Critical Misclassification: The input is correctly classified in the golden model but misclassified in the faulty model,
• Tolerable Correct Classification: The input is correctly classified in both golden and faulty models with different output confidence scores,
• Beneficial Correct Classification: The input is misclassified in the golden model but correctly classified in the faulty model.

Fault Emulation.
In this subsection, research works that assess the reliability of DNNs by emulating FI in hardware accelerators are explored. FPGA and GPU platforms are described, respectively.
5.1.2.1 FPGA Platform. DNNs are implemented fully or partially (e.g., one layer) on FPGAs to perform the inference phase as described in subsection 2.2, and faults are emulated at different locations of the accelerator. In most of the works on the FPGA platform, the fault injector unit is implemented in software that runs on a processor, and faults are injected into the FPGA running the DNN under analysis. This HW/SW co-design process benefits from the high-performance execution of DNNs and fast fault injection. It is worth mentioning that some works implement only a part of the DNN (e.g., one specific layer) on the FPGA [97][98] [108].
In this group of works, Zynq-based architecture System-on-Chips (SoCs) [177], which take advantage of an ARM processor co-existing with the FPGA, are deployed. We categorize this group of studies into three classes:
• A host computer (e.g., a PC) initializes the faults [97]
In the first class, faults are generated by a host computer of the accelerator design. Then, the faults, network parameters, and FPGA configuration bits are sent to the board. The FPGA starts running, and the on-board processor collects the results. The on-board processor plays the role of a controller between the FPGA and the host computer. At the end, the results are passed back to the host computer for further processing and reliability evaluation. All works of this class emulate transient faults (SEU) in configuration bits of the FPGA and exploit the accuracy loss of the DNN for the reliability evaluation. Nevertheless, authors in [107] explore transient faults in flip-flops exhaustively besides random transient faults in configuration memory, and classify them as tolerable, critical, and crashes.
FireNN is proposed in [97][98] as a platform for deploying DNNs on Zynq-based architecture SoCs along with a host computer in such a way that the DNN runs partially on the FPGA to perform a reliability evaluation. As shown in Fig. 11, the FireNN machine runs the neural network and communicates with the FireNN engine for reliability evaluation of the layer under analysis running on the FPGA. Faults are generated by the host computer and are injected into the FPGA through the engine. This platform injects SEUs in weights, layer inputs, and configuration bits.
In the second class, faults are generated and injected into the FPGA's configuration bits or on-chip memories by the embedded processor. The embedded processor or a host computer is responsible for the reliability evaluation. The method proposed in [162][163] injects permanent faults into the configuration bits of the FPGA as well as into the on-chip memory blocks through the interfaces between the embedded processor and the FPGA on a Zynq SoC. References [95][103] [104] provide a similar design to inject transient faults into configuration bits of the FPGA. The effects of transient faults in both on-chip memories and configuration bits of an FPGA running pruned DNNs are studied in [100]. Authors in [95] provide random-accumulated FI and exhaustive FI approaches on the configuration bits to emulate neutron and ionizing radiation. Moreover, permanent and transient faults in on-chip memory (HyperRAM) are studied in [105][106] with a software emulator and are validated by radiation results.
It is worth mentioning that injecting faults into the configuration memory is a repetitive process, where in each FI experiment the faulty configuration bits are loaded into the configuration memory. Then, the system is run and the results are collected. Thereafter, the next fault(s) are injected into the fault-free configuration bits, which are loaded into the corresponding memory to analyze the newly injected fault(s).
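The repetitive inject, run, collect, and restore procedure described above can be sketched in software as follows. Here `run_design` is a hypothetical callable that stands in for loading a bitstream into the FPGA and returning the DNN output; the bitstream is modeled as a plain byte sequence, which is a simplification of real configuration frames.

```python
import random


def run_campaign(golden_bitstream, run_design, num_experiments, rng):
    """Inject one SEU per experiment and return the fraction of output mismatches."""
    golden_output = run_design(bytes(golden_bitstream))
    mismatches = 0
    for _ in range(num_experiments):
        # Always start from the fault-free configuration bits.
        faulty = bytearray(golden_bitstream)
        bit = rng.randrange(len(faulty) * 8)
        faulty[bit // 8] ^= 1 << (bit % 8)  # flip a single configuration bit
        if run_design(bytes(faulty)) != golden_output:
            mismatches += 1
    return mismatches / num_experiments


# Toy stand-in for the design: only the first byte influences the output.
bitstream = bytes(64)
rate = run_campaign(bitstream, lambda bits: bits[0], 1000, random.Random(1))
print(f"fraction of faults that changed the output: {rate:.3f}")
```

Because each experiment copies the golden bitstream before flipping a bit, faults do not accumulate across experiments; an accumulated-fault campaign would instead keep modifying the same buffer.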
A framework named Fiji-FIN is proposed in [102], and the underlying method is also used in [9][101]. This framework is capable of injecting transient faults into both the configuration bits of the FPGA and the on-chip memories. In this method, the FINN framework [178] is used to develop and train the BNN, and the proposed framework manipulates FINN's output to prepare it for the fault injection. The bitstream file of the FPGA is obtained by an HLS tool and imported to the FPGA. While the system is running, the faults are generated and injected by the embedded processor, and the reliability is evaluated in comparison with the golden model. Fig. 12 depicts in detail the steps of this FI framework.
In the third class, references [155] and [156] inject permanent faults and the work in [96] injects transient faults into the hardware implementation of the network. Authors in [155] use the FINN framework to implement the QNN with 2-bit weights and activations, and a block is added to the hardware design for injecting stuck-at faults into the output of PEs. Reference [156] injects permanent faults into the registers of the RTL model of the network. Authors in [96] explore the effect of transient faults in the configuration bits of FPGAs in which different accelerator architectures (Softcore FGPU and ZynqNet HLS) are implemented.
Evaluation: For evaluating the reliability of DNNs on the FPGA platform, accuracy loss is exploited in [9] [163]. References [103][104] classify SEUs in configuration bits of the FPGA as critical if a fault caused misclassification with respect to the golden model; otherwise, the fault is tolerable. In addition, Benign Errors are considered in [104], which are the faults that caused correct classification of inputs that were misclassified in the golden model. Another fault classification is presented in [97][98] that considers not only critical and tolerable faults but also the faults that prevent the accelerator from generating the classification output. In this regard, the effect of faults on the system performance degradation is the criterion for classifying faults in [99].
Reliability is evaluated by different accuracy-loss metrics depending on the application of the target networks in [162] [163]. These works consider top-5 and top-1 accuracy loss for image and audio classification tasks, respectively. For object detection, mean Average Precision (mAP) is adopted, and for image generation, the Structural Similarity Index (SSIM). Based on the metric adopted for each network, the faults are classified into three classes with different ranges of accuracy loss (≤1%, 1%∼5%, ≥5%) caused by FI. In addition, they categorize the faults which cause a system exception that may delay or terminate processes.
To characterize the status of DNN layers' vulnerability, authors in [9] classify the parameters of layers (i.e., weights and activations) separately by performing FI. In this work, parameters of layers are labeled as Low-risk, Medium-risk, and High-risk if the FI process into the target layers' parameters results in less than 1%, 1%∼5%, and more than 5% accuracy loss, respectively.
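The three risk levels of [9] amount to a threshold rule on the measured accuracy loss. A minimal sketch follows; the function name is hypothetical, and assigning the boundary values of exactly 1% and 5% to the medium class is an assumption, since the source does not specify them.

```python
def risk_level(accuracy_loss_percent):
    """Label layer parameters by the accuracy loss (in %) observed under FI."""
    if accuracy_loss_percent < 1.0:
        return "low-risk"
    if accuracy_loss_percent <= 5.0:
        return "medium-risk"
    return "high-risk"


# Hypothetical per-layer accuracy losses measured by FI.
losses = {"conv1/weights": 0.3, "conv2/activations": 2.7, "fc/weights": 8.1}
for name, loss in losses.items():
    print(name, risk_level(loss))
```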
The metric AVF (defined in 2.3) is adopted in [103][104] and expresses the probability of a fault propagating to the output. These works obtain the AVF through FI by dividing the number of faults propagated to the output by the total number of injected faults. Furthermore, authors in [104] provide a formula to estimate the cross-section (defined in 2.3) of the configuration memory in (3), where the AVF obtained by FI is multiplied by the number of bits utilized by the design times the cross-section of bits of the configuration memory. This calculation can lead to further reliability metrics that the authors present in [104].
In this regard, [105] estimates the SER of the HyperRAM storing the weights similarly to (3), based on information extracted from radiation experiment reports. By providing the rate of faults likely to occur in the memory, they inject faults into the weights of a CNN on an FPGA accelerator.
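The two calculations described in this paragraph can be written down directly. The sketch below uses hypothetical function names and made-up numbers; the per-bit cross-section in particular is a placeholder, not a value from [104].

```python
def avf(propagated_faults, injected_faults):
    """Fraction of injected faults that propagated to the output."""
    return propagated_faults / injected_faults


def design_cross_section(avf_value, used_bits, bit_cross_section_cm2):
    """AVF x number of configuration bits used by the design x per-bit cross-section."""
    return avf_value * used_bits * bit_cross_section_cm2


a = avf(propagated_faults=50, injected_faults=1000)
sigma = design_cross_section(a, used_bits=1_000_000, bit_cross_section_cm2=1e-15)
print(a, sigma)
```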
Moreover, reference [95] expresses the reliability of a neural network with n layers ($l_1, l_2, \ldots, l_n$) that are implemented serially as different modules on the FPGA as an exponential distribution in (4):

$$R_{NN}(t) = \prod_{i=1}^{n} R_{l_i}(t) = e^{-\sum_{i=1}^{n} \lambda_{l_i} t} \qquad (4)$$

where $\lambda = 1/\mathrm{MTTF}$ (MTTF = Mean Time to Failure).

5.1.2.2 GPU Platform. In this subsection, we explore FI in DNNs in which faults are emulated and injected into the GPU. Nearly all works on this platform have studied the effect of transient faults on GPUs. Permanent faults are studied in [137][157] [158][159][160] [179]. To perform FI on GPUs, researchers adopt an existing FI framework, except in [117][137], which implemented their own FI process on CUDA and TensorRT [180], respectively. FI frameworks for GPUs including FlexGripPlus [181], NVBitFI [182], and CAROL-FI [183] are used in [114,157], [113,115,116,120], and [122], respectively. In addition, an FI framework is proposed in [179] that adapts and customizes NVBitFI for studying permanent faults in GPUs and is leveraged in [158][159] [160]. Moreover, a cross-layer fault injector framework, CLASSES, is presented in [184] to inject SEUs at the architecture level, enabling the study of the corresponding fault effects in [112]. In all works, the rate of injected faults and the number of experiments in the target locations vary and depend on the confidence level and error margin, as mentioned in [11] [44][109] [121][122].
SASSIFI [185] is the most frequently used framework for FI into GPUs running DNNs that is used in [11]
• Masked: Fault does not affect the output,
• SDC: Output confidence score differs from that of the golden model,
• DUE: The program hangs or the system reboots (also called Crash in [11][121]).
Furthermore, SDC is also categorized regarding the effect of faults on the accuracy of the DNN for the object recognition task in [109] [44]. They define three categories of SDCs based on the effect of faults on the output confidence score and ranking of objects:
• Non-critical: Output confidence score changed, but no misclassification occurred and no object ranking was modified,
• Light-critical: Object ranking was modified, but no misclassification occurred,
• Critical: The fault impacted the output confidence score and caused misclassification.
On the other hand, the fault classification of SDCs proposed in [122] goes beyond the classic SDCs and is based on the impact of faults on the precision and recall for object detection tasks in a self-driving car, as follows:
• Non-critical: Precision remains above 90% (a new object is detected that is not in the original classification) and recall remains 100% (all previous objects are detected).
• Critical: Precision is lower than 90% (many wrong objects detected) and recall is not 100% (real objects are not detected).
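A minimal sketch of this precision/recall criterion is given below. It assumes that detected objects have already been matched between golden and faulty runs and are represented by hashable identifiers; the bounding-box matching itself is abstracted away. The source only defines the two classes above, so treating every case that is not non-critical as critical is an assumption of this sketch.

```python
def detection_criticality(golden_objects, faulty_objects, precision_threshold=0.9):
    """Classify an SDC from the objects detected by golden and faulty runs."""
    golden, faulty = set(golden_objects), set(faulty_objects)
    true_positives = len(golden & faulty)
    # Precision: share of faulty-run detections that exist in the golden run.
    precision = true_positives / len(faulty) if faulty else 0.0
    # Recall: share of golden-run objects still detected in the faulty run.
    recall = true_positives / len(golden) if golden else 1.0
    if precision > precision_threshold and recall == 1.0:
        return "non-critical"
    return "critical"


golden = [f"obj{i}" for i in range(10)]
print(detection_criticality(golden, golden + ["extra"]))  # one spurious object
print(detection_criticality(golden, golden[:-1]))         # one missed object
```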
Furthermore, new classes of faults are presented in [137], which considers the margins of the bounding box in the DNN for object detection. Authors compare the overlaps of the bounding boxes of the detected objects in each image for golden and faulty models and categorize the SDCs based on a threshold. Their fault classification method is depicted in Fig. 13.

Fig. 13. Fault classification in the object detection task based on bounding boxes [137]

Vulnerability factors are also adopted to analyze the reliability of DNNs on the GPU platform [11] [130]. The vulnerability of instructions is studied in [130]. To emulate faults modeling soft errors in target processors, ARM-FI is developed and adopted in [128][129] [130], and SOFIA [92] is exploited in [37][92] [123][124] [125][126] [127] as fault injection frameworks. Each of the aforementioned fault injectors enables fault emulation in different components of processors.
Evaluation: All works in this class have evaluated the reliability by fault classification. The classification is performed similarly to the general scheme of classifying faults in the previous platforms (Masked, tolerable SDC, critical SDC, and DUE).
Furthermore, references [37][92] classify the faults in an object detection task for autonomous vehicles as:
• Incorrect probability: All objects detected correctly with different output confidence scores,
• Wrong detection: Misclassification or missing an object,
• No prediction: No object detection.
Mean Work To Failure (MWTF) is also exploited as a reliability metric to show the amount of work a neural network can perform until meeting a failure; it is defined in terms of the probability of an erroneous classification due to faults. MWTF is adopted as a relationship between performance and reliability in [129] [130]. AVF is obtained as the reliability metric for the register file in [124][129] [130]. Program Vulnerability Factor (PVF) is leveraged to express the vulnerability of operations and instructions in [130].

Irradiation.
The most realistic way of fault injection is to irradiate the devices under a beam of particles, e.g., neutrons or ions. In this subsection, the research works which study the reliability of DNN accelerators, i.e., FPGA and GPU, under radiation are described.[133] and with protons in [135]. References [132] and [135] have applied fault-aware training to DNNs and studied its impact under radiation. HyperRAM, which includes constant and dynamic variables (e.g., weights and biases), is bombarded with ionizing particles in [106] [134]. The research works set up the configuration of the system before the experiment, mostly based on HW/SW co-design, and save the results for further analysis. Fig. 14 shows an example of the setup of the FPGA irradiation.
Evaluation: Radiation experiments enable reliability evaluation by SER or FIT metrics [103][106] [108] [134]. To formulate the SER, the cross-section is defined as the ratio of the observed faults ($N_{faults}$) to all particles that hit the surface ($\Phi$), as expressed in (6) [108]:

$$\sigma = \frac{N_{faults}}{\Phi} \qquad (6)$$

The cross-section $\sigma$ is expressed in units of $cm^2$ and is the probability that a particle may cause an observable error [103]. The cross-section is exclusively adopted in [131] [132].
The cross-section leads to the SER or FIT calculation when it is multiplied by the particle flux that the device will experience in its environment. SER represents the number of failures of the device in $10^9$ hours, as shown in (7).
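The chain from beam counts to a failure rate can be sketched as follows. The function names and the example numbers are hypothetical. The default flux of 13 neutrons/(cm²·h) is the commonly quoted sea-level reference value and is an assumption of this sketch; the flux appropriate for a given study depends on altitude and environment.

```python
def cross_section_cm2(observed_errors, fluence_per_cm2):
    """Observed errors divided by the particle fluence delivered to the device."""
    return observed_errors / fluence_per_cm2


def fit_rate(cross_section, flux_per_cm2_per_hour=13.0):
    """Expected failures per 10^9 device hours in the target environment."""
    return cross_section * flux_per_cm2_per_hour * 1e9


sigma = cross_section_cm2(observed_errors=100, fluence_per_cm2=1e10)
print(f"cross-section: {sigma:.2e} cm^2, FIT: {fit_rate(sigma):.1f}")
```

The same functions can be applied per fault class (e.g., SDC and DUE separately) by passing the number of errors observed in that class.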
Most research works that study irradiation on FPGAs evaluate the reliability of devices under test by the above metrics. In addition, some works classify the faults observed in the irradiated FPGA by inspecting the outputs [103][133] [135]. These works provide fault classification based on output confidence scores of the neural network. [103] sets up a HW/SW co-design implementation on a target board and identifies the faults causing no misclassification (tolerable) and misclassification (critical). Thereafter, the FIT of different classes of faults is obtained. [133][135] also present the cross-sections of the device for different classes of faults (including tolerable errors, critical errors, and crashes). Moreover, the reliability is estimated by the aforementioned metrics in [95] as expressed in (4).
[137] tests the GPU equivalent to 2,000 years of exposure to terrestrial neutron, or [11] reports data that cover more than 110,000 years of GPU operation. Fig. 15 illustrates the radiation test setup in [11][121] [136].
Evaluation: Research works of this group present reliability evaluation of DNNs on GPUs by FIT as well as fault classification, similar to the works on FPGA radiation. Authors in [11] [121] identify faults that caused SDC and Crash and report their FIT separately. [115] and [122] report the FIT of faults that caused SDC and DUE separately for different data representations of the DNN, and in [137] faults observed under irradiation are classified based on Fig. 13. SDC rate is also the adopted evaluation metric in [117].

TPU Platform.
The reliability of Google's Tensor Processing Unit (TPU) is studied under neutron beam radiation in [139] and [138]. These works experimented on the Coral TPU chip, a low-power accelerator for DNNs, with several neural networks for image classification and object detection tasks.
Evaluation: The research works performing radiation experiments on the Coral TPU have evaluated the reliability by FIT and cross-section as well as by fault classification. In this regard, SDC and DUE fault effects are reported based on FIT and cross-section.

Analytical Methods
Analytical methods in reliability assessment model the reliability mathematically and do not inject faults into the platform to be simulated to evaluate the reliability. These methods rely on the function and algorithm of DNNs and, if needed, also consider the structure of the accelerator. Nevertheless, they carry out fault injection to assess the efficacy of the methods. For the sake of generalization, all works in this group analyze the relations of neurons and layers to find their effect on and contribution to the output. In this regard, they estimate the vulnerability of neurons and analyze how a faulty neuron may impact the output to find critical neurons. Therefore, they link the reliability of the network with the vulnerability of its neurons and provide an analytical model for calculating the reliability of DNNs.
In the first approach, DNNs are analyzed based on an algorithm called Layerwise Relevance Propagation (LRP) that leads to obtaining critical scores for neurons/fmaps. The second approach is based on the gradients of weights/fmaps with respect to the output, leading to their sensitivity. Research works in the third approach estimate the vulnerability of DNNs by finding correlations between some information from DNNs and the vulnerability of layers/fmaps. In the last approach, ML-based techniques are adopted in the context of fault analysis in DNNs.
In the LRP-based analysis, a hypothesis is raised in [189] proposing that the higher the contribution of neurons to the DNN's output, the more impact they have on the classification accuracy. Accuracy loss is one of the most important metrics in reliability evaluation. Therefore, the more impact a neuron has on the accuracy, the more vulnerable it is, which means it has more influence on the reliability of the network. Hence, the authors adopted the Layerwise Relevance Propagation (LRP) algorithm to obtain the contribution of each neuron to the output. LRP indicates the proportion of each connected neuron in constructing the value of the target neuron and calculates this ratio for all neurons from the last layer to the first. LRP specifies $r_{i,j}(x_0, t)$ for each neuron j in layer i, which is its output contribution score between 0 and 1 for the input $x_0$ and output class t. Then, the average score of each neuron over the entire training set of M inputs is obtained, representing the resilience of the corresponding neuron, as (8):

$$r_{i,j} = \frac{1}{M} \sum_{m=1}^{M} r_{i,j}\left(x_0^{(m)}, t^{(m)}\right) \qquad (8)$$

Thereafter, the list of neurons sorted by their $r_{i,j}$ ranks them from most to least vulnerable, so that the most vulnerable neurons can be protected to improve reliability. Furthermore, based on this analytical method, another reliability improvement method is presented in [190] that balances the resilience distribution inside the DNN. Similarly, [186] proposes an approach to extract the saliency or importance of each neuron and proposes a mapping scheme for neurons on PEs of a systolic array to minimize the score of corrupted weights.
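The averaging and ranking steps described above can be sketched as follows. The sketch assumes that the per-input LRP scores have already been computed by an LRP implementation, which is not shown; the dictionary representation keyed by (layer, neuron) and the function names are hypothetical.

```python
def average_scores(per_input_scores):
    """Average each neuron's contribution score over all M inputs.

    per_input_scores: list of dicts mapping (layer, neuron) -> score in [0, 1],
    one dict per input sample.
    """
    m = len(per_input_scores)
    totals = {}
    for scores in per_input_scores:
        for neuron, score in scores.items():
            totals[neuron] = totals.get(neuron, 0.0) + score
    return {neuron: total / m for neuron, total in totals.items()}


def rank_neurons(avg):
    """Neurons sorted from highest to lowest average contribution score."""
    return sorted(avg, key=avg.get, reverse=True)


# Hypothetical LRP scores for two inputs and three neurons.
scores = [
    {(1, 0): 0.25, (1, 1): 0.75, (2, 0): 0.50},
    {(1, 0): 0.25, (1, 1): 0.25, (2, 0): 1.00},
]
avg = average_scores(scores)
print(rank_neurons(avg))
```

A protection scheme would then harden the neurons at the head of the ranked list until its overhead budget is exhausted.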
Authors in [187] extend the LRP algorithm based on different output classes of input images and provide the list of neurons' resilience scores (score maps) for individual classes separately, as well as the score map of the whole network regardless of the output classes. Then, all sorted score maps are combined in descending order to assign the maximum score to each corresponding neuron. Subsequently, a scheduling algorithm is applied to map neurons to PEs of an MPSoC based on the score maps.
In gradient-based analysis, three papers are identified. Explainable AI, which explains how the network computes the output from the input, is exploited in [194] to obtain the sensitivity of layers and the importance of weights. This work defines the sensitivity of layers according to the difference of the two highest output confidence scores of the last layer. Therefore, they obtain the average sensitivity of all layers and relate it to the importance of weights. Consequently, they identify the most important weights and their critical bits to be protected.
The sensitivity of filters and weights, which refers to the amount of accuracy drop when bit-flips occur in weights, is analyzed in [191]. In the method proposed in this paper, the gradient of weights with respect to the output is calculated over a dataset considering a cost function. Also, the expectation for the probability of weights to be faulty is obtained as a noise measurement. The sensitivity of each weight is then measured as given in (9).
Sensitivity analysis in this work leads to allocation of robust hardware to the more sensitive weights.
Manuscript submitted to ACM Journal [192][193] have presented three gradient-based approaches for vulnerability estimation of fmaps in a DNN.Gradient approach considers the absolute values of fmaps' gradients with respect to the cross-entropy loss at the output in a backpropagation as the vulnerability of fmaps.Gain approach measures the noise gain by obtaining the expectation for a set of corrupted neurons affecting the DNN's accuracy, based on the derivatives of outputs with respect to the neurons over a set of data and the variance of noise source.Modified Gain is also proposed based on the Gain approach to violate the independence between neurons and noise.The three mentioned approaches evaluate the vulnerability of fmaps in a DNN.
Authors in [192][193] also presented three estimation-based approaches for the vulnerability of fmaps.They estimate the relative fmaps' vulnerability by calculating the max neuron value, fmap range, and average L2 over the input samples.They have provided approximate yet scalable and fast approaches to estimate the vulnerability of fmaps.[195] presents an equation to estimate the misclassification rate of CNNs in case of soft error occurrence in a specific layer.The authors consider any operation resulting in a non-zero value as a critical computation, since soft errors may corrupt their results.The estimation is based on the proportion of critical operations (Crit_OPs) in the target layer i and subsequent layers relative to all operations in those layers, to model the misclassification rate (SERN ) in a CNN with n layers.Equation (10) provides a representation of this estimation.
An ML-based approach for analytical reliability analysis is presented in [196], where Open-Set Recognition (OSR) methods are explored to analyze the criticality of faults in DNNs' parameters. The concept of OSR is to identify whether the output classification corresponds to one of the trained classes of the DNN. This concept is adapted to analyze the output logits (output of softmax in the last layer) of DNNs to identify critical faults in the parameters. Four different OSR-based methods have been leveraged for this task and their efficacies are reported. In each method, a threshold for the output logits is obtained for identifying critical fault occurrence.
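The thresholding idea can be sketched as follows. The four OSR methods used in [196] are not named here, so this sketch substitutes a simple stand-in: the maximum softmax probability is used as the score, and the threshold is calibrated on fault-free outputs. The function names and the quantile parameter are assumptions for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of finite logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def calibrate_threshold(clean_logits, quantile=0.05):
    """Pick a low quantile of the max-softmax scores seen on fault-free runs."""
    scores = sorted(max(softmax(l)) for l in clean_logits)
    return scores[int(quantile * (len(scores) - 1))]


def is_critical(logits, threshold):
    """Flag a fault as critical if the output no longer looks like a known class."""
    if not all(math.isfinite(x) for x in logits):
        return True  # NaN or infinity in the output is treated as critical
    return max(softmax(logits)) < threshold


# Hypothetical fault-free outputs of a three-class model.
clean = [[5.0, 0.1, 0.2], [4.0, 0.0, 0.5], [6.0, 1.0, 0.0]]
threshold = calibrate_threshold(clean, quantile=0.0)
```

A faulty run whose output is nearly uniform, such as `[1.0, 1.1, 0.9]`, falls below the calibrated threshold and is flagged, while a confident output is not.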
All the works in this group validate their analytical methods against FI. The FI methods used in these works are similar to the FI methods presented and characterized in Section 5.1. It is shown that analytical methods can evaluate or estimate the vulnerability or sensitivity of different components of DNNs, including neurons, fmaps, and weights. Analytical methods are far more lightweight than FI and are accelerator-agnostic. Nevertheless, their analysis results can be utilized for designing robust DNN accelerators. Among the existing approaches, estimation-based analyses are faster than the others but less accurate when their results are compared with FI experiments. LRP-based and gradient analyses provide more accurate results, close to those of FI experiments, while still being faster and less complex than FI.

Hybrid Methods
In hybrid methods, both FI and analytical methods are carried out to assess the reliability of DNNs. To that end, [197] proposes a reliability assessment framework called Fidelity based on a hybrid method. This framework studies transient faults in both the datapath and the control path of accelerators. Fidelity performs fault injection in the software framework TensorFlow to obtain the probability of masking faults in the DNN. In addition, the framework is capable of analyzing the architectural model of the accelerator and mapping the Flip-Flops (FFs) of the datapath and control logic to the parameters of a high-level implementation of the DNN. Through fault injection and this analysis, it models the probability of activeness/inactiveness of FFs during the execution time as well as the probability of masking faults. Subsequently, the framework provides the FIT rate of the accelerator. Furthermore, the framework is validated by analyzing NVDLA [198], i.e., NVIDIA's open-source DNN accelerator. To further improve this method, a software model for NVDLA is proposed in [199] to enable the reliability study of accelerators at the software level and to provide a more accurate, more hardware-aware, and faster method to obtain the FIT rate of the accelerator.
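How activeness and masking probabilities might combine into a FIT rate can be sketched as below. This is a simplified assumption, not the model used by Fidelity in [197]: each group of FFs contributes its raw FIT scaled by the probability of being active and by the probability that a fault is not masked. The class, the function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FFGroup:
    name: str
    num_ffs: int      # number of flip-flops in the group
    p_active: float   # probability that an FF is active during execution
    p_masked: float   # probability that a fault in an active FF is masked


def accelerator_fit(groups, raw_fit_per_ff):
    """Sum the derated FIT contribution of every FF group (assumed model)."""
    total = 0.0
    for g in groups:
        for p in (g.p_active, g.p_masked):
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"probability out of range in group {g.name}")
        total += raw_fit_per_ff * g.num_ffs * g.p_active * (1.0 - g.p_masked)
    return total


# Hypothetical datapath and control-logic groups with a made-up raw FIT per FF.
groups = [
    FFGroup("datapath", num_ffs=10000, p_active=0.6, p_masked=0.9),
    FFGroup("control", num_ffs=500, p_active=0.9, p_masked=0.5),
]
fit = accelerator_fit(groups, raw_fit_per_ff=0.001)
```

In this toy setting the control logic holds only 5% of the FFs yet contributes more than a quarter of the total, because its faults are assumed to be masked far less often.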
Zhang et al. [200] propose a hybrid of ML-based analysis and FI to estimate the vulnerability of all parameters in DNNs with a low number of fault injections. The proposed method involves selecting a random set of parameters of the DNN and evaluating their vulnerabilities by injecting bit-flip faults and measuring the accuracy loss. Thereafter, some features of the selected parameters (absolute value, gradient, calculation times, and layer location) are extracted. A random forest is then trained and tested using the features and vulnerabilities of the corresponding parameters so that, once it reaches a high accuracy, it can be used for vulnerability estimation of the entire set of parameters.
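The FI half of such a method, flipping one bit of a parameter and measuring the accuracy loss, can be sketched as follows. The sketch assumes 32-bit floating-point weights and uses a toy one-layer classifier with made-up data; the feature extraction and the random forest from [200] are not reproduced. All names are hypothetical.

```python
import struct


def flip_bit_float32(value, bit):
    """Return value with one bit (0 = mantissa LSB, 31 = sign) of its float32 encoding flipped."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def predict(weights, x):
    """Toy one-layer classifier: class 1 if the weighted sum is positive."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def accuracy(weights, samples):
    return sum(predict(weights, x) == y for x, y in samples) / len(samples)


def accuracy_loss(weights, index, bit, samples):
    """Accuracy drop caused by a single bit-flip in weights[index]."""
    faulty = list(weights)
    faulty[index] = flip_bit_float32(faulty[index], bit)
    return accuracy(weights, samples) - accuracy(faulty, samples)


# Hypothetical weights and labeled samples (input vector, expected class).
weights = [0.5, -0.25]
samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([2.0, 1.0], 1), ([1.0, 4.0], 0)]
sign_flip_loss = accuracy_loss(weights, index=0, bit=31, samples=samples)
lsb_flip_loss = accuracy_loss(weights, index=0, bit=0, samples=samples)
```

The measured losses for a sampled subset of parameters would serve as training labels, with the extracted features as inputs, for a regressor that predicts the vulnerability of the remaining parameters.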

DISCUSSION
In this section, we first discuss the reliability assessment methods for DNNs based on the works reviewed and presented in Section 5. Then, we summarize the current status of the three main categories of reliability assessment (FI, analytical, and hybrid methods) and address their pros and cons in the research domain of this literature review. Thereafter, we present a qualitative comparison of the different reliability assessment methods for DNNs. Lastly, we list the open challenges as well as major potential research directions for the future.
Table 2 lists the pros and cons of all the methods categorized in this work and described in Section 5.
Among the reviewed papers, FI, as a conventional method for reliability assessment, is the one most frequently used for evaluating DNNs' reliability. FI provides realistic results about how faults impact the system's execution. FI methods can model various faults, which can be injected at different locations in the platform for reliability evaluation. Moreover, they are applicable to any platform at any system abstraction level and provide various reliability evaluations based on metrics and fault classifications. Therefore, many research works choose FI as their primary method of DNNs' reliability assessment. Nevertheless, FI methods come with a prohibitively high complexity due to the need to consider many cases of fault occurrence and to repeat the executions iteratively.
Analytical methods have been proposed as a way to cope with the high complexity of FI methods. These methods study the function of DNNs and assess the model's reliability using mathematical equations, leading to less complex approaches. Since analytical methods are developed mathematically, they have the potential to be generalized and adapted to various DNNs. Notably, analytical methods also have the potential to be exploited in the reliability assessment of the training phase. However, current analytical methods do not consider the accelerator models, and there is a gap in the use of reliability evaluation metrics. While this survey identifies a relatively small number of works relying on analytical methods for DNNs' reliability assessment, future research in this area should pay greater attention to the potential of analytical methods.
Finally, hybrid methods combine the strengths of both FI and analytical methods. By analyzing the network or the accelerator in addition to conducting fault injection, hybrid methods are capable of obtaining a comprehensive and realistic evaluation of reliability. Although only a limited number of research works in this category have been identified in the present survey, there is ample room to explore these methods for DNNs' reliability assessment in the future.
The statistics presented in Fig. 9 show that the majority of the identified research works employ FI to assess the DNNs' reliability. This can be attributed to the fact that, while DNNs are an emerging topic in computer science, reliability has been a classic problem for a long time. In addition, the investigation of the reliability of DNNs has been gaining traction since 2017, as indicated in Fig. 8. As a result, it is not surprising that the early research in this area has primarily focused on conventional methods such as FI. This could be the main reason for the significant imbalance in the number of published papers across the method categories. However, analytical and hybrid methods are expected to narrow this gap in the future as their application to DNN reliability assessment grows.
To address open challenges in reliability assessment methods for DNNs, this survey has identified the following main observations:
• Although some research works, such as [201], have studied the impact of faulty data during training, no work on the reliability assessment of the training phase has been identified that considers faulty parameters or computational units. This issue should be studied in future research;
• Nearly all included works focus on CNNs for image classification and object detection tasks. Other types of DNNs, such as RNNs and LSTMs, as well as other applications should also be evaluated in terms of reliability;
• The survey has identified no software FI framework in hardware-aware platforms. Hence, DNN accelerator simulators could be exploited or developed for reliability assessment of DNNs on these platforms;
• Fault emulation on FPGAs can take advantage of HLS designs. Therefore, a general FI framework for these platforms could be presented using HLS to minimize design time;
• Based on this survey, very few works study the reliability of the control part of DHAs, especially in FPGAs and ASICs. The control part may play a significant role in the reliability of DNN accelerators, and this should be explored in future studies;
• There is a limited number of analytical methods for DNNs' reliability assessment in this survey, all of which rely on finding critical neurons for fault-tolerant designs. Also, only one work tries to predict the accuracy loss caused by soft errors, and ML-based approaches are proposed in one work. Nevertheless, none of them can estimate the reliability of DNNs on their own or evaluate the reliability using specific metrics. ML-based algorithms can significantly assist in efficient reliability assessment, and therefore there is a huge potential for developing new analytical methods of reliability assessment for DNNs;
• Analytical methods could be generalized to other DNNs and applications rather than considering only CNNs and image processing;
• Hybrid methods appear to be powerful and capable of being exploited for developing reliability assessment frameworks. They can be one of the major methods for reliability assessment of DNNs in future works;
• Several FI research works use accuracy loss and fault classification to evaluate reliability, and some works also consider FIT. However, there is still an urgent need for DNN-specific metrics for reliability evaluation.
As an outcome of this survey, in addition to the listed open challenges, the major possible research directions for future studies in this domain are addressed below:
• Although analytical and hybrid methods show potential in the literature, they have not yet evolved to the extent that their effectiveness can be fully realized. Existing methods have shown that analytical and hybrid methods are capable of assessing the DNNs' reliability as realistically as FI and lead to effective fault-tolerant designs. Moreover, ML-based approaches in conjunction with analytical and hybrid methods are emerging. Therefore, researchers can be directed to develop novel analytical and hybrid methods, especially those that adopt ML-based algorithms, for reliability assessment of DNNs that are faster, less complex, more scalable, and more specific to DNNs than the conventional FI approaches.
• Bringing reliability as a classical issue into an emerging topic such as DNNs requires new tools to respond to the requirements of the new domain. Therefore, new research not only needs to adopt commonly used metrics from the reliability domain, but also needs to propose novel DNN-specific reliability evaluation metrics.
• New IoT and edge applications for DNNs are emerging day by day, and reliability is not only a concern for safety-critical applications. New research can focus on the unstudied applications of DNNs while taking reliability into consideration.

Fig. 7. Top-level overview of the reliability assessment methods in this work.

Fig. 8. Number of included papers over years

Fig. 9. Proportion of each method in the reliability assessment of DNNs among included works

Fig. 14. Block diagram of the setup of beam experiment in [108]

5.1.3.2 GPU Platform. The reliability of DNNs on GPUs is assessed under neutron beam radiation in [11][115][117][121][122][136][137]. All GPUs under test are manufactured by NVIDIA and have different architectures. These works also provide tests with ECC enabled and disabled, and with different data representations. Each work has specified the flux of neutrons and the radiation time, e.g.,

- High time complexity to achieve a sufficient confidence level
- Not realistic model of fault effects in high-level software implementations
- Inaccurate results at high-level software implementations
- Time-consuming design and development for HDL implementations

Fault Emulation
Pros:
- Providing realistic reliability analysis of DHA
- Enabling experiments for real conditions of DHA operation
- Providing full access to possible locations of the DHA for FI
- Enabling realistic studying of faults in datapath
- Providing fault-tolerant designs and evaluating them directly
- Providing several evaluation metrics and fault classifications
Cons:
- Time-consuming design and development
- Need for the physical DHA
- Different platforms need their own specific design and development to perform FI
- Need for platform-specific frameworks for FI

Irradiation
Pros:
- Performing realistic experiments as real physical faults are injected into the chip
- Suitable for developing fault models
- Enabling the study for validating simulation and emulation approaches
- Providing the real behavior of the DHA when faced with a physical effect
Cons:
- Need for specific facilities for performing radiation
- Low control over accuracy of fault injection in terms of number and locations of occurred faults
- Lack of the visibility of fault propagation

Analytical
Pros:
- Implementable at software level
- Scalable and less complex than FI
- Leading to fault-tolerant hardware designs
- Providing information for algorithm-level resiliency for DNNs
- DHA-agnostic
Cons:
- Not providing quantitative evaluation metrics
- Not considering DHA models
- Inaccurate in estimating the vulnerabilities of DNN components (neurons, fmaps, etc.)

Hybrid
Pros:
- Combining fast FI with an analytical approach
- Capability of reliability study for DHAs
- Possibility of evaluation by either vulnerability estimation or quantitative metrics
Cons:
- Need for detailed information of the DHA (depending on the method)
- Accuracy of the results could be low (depending on the method)

Table 1. Fault injection categorization with the corresponding references.

Table 3 presents a qualitative comparison between the categorized methods of reliability assessment for DNNs regarding the papers included in this survey.
Manuscript submitted to ACM Journal

Table 2. Pros and cons of reliability assessment methods for DNNs.

Table 3. Qualitative analysis comparing different reliability assessment methods for DNNs.