GraphPrior: Mutation-based Test Input Prioritization for Graph Neural Networks

Graph Neural Networks (GNNs) have achieved promising performance in a variety of practical applications. Similar to traditional DNNs, GNNs could exhibit incorrect behavior that may lead to severe consequences, and thus testing is necessary and crucial. However, labeling all the test inputs for GNNs can be costly and time-consuming, especially when dealing with large and complex graphs, which seriously affects the efficiency of GNN testing. Existing studies have focused on test prioritization for DNNs, which aims to identify and prioritize fault-revealing tests (i.e., test inputs that are more likely to be misclassified) to detect system bugs earlier in a limited time. Although some DNN prioritization approaches have been demonstrated effective, there is a significant problem when applying them to GNNs: They do not take into account the connections (edges) between GNN test inputs (nodes), which play a significant role in GNN inference. In general, DNN test inputs are independent of each other, while GNN test inputs are usually represented as a graph with complex relationships between each test. In this article, we propose GraphPrior (GNN-oriented Test Prioritization), a set of approaches to prioritize test inputs specifically for GNNs via mutation analysis. Inspired by mutation testing in traditional software engineering, in which test suites are evaluated based on the mutants they kill, GraphPrior generates mutated models for GNNs and regards test inputs that kill many mutated models as more likely to be misclassified. Then, GraphPrior leverages the mutation results in two ways, killing-based and feature-based methods. When scoring a test input, the killing-based method considers each mutated model equally important, while feature-based methods learn different importance for each mutated model through ranking models. Finally, GraphPrior ranks all the test inputs based on their scores. 
We conducted an extensive study based on 604 subjects to evaluate GraphPrior on both natural and adversarial test inputs. The results demonstrate that KMGP, the killing-based GraphPrior approach, outperforms the compared approaches in a majority of cases, with an average improvement of 4.76% to 49.60% in terms of APFD. Furthermore, the feature-based GraphPrior approach, RFGP, performs the best among all the GraphPrior approaches. On adversarial test inputs, RFGP outperforms the compared approaches across different adversarial attacks, with an average improvement of 2.95% to 46.69%.


INTRODUCTION
In recent years, graph machine learning [27,38] has been widely adopted for modeling graph-structured data. In this realm, the emergence of graph neural networks (GNNs) [71] has offered promising results in diverse domains, such as recommendation systems [25,85,91], social network analysis [47,84,93], and drug discovery [4,73]. GNNs, like typical neural networks [45,75], are abstractions of the underlying data. Thus, their inference can suffer from faults [28,53,58], which can lead to severe prediction failures, especially in security-critical use cases. Testing is considered to be a fundamental practice that is widely adopted to ensure the performance of neural networks, including GNNs. However, like traditional deep neural networks (DNNs), GNN testing also suffers from the lack of automated testing oracles, which necessitates the manual labeling of test inputs. This labeling process can require significant human effort, especially for large and complex graphs. Moreover, in certain specialized domains, such as protein interface prediction [62] in drug discovery, labeling relies heavily on domain-specific knowledge, further increasing its cost.
Prior works [6,26,44,81] have focused on test prioritization to relieve the labeling-cost problem for DNNs. Test prioritization approaches aim to prioritize test inputs that are more likely to be misclassified (i.e., fault-revealing test inputs) so such inputs can be identified earlier to reveal system bugs. Existing approaches are mainly divided into two categories: coverage-based and confidence-based test prioritization approaches. Coverage-based approaches prioritize test inputs based on neuron coverage by adapting coverage-based prioritization methods from traditional software testing [51,92]. Confidence-based approaches assume that test inputs for which the model is less confident are more likely to be misclassified and thus should be prioritized higher. Feng et al. [26] proposed the state-of-the-art confidence-based approach DeepGini, which considers that a test input is more likely to be misclassified by a DNN model if the model outputs similar prediction probabilities for each class. More recently, Wang et al. [81] proposed PRIMA, which leverages mutation analysis and learning-to-rank methods to prioritize test inputs for DNNs. However, despite its effectiveness in DNN test prioritization, PRIMA cannot be directly applied to GNNs, since its mutation operators are not adapted to graph-structured data and GNN models.
Furthermore, existing studies [36] have focused on metrics for data selection (e.g., margin and least confidence), which can also be used to detect possibly misclassified test data. Although the aforementioned approaches have been demonstrated to be effective for DNN models in some cases, they have the following limitations when applied to GNN models:
- First, to the best of our knowledge, current coverage-based approaches do not provide interfaces for GNN models and thus cannot be directly applied. Moreover, existing research [26] has demonstrated that coverage-based approaches are not effective compared to confidence-based approaches.
- Second, despite the effectiveness of confidence-based approaches on traditional DNNs, they do not take into account the interdependencies between test inputs of GNNs, which are particularly crucial for GNN inference. In other words, GNN test inputs are typically represented as graph-structured data consisting of nodes and edges, while confidence-based prioritization approaches usually deal with test sets in which each test is independent and has no connections with others.
- Third, the effectiveness of uncertainty-based metrics can be limited when facing some specific adversarial attacks. If the aim of an attack is to generate test inputs that maximize the probability of incorrect classification, then the utility of uncertainty metrics can be limited. This is because the underlying assumption of uncertainty-based metrics is that if a model is more uncertain about classifying a test, then this test is more likely to be misclassified. However, in such scenarios, even if a model is confident on a test, this test can still have a high probability of being misclassified.
To overcome the aforementioned problems, in this article, we propose GraphPrior (GNN-oriented Test Prioritization), a set of test prioritization approaches specifically for GNNs. GraphPrior identifies and prioritizes possibly misclassified test inputs via mutation analysis. Given a test set for a GNN model, GraphPrior regards a test input that kills more mutated models (i.e., slightly changed variants of the original GNN model) as more likely to be misclassified. Here, killing means that the original GNN model and the mutated model produce different predictions for the test input. To this end, we design a set of mutation rules to generate mutated models specifically for GNNs by slightly changing the training parameters of the original model. After obtaining the mutation results of each test input, GraphPrior introduces several ranking models (ML/DL models) [5,42,83] to rank the test set. The working principle of GraphPrior is inspired by mutation testing research, as this has been realized for both model-based [1,18,63] and code-based [2,17,64] testing. The key underlying principle in all cases is that test cases that distinguish the behavior of mutants from that of the original artifact are useful and more likely to detect other underlying faults [1,9,63].
While both GraphPrior and PRIMA (i.e., the state-of-the-art DNN test prioritization approach) use mutation analysis, GraphPrior differs from PRIMA in terms of its mutation rules, feature generation, and ranking models:
(1) GraphPrior's mutation rules can directly or indirectly affect the message passing between nodes in graph data. In contrast, the mutation rules of PRIMA are designed for traditional DNNs, where the test inputs are independent, and therefore, the mutation rules do not affect the relationships between tests.
(2) GraphPrior generates a mutation feature vector for each test input based on its mutation results, where the ith element in the vector denotes whether the ith mutated model is killed by this input. This feature generation strategy is intuitive and reproducible. In addition to this, the generation method exhibits several other advantages. First, by using binary indicators (1 or 0) as elements of the mutation feature vector, the information is transformed into a concise vector representation. Second, the fine-grained nature of the mutation feature vector allows for a detailed analysis of the effects of individual mutations. In particular, further analysis can be conducted to assess the contributions of each mutated model to GraphPrior. By tracing back to the corresponding mutation rules for the top critical mutated models, we can gain insights into which mutation rules made higher contributions to GraphPrior. The experimental results demonstrate its effectiveness.
(3) GraphPrior employs five ranking models and compares their effectiveness in utilizing mutation features for test prioritization, while PRIMA only uses a single ranking model. By comparing multiple ranking models, GraphPrior can identify the optimal ranking model for learning mutation features in test prioritization.
GraphPrior has broad applicability across a wide range of contexts, including software development, scientific research, and financial systems. For instance, GraphPrior can be employed to gain insights into the vulnerabilities of GNN models used in financial transaction fraud detection. In this specific context, where nodes represent accounts and edges represent transaction transfers, the first step is to utilize the GNN model under test to identify a group of potentially fraudulent accounts. Subsequently, these identified accounts serve as test inputs for GraphPrior. By prioritizing accounts that are more likely to be misclassified by the model (i.e., accounts falsely classified as fraudulent), GraphPrior places them at the top of the recommendation list. Consequently, by labelling and analyzing these bug-revealing tests earlier, the fraud analysis team can unveil the bugs and vulnerabilities of the GNN model more efficiently.
It is important to note that GraphPrior is specifically designed for GNNs, and its impact on DNNs has not been evaluated. This is because, in graph datasets, nodes are interconnected, and the mutation rules of GraphPrior can directly or indirectly affect the message passing between nodes in the prediction process. In contrast, in traditional DNNs, each sample in a dataset is typically independent, and as a result, such mutation rules are unlikely to affect the transmission of information between tests. Therefore, the effectiveness of GraphPrior's mutation rules for DNNs remains uncertain, as no related experiments have been conducted to evaluate it.
We conducted an extensive study to evaluate the performance of GraphPrior based on 604 subjects. Here, a subject refers to a pair of a graph dataset and a GNN model. We compare GraphPrior with six uncertainty-based metrics [26,80,82] that can be used to prioritize possibly misclassified test inputs and adopt random selection as the baseline method. Our experimental results demonstrate that GraphPrior performs well across all subjects and outperforms the compared approaches, on average.
As mentioned before, one essential problem of confidence-based approaches is that adversarial attacks may lead to a model being more confident in the incorrect prediction, resulting in the failure of the approach. Therefore, we also evaluate GraphPrior on test inputs generated from graph adversarial attacks of existing studies [3,48,86,100]. Furthermore, since the effectiveness of test prioritization methods may vary depending on the degree of the adversarial attack, we set different attack levels to generate adversarial data and compare GraphPrior with the other approaches. In addition to the evaluation of GraphPrior, we compare the effectiveness of different mutation rules in generating top contributing mutated models, aiming to identify which mutation rules contribute more to each GNN model. In the last step, we investigate whether GraphPrior and the uncertainty-based metrics can select informative retraining tests to improve a GNN model. Our experimental results demonstrate that GraphPrior achieved better effectiveness compared with the uncertainty-based test prioritization methods. We publish our dataset, results, and tools to the community on Github.
Our work has the following major contributions:
- Approach. We propose GraphPrior, a set of mutation-based test prioritization approaches for GNNs. To this end, we design a set of mutation rules that mutate GNN models by slightly changing their training parameters. We carefully select ranking models to analyze the mutation results for effective test prioritization.
- Study. We conduct an extensive study based on 604 GNN subjects involving natural and adversarial test sets. We compare GraphPrior with existing DNN approaches that could detect possibly misclassified test inputs. Our experimental results demonstrate the effectiveness of GraphPrior.
- Mutation rule analysis. We compare the effectiveness of the GNN mutation rules in generating top contributing mutated models, observing that the mutation rule HC (i.e., mutating Hidden Channels) makes top contributions to most GNN models in test input prioritization.

BACKGROUND
In this section, we introduce the key domain concepts for our work, including Graph Neural Networks and Test Input Prioritization for DNNs.

Test Input Prioritization for DNNs
In Deep Neural Networks (DNNs) testing, test input prioritization aims to prioritize tests that are more likely to be misclassified (i.e., bug-revealing test inputs) by the DNN model. In this way, more important test inputs can be labeled earlier in a limited time, which can improve the efficiency of DNN testing. In the literature, several prioritization approaches have been proposed to deal with the labeling-cost issues [6,26,81,94]. The majority of approaches for prioritizing tests in DNNs can be classified into two categories, coverage-based and confidence-based [81]. Confidence-based approaches, such as DeepGini [26], prioritize test inputs based on the model's confidence. Specifically, these methods identify inputs that are likely to be incorrectly predicted by the DNN model, given that the model outputs similar probabilities for each class. In contrast, coverage-based approaches, such as CTM [92], simply extend traditional software system testing methods to DNN testing and have been shown to underperform compared to confidence-based approaches [26]. Weiss et al. [82] conducted a comprehensive investigation of the capabilities of various DNN test input prioritization techniques, including some notable uncertainty-based metrics such as Vanilla Softmax, Prediction-Confidence Score (PCS), and Entropy. The Vanilla Softmax metric is calculated as the highest activation in the output softmax layer for a classification problem, subtracted from 1. PCS, however, is defined as the difference in softmax likelihood between the predicted class and the second runner-up class. Additionally, Entropy is considered as an alternative metric in the softmax layer proposed by the authors of DeepGini. These metrics have been demonstrated to be effective in identifying possibly misclassified test inputs and can aid in guiding test prioritization efforts.
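The uncertainty-based metrics described above can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the study: the function names are hypothetical, the DeepGini score uses the Gini impurity form 1 - sum(p_i^2) from the DeepGini paper [26], and PCS is read here as the gap between the two largest softmax probabilities.

```python
import math

def deepgini(probs):
    # Gini impurity of the softmax output: high when probabilities are similar.
    return 1.0 - sum(p * p for p in probs)

def vanilla_softmax(probs):
    # Highest softmax activation, subtracted from 1.
    return 1.0 - max(probs)

def pcs(probs):
    # Assumed reading: difference between the two largest probabilities.
    # A SMALL value signals an uncertain prediction.
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    # Shannon entropy of the softmax output (zero-probability terms skipped).
    return -sum(p * math.log(p) for p in probs if p > 0)

def prioritize(prob_rows, score, higher_is_more_suspicious=True):
    # Return test indices ordered from most to least likely misclassified.
    scores = [score(row) for row in prob_rows]
    return sorted(range(len(prob_rows)), key=lambda i: scores[i],
                  reverse=higher_is_more_suspicious)
```

For DeepGini, Vanilla Softmax, and Entropy, larger scores are ranked first; for PCS the order is inverted by passing `higher_is_more_suspicious=False`.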
The aforementioned uncertainty-based test prioritization can be adapted for test input prioritization for GNNs. GraphPrior differs from these approaches in that GraphPrior leverages mutation analysis to perform test prioritization. The mutation analysis of GraphPrior exploits the specific properties of GNNs. Specifically, GraphPrior's mutation rules can directly or indirectly affect the message passing between nodes in a graph. In contrast, uncertainty-based approaches rely on the prediction uncertainty of the DNN model to prioritize test inputs without accounting for the interdependence between nodes.
Currently, the state-of-the-art technique for DNN test prioritization is PRIMA, which prioritizes fault-revealing test inputs based on mutation analysis. However, PRIMA is not suitable for GNN test prioritization for two reasons: (1) Its input mutation rules are specifically designed for DNN testing datasets in which the samples are independent of each other. In contrast, graph datasets have complex interdependence between nodes, making PRIMA unsuitable for test prioritization in this context. (2) GNNs employ graph operations and message passing mechanisms to aggregate and update information from neighboring nodes, thereby facilitating improved representation and learning within graph structures. The model mutation rules employed in PRIMA do not accommodate the graph operation mechanisms intrinsic to GNNs.
In addition to the aforementioned test prioritization techniques, several active learning [80] methods can also be adapted to prioritize DNN tests, such as Least Confidence and Margin. Active learning aims to select the most informative samples to be labeled by a human expert. When applied to test prioritization, active learning can be used to identify the most critical and informative test cases that can reveal bugs in the system.

Overview
In this article, we propose GraphPrior, a set of test prioritization approaches for GNNs to prioritize test inputs. GraphPrior consists of six mutation-based test prioritization approaches: KMGP, LRGP, RFGP, LGGP, DNGP, and XGGP. These approaches are discussed later in this section. We present the overview of GraphPrior in Figure 1, in which the input of GraphPrior is a GNN test set, and the output is the test set that has been prioritized. Given a test set T for a GNN model G, the implementation process of GraphPrior is presented as follows:
Generating mutants for the GNN model G. First, GraphPrior generates mutated models (i.e., mutants) for the GNN model G based on carefully designed mutation rules (cf. Section 3.2).
Obtaining mutation results through killing mutants. For each test input, GraphPrior identifies which mutated models it kills. Here, a mutated model is killed by a test input if the prediction results of this input via the mutated model and the original model G are different. In this way, GraphPrior obtains the mutation result of each test input.
Generating feature vectors from the mutation results. For each test input, GraphPrior generates a mutation feature vector based on its mutation results. The ith element of this feature vector denotes whether this input kills the ith mutated model. More specifically, given a test input t ∈ T, if t kills a mutated model M_i, then the ith element of t's mutation feature vector is set to 1. Otherwise, the ith element is set to 0.
Ranking test inputs based on mutation feature vectors via ranking models. GraphPrior utilizes ranking models [5,42,83] to calculate a misclassification score for each test input based on its feature vector. This score indicates how likely a test input is to be misclassified by the GNN model. Finally, GraphPrior ranks the test inputs based on their misclassification scores in descending order and outputs the prioritized test set T.
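The killing check and the feature-vector construction can be illustrated with a minimal sketch. It assumes the predicted labels of the original model and of every mutant are already available as plain lists; the function name and data layout are hypothetical and chosen only for illustration.

```python
def mutation_features(original_preds, mutant_preds):
    """Build one binary mutation feature vector per test input.

    original_preds: labels predicted by the original model G, one per test input.
    mutant_preds:   one list of predicted labels per mutated model M_i,
                    each aligned with original_preds.
    """
    features = []
    for t, original_label in enumerate(original_preds):
        # Element i is 1 if test input t kills mutant i, i.e. the mutant's
        # prediction differs from the original model's prediction.
        row = [1 if preds[t] != original_label else 0 for preds in mutant_preds]
        features.append(row)
    return features


original = [0, 1, 2]                 # predictions of G on three test inputs
mutants = [[0, 1, 1],                # predictions of M_1
           [1, 1, 0],                # predictions of M_2
           [0, 1, 2]]                # predictions of M_3 (kills nothing)
vectors = mutation_features(original, mutants)
```

Each row of `vectors` is the mutation feature vector of one test input, and each column corresponds to one mutated model.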

Mutation Rules
In GraphPrior, mutation rules are employed to generate mutated models of a GNN model by making slight changes to its training parameters. We select the following parameters because they can impact the message passing in the GNN prediction process. More specifically, in the mutated GNN model, the manner in which nodes acquire information from their neighboring nodes is slightly different from that of the original GNN model. Although variations of GNNs can be obtained even without changing training parameters, the resulting model mutants cannot produce meaningful differences in the GNN model's behavior. By changing the selected training parameters to generate mutants, we can intentionally introduce meaningful modifications to the model's behavior in terms of the interdependencies between nodes during the prediction process. We present all the mutation rules of GraphPrior as follows:
- Self Loops (SL) [45,79]. SL is a Boolean parameter, which controls whether to add self-loops to the input graph. When the SL parameter is set to True, self-loops are introduced to each node in the graph. By incorporating self-loops, the inherent information of nodes can be effectively aggregated into their representation vectors, leading to a change in the weighting of their neighboring nodes, and thus affecting the interdependence of nodes in the prediction process.
- Bias (BIA) [30,45,79]. BIA is a Boolean parameter, which determines whether to introduce a predetermined offset to the representation vectors of nodes. When the BIA parameter is enabled (set to True), each node will be assigned a corresponding bias parameter to its representation vector, allowing the GNN model to better capture the inherent properties of the graph and improve the interdependence between nodes in the prediction process.
- Cached (CA) [45]. CA is a Boolean parameter that controls whether to cache the computation of node embeddings during the forward pass. When the CA parameter is set to True, the node embeddings are cached and reused during the backward pass to save computation time. Caching the computation of node embeddings can affect the interdependence between nodes by altering the order and efficiency of message passing.
- Improved (IMP) [45]. IMP is a Boolean parameter that controls whether to use the improved message passing strategy, thus affecting the interdependence between nodes in the prediction process.
- Normalize (NOR) [21,30]. NOR is a Boolean parameter, which determines whether to normalize the messages passed between nodes in the prediction process. When this parameter is set to True, the messages are normalized by the number of neighbors that a node has before being passed to the next layer. This normalization can impact the contribution of each neighbor to the node's final representation, thus affecting the message passing between nodes in the prediction process.
- Concat (CON) [79]. CON is a Boolean parameter, which controls how the representations of neighboring nodes are combined during message passing. When it is set to True, the representations of neighboring nodes are concatenated before being passed, resulting in a more expressive representation of the nodes, enabling the GNN to capture more nuanced interdependencies between them.
- Heads (HDS) [79]. HDS is an integer parameter that determines the number of attention heads used in multi-head attention. Increasing the number of heads allows the model to capture more complex interdependence among nodes in the graph. Each attention head can focus on a different aspect of the node neighborhood, enabling the model to learn different representations of the graph.
- Epoch (EP) [21,30,79]. EP is an integer parameter that controls the number of times a GNN model iterates over the training dataset. By increasing the number of epochs, a GNN model can better capture the interdependence between nodes for model inference.
- Hidden Channel (HC) [21,30,45,79]. HC is an integer parameter, which controls the dimensionality of the hidden representation in each layer of the GNN. Therefore, changing this parameter can impact the interdependence between nodes in a graph by enabling the GNN to learn more expressive node embeddings.
- Negative Slope (NS) [79]. NS is a float parameter, which controls the slope of the negative part of the activation function used in the Gated Linear Unit (GLU) operation. GLU is a common non-linear function used in GNNs for message passing. Specifically, the GLU operation is used to combine the node features with the weighted sum of their neighboring nodes' features, which is the message passed between nodes in the graph. The negative slope parameter determines the slope of the activation function for negative input values in the GLU operation, thus impacting the message passing between nodes.
Based on the above mutation rules, for a given test set and a GNN model, GraphPrior generates N mutated models of the original model. We consider that a test input kills a mutated model if the predictions for this input via the mutated model and the original GNN model are different. On this basis, GraphPrior obtains the mutation results of all the test inputs.
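One way to picture mutant generation is as a set of training configurations that each differ slightly from the original one. The sketch below is illustrative only: the parameter keys, the candidate values, and the choice to change exactly one parameter per mutant are assumptions made for this example, not details taken from the GraphPrior implementation. Each resulting configuration would then be used to train one mutated model.

```python
def generate_mutant_configs(base_config, candidate_values):
    """Return (rule_name, config) pairs, one per mutated training configuration.

    base_config:      hypothetical training parameters of the original model.
    candidate_values: hypothetical alternative values for each mutated parameter.
    Assumption: each mutant changes exactly one parameter of the base config.
    """
    mutants = []
    for name, values in candidate_values.items():
        for value in values:
            if value == base_config[name]:
                continue  # identical to the original model, not a mutant
            config = dict(base_config)  # shallow copy; values are immutable scalars
            config[name] = value
            mutants.append((name, config))
    return mutants


# Hypothetical keys standing in for the SL, HC, and EP mutation rules.
base = {"self_loops": True, "hidden_channels": 16, "epochs": 200}
candidates = {
    "self_loops": [True, False],
    "hidden_channels": [8, 16, 32],
}
mutant_configs = generate_mutant_configs(base, candidates)
```

Keeping the rule name next to each configuration makes it possible to trace a mutated model back to the rule that produced it, which is the kind of bookkeeping the mutation rule analysis relies on.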
Considering that the primary objective of generating mutated models is to obtain informative features for test prioritization, a statistical analysis is employed to validate their effectiveness. To achieve this, a series of repeated experiments are conducted, as outlined in Section 5. The results of these experiments demonstrate that GraphPrior's effectiveness is statistically significant, thereby confirming the statistical validity of the generated mutated models for the purpose of test prioritization.

Killing-based GraphPrior
This section presents the workflow of KMGP, the Killing Mutants-based GNN Test Prioritization approach. Notably, KMGP operates on a "killing-based" principle, where test inputs that can kill more mutated models are considered as more likely to be misclassified and will be prioritized higher. It is worth noting that KMGP assigns equal importance to each mutated model in the process of test prioritization, a distinct feature that distinguishes it from feature-based approaches, which will be elaborated upon in subsequent sections. Given a GNN model G and a test input set T = {t_1, t_2, ..., t_n}, the detailed execution of KMGP can be divided into three key stages: mutation generation, killing-based mutation analysis, and test prioritization.
Mutation generation. In the mutation generation stage, a group of mutated models {G_1, G_2, ..., G_N} is generated for the original GNN model G.
Killing-based mutation analysis. This stage involves obtaining the mutation results of each test input t ∈ T using the process outlined in Section 3.2. Subsequently, KMGP counts the number of mutants killed by each test input based on their mutation results.
Test prioritization. In the third stage, KMGP prioritizes all the test inputs in T based on the number of mutated models they killed, with those that kill more mutants being prioritized higher in the test sequence.
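The last two KMGP stages amount to counting and sorting, as the following minimal sketch shows. It assumes the mutation results are given as binary vectors (1 if the test input kills the corresponding mutant, 0 otherwise); the function name is hypothetical, and keeping the original order for ties is an assumption of this sketch.

```python
def kmgp_rank(mutation_results):
    """Order test input indices by the number of mutants they kill (descending).

    mutation_results: one binary list per test input; element i is 1 if the
    input kills mutated model G_i. Every mutant carries equal weight.
    """
    kill_counts = [sum(row) for row in mutation_results]
    # sorted() is stable, so inputs with equal kill counts keep their order.
    return sorted(range(len(mutation_results)),
                  key=lambda t: kill_counts[t], reverse=True)


results = [[0, 1, 0],   # t_1 kills one mutant
           [0, 0, 0],   # t_2 kills none
           [1, 1, 0]]   # t_3 kills two mutants
order = kmgp_rank(results)
```

Here `order` lists the third input first, then the first, then the second, which is the prioritized sequence handed to the labeling step.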

Feature-based GraphPrior
In comparison to the killing-based GraphPrior approach, the feature-based approaches are characterized by automatic mutation feature analysis.This process involves the generation of mutated feature vectors based on the execution of mutated models, followed by the use of ranking models (ML/DL models), which assign different importance to each mutated model for test prioritization.
Overall, the feature-based approaches' workflow entails three key stages: mutated model generation, mutation feature generation, and learning-to-rank.
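❶ Mutated model generation. A group of mutated models is generated for the original GNN model G based on the mutation rules (cf. Section 3.2).
❷ Mutation feature generation. For each test input t ∈ T, a mutation feature vector is generated from its mutation results. If t kills the ith mutated model, then the ith element of the vector is set to 1. Otherwise, it is set to 0.
❸ Learning-to-rank. In the final stage, the feature-based approaches input the mutation features of each test input to the ranking model (ML/DL models) [5,15,42,78,83]. The ranking models can automatically learn different importance for each mutation feature to output misclassification scores. Here, each mutation feature corresponds to the execution result of a mutated model, so we can consider that the ranking models learn the importance of each mutated model for test prioritization. Finally, the feature-based approaches rank all the test inputs based on their misclassification scores in descending order.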
In our study, we propose five feature-based GraphPrior approaches, which follow the workflow described above but leverage different ranking models. These five approaches are XGGP (XGBoost-based GNN Test Prioritization), LRGP (Logistic Regression-based GNN Test Prioritization), LGGP (LightGBM-based GNN Test Prioritization), RFGP (Random Forest-based GNN Test Prioritization), and DNGP (DNN-based GNN Test Prioritization). We briefly introduce the basic principle of the ranking models of these approaches as follows:
(1) XGGP leverages the XGBoost algorithm [15] as the ranking model. XGBoost is a highly effective gradient boosting algorithm that combines decision trees to enhance the accuracy of predictions. XGGP utilizes the XGBoost algorithm to predict the misclassification score for a given test input based on its mutation features. This score reflects the likelihood that the input will be misclassified by a GNN model.
(2) LRGP leverages the Logistic Regression algorithm [83] as the ranking model. Logistic regression leverages a logistic function to model the association between a categorical dependent variable and one or more independent variables.
(3) LGGP leverages the LightGBM algorithm [42] as the ranking model. LightGBM is a gradient boosting framework that employs tree-based learning algorithms. The fundamental principle of LightGBM is similar to XGBoost, which employs decision trees based on learning algorithms. However, LightGBM introduces a novel optimization in the framework, with a primary focus on enhancing the speed of model training.
(4) RFGP leverages the random forest algorithm [5] as the ranking model. Random Forest is an ensemble learning algorithm that constructs multiple decision trees using random subsets of the training data and input features. The predictions from individual trees are combined to produce the final prediction using averaging or voting.
(5) DNGP leverages a DNN model [78] as the ranking model. The DNN model can learn to rank test inputs based on their mutation features. After training, the DNN model can generate a score that reflects their misclassification probability. This score can then be used to rank test inputs in a test set.
Compared to the mutation features of PRIMA, the distinctive aspect of GraphPrior's mutation features lies in their utilized mutation rules, which are specifically designed for GNNs. These mutation rules have the potential to directly or indirectly impact the message passing mechanism between nodes in graph data. Our experiment results in Section 5 demonstrate the effectiveness of the feature-based GraphPrior approaches. The observed effectiveness can be attributed, in part, to the selection of mutation rules and ranking models. Specifically, our mutation rules have been designed to generate informative mutation features by changing the message passing between nodes in the GNN prediction process. Furthermore, our ranking models are able to utilize these mutation features for test prioritization effectively. After sufficient training, ranking models can output a misclassification score that indicates how likely a sample would be misclassified based on its mutation features. A score closer to 1 indicates a higher probability of misclassification. By sorting the misclassification scores of test inputs in descending order, the feature-based GraphPrior approaches can effectively prioritize tests that are more likely to be misclassified.
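The learning-to-rank step can be illustrated with a small, dependency-free logistic regression in the spirit of LRGP. This is a sketch under stated assumptions rather than the ranking model used in the study: it assumes the ranking model is trained on mutation feature vectors labeled 1 when the original GNN misclassified the input and 0 otherwise, and the learning rate, epoch count, and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, lr=0.5, epochs=500):
    """Fit weights by batch gradient descent on the logistic loss.

    features: binary mutation feature vectors of labeled inputs.
    labels:   1 if the original GNN misclassified the input, else 0 (assumed).
    """
    weights, bias, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def misclassification_scores(features, weights, bias):
    # Scores lie in (0, 1); values closer to 1 mean "more likely misclassified".
    return [sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            for x in features]

def rank_by_score(scores):
    # Descending order of misclassification score.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy training data: killing mutant 0 coincides with misclassification,
# while killing mutant 1 is uninformative.
train_x = [[1, 0], [1, 1], [0, 0], [0, 1]]
train_y = [1, 1, 0, 0]
w, b = train_logistic(train_x, train_y)
test_scores = misclassification_scores([[0, 1], [1, 0], [0, 0]], w, b)
ranking = rank_by_score(test_scores)
```

The learned weight for mutant 0 ends up much larger than the weight for mutant 1, which is the sense in which a ranking model assigns different importance to each mutated model, in contrast to the equal weighting of KMGP.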

Usage of GraphPrior
By utilizing ranking models, GraphPrior predicts a misclassification score for each test input within a given test set. These predicted scores are then utilized for test prioritization, whereby test inputs with higher scores are prioritized higher. Particularly, the ranking models are pre-trained before the execution of GraphPrior. The training process is standardized across all the different ranking models and follows a consistent set of procedures, which are presented in detail below.
In addition to the killing-based KMGP, GraphPrior involves five feature-based approaches. The core difference is that the killing-based approach regards the importance of each mutated model as equal, while the feature-based approaches learn different importance for each mutated model for test prioritization. More specifically, feature-based approaches extract features from mutation results and adopt ranking models [5,42,83] to utilize the mutation features for test prioritization. In this research question, we compare the effectiveness of killing-based and feature-based approaches to investigate the effect of ranking models in leveraging mutation results.
- RQ3: How does GraphPrior perform on test inputs generated from graph adversarial attacks?
When faced with graph adversarial attacks, confidence-based test prioritization approaches may be fooled, thus becoming more confident in incorrect predictions. Therefore, we evaluate to what extent the effectiveness of GraphPrior is affected by graph adversarial attacks. We compare GraphPrior and confidence-based approaches [26,36] on test inputs generated from graph adversarial attacks of existing studies [3,48,86,100] to demonstrate its effectiveness.
- RQ4: How does GraphPrior perform against different levels of graph adversarial attacks?
In this research question, we investigate the effectiveness of GraphPrior against different levels of graph adversarial attacks. To answer this research question, we set different levels of attacks to generate test inputs and compare GraphPrior with existing approaches to demonstrate its effectiveness.
- RQ5: Which mutation rules generate more top contributing GNN mutants?
We investigate the contributions of each mutation rule in generating effective mutants of GNNs. For each GNN model, we select the top contributing mutation features to it through the XGBoost ranking algorithm [15], which is an optimized ML algorithm for ranking tasks based on the implementation of gradient boosting. We match each selected feature with the corresponding GNN mutant and identify the mutation rule that generates it. In this way, we obtain which mutation rules generate more top contributing mutants for test prioritization.
- RQ6: Can GraphPrior and the uncertainty-based metrics be used in active learning scenarios to improve a GNN model by retraining?
In the face of a large number of unlabeled inputs and a limited time budget, it is not feasible to manually label all the inputs and use them to retrain a GNN. One established solution to reduce data labeling costs is active learning [67], which involves selecting informative subsets of training samples to improve the model performance. In this research question, we investigate the effectiveness of GraphPrior and the uncertainty-based metrics in selecting informative retraining inputs to improve the quality of a GNN model.

GNN Models and Datasets
In our study, we adopt 604 subjects in total to evaluate the effectiveness of GraphPrior and the compared approaches [26,36]. Table 1 exhibits their basic information. Among the 604 subjects, 16 were used in the experiments of RQ1, 16 in RQ2, 108 in RQ3, 432 in RQ4, 16 in RQ5, and 16 in RQ6. It is worth noting that a total of 64 subjects (those used in RQ1, RQ2, RQ5, and RQ6) were associated with clean datasets, while the remaining 540 subjects (those used in RQ3 and RQ4) were associated with adversarial datasets.
Our study involves four GNN models: GCN (Graph Convolutional Networks) [45], GAT (Graph Attention Networks) [79], GraphSAGE (Graph SAmple and aggreGatE) [30], and TAGCN (Topology Adaptive Graph Convolutional Network) [21], tested on four datasets, namely, Cora [88], CiteSeer [88], PubMed [88], and LastFM [70]. We present their descriptions as follows:
- GCN [45]. GCN is a class of convolutional neural networks that can work directly on graphs. It solves the problem of classifying nodes (such as documents) in graphs (such as citation networks) in which only a small number of nodes are labeled. The core idea of GCN is to use the edge information of a graph to aggregate node information and generate new node representations. GCN has been used in several existing studies [31,35,89].
- GAT [79]. GAT introduces a self-attention mechanism in the propagation process. Compared to GCN, which regards all neighbors of a node equally, the attention mechanism assigns different attention scores to each neighbor, thereby identifying more important neighbors.
- GraphSAGE [30]. GraphSAGE is a generalized inductive framework that generates node embeddings by sampling and aggregating features of neighbor nodes.
- TAGCN [21]. TAGCN introduces a systematic approach to design a set of fixed-size learnable filters to perform convolutions on graphs. These filters adapt to the topology of the graph as they scan it for convolution.

Datasets.
- Notably, we evaluate GraphPrior on different types of test inputs (i.e., both natural test inputs and adversarial test inputs). We adopted eight graph adversarial attacks, presented in Section 4.4.

Compared Approaches
In our study, we considered seven compared approaches in total, including one baseline (i.e., random selection), four DNN test prioritization approaches, and two active learning approaches. We select these approaches for the following reasons: (1) These approaches can be adapted for GNN test prioritization; (2) The selected approaches have been demonstrated as effective for DNNs in existing studies [26,36,82]; (3) The implementations of these approaches have been released by the authors.
- DeepGini. DeepGini [26] prioritizes test inputs based on model confidence. It leverages the Gini coefficient to measure the likelihood of a test input being misclassified and calculates the ranking scores by Equation (1):

$$\xi(x) = 1 - \sum_{i=1}^{N} p_i(x)^2 \tag{1}$$

where $\xi(x)$ refers to the likelihood of the test input $x$ being misclassified, $p_i(x)$ refers to the probability that the test input $x$ is predicted to be label $i$, and $N$ refers to the number of labels.
- Margin. Margin [80] regards a test input with a smaller difference between the two most confident predictions as more likely to be misclassified. The margin score is calculated by Equation (2):

$$M(x) = p_k(x) - p_j(x) \tag{2}$$

where $M(x)$ refers to the margin score, $p_k(x)$ refers to the most confident prediction probability, and $p_j(x)$ refers to the second most confident prediction probability.
- Least Confidence. Least Confidence [80] regards test inputs for which the model has the least confidence as more likely to be misclassified. Least confidence is calculated by Equation (3):

$$L(x) = \max_{i} p_i(x) \tag{3}$$

where $L(x)$ refers to the confidence score and $p_i(x)$ refers to the probability that the test input $x$ is predicted to be label $i$ via a model M.
- Vanilla Softmax. Vanilla Softmax [82] is computed by subtracting the highest activation probability in the output softmax layer from 1, resulting in a metric that is positively correlated with the misclassification probability. Equation (4) presents the calculation of the Vanilla Softmax metric:

$$V(x) = 1 - \max_{c} l_c(x) \tag{4}$$

where $l_c(x)$ belongs to a valid softmax array in which all values are between 0 and 1 and sum to 1.
- Prediction-Confidence Score (PCS). PCS [82] calculates the difference between the predicted class and the second most confident class in softmax likelihood.
- Entropy. Entropy [82] calculates the entropy of the softmax likelihood.
- Random selection [22]. In random selection, the execution order of the test inputs is determined randomly.
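The confidence-based metrics listed above all operate on the softmax output of a single test input, so they can be sketched in a few lines. The following is a minimal illustration written from the textual definitions, not the released implementations of the cited approaches; the function names and the `rank` helper are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def deepgini(probs: Sequence[float]) -> float:
    """Gini impurity of the softmax output; higher means more uncertain."""
    return 1.0 - sum(p * p for p in probs)


def margin(probs: Sequence[float]) -> float:
    """Gap between the two most confident classes; lower means more uncertain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def least_confidence(probs: Sequence[float]) -> float:
    """Confidence of the predicted class; lower means more uncertain."""
    return max(probs)


def vanilla_softmax(probs: Sequence[float]) -> float:
    """One minus the highest softmax value; higher means more uncertain."""
    return 1.0 - max(probs)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of the softmax output; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rank(rows: Sequence[Sequence[float]],
         metric: Callable[[Sequence[float]], float],
         riskier_when_high: bool = True) -> List[int]:
    """Return test-input indices ordered from most to least suspicious."""
    return sorted(range(len(rows)),
                  key=lambda i: metric(rows[i]),
                  reverse=riskier_when_high)


confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]
order_by_gini = rank([confident, uncertain], deepgini)
order_by_margin = rank([confident, uncertain], margin, riskier_when_high=False)
```

Note that the metrics disagree on direction: DeepGini, Vanilla Softmax, and entropy grow with uncertainty, whereas margin and least confidence shrink, which is why `rank` takes an explicit direction flag.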

Graph Adversarial Attacks
In RQ3 and RQ4, we evaluate the effectiveness of GraphPrior on test inputs generated through diverse graph adversarial attacks, in which attackers aim to generate graph adversarial perturbations by manipulating the graph structure or node features to fool the GNN models. We introduce all the attacks we applied in our experiments as follows:
- Disconnect Internally, Connect Externally (DICE) [100]. The DICE attack is a type of white-box attack whereby the adversary has access to all information about the targeted GNN model, including its parameters, training data, labels, and predictions. Specifically, the DICE attack randomly adds edges between nodes with different labels or removes edges between nodes sharing the same label. In this way, the attack generates adversarial perturbations that can fool the targeted GNN model.
- PGD attack [86]. The PGD attack leverages the Projected Gradient Descent (PGD) algorithm to search for optimal structural perturbations to attack GNNs.
- Min-max attack (MMA) [86]. The min-max attack is a type of untargeted white-box GNN attack. The attack problem is formulated as a min-max problem, where the inner maximization is designed to update the model's parameters (θ) by maximizing the attack loss and can be solved using gradient ascent. The outer minimization, in turn, can be achieved by using PGD [59].
- Node Embedding Attack-Add (NEAA) [3]. In node embedding attack-add, the attackers modify the original graph structure by adding new edges while adhering to a predefined budget constraint.
- Node Embedding Attack-Remove (NEAR) [3]. In node embedding attack-remove, the attackers modify the original graph structure by removing edges.
- Random Attack-Add (RAA) [48]. The Random Attack-Add approach randomly adds edges to the input graph to fool the targeted GNN model.
- Random Attack-Flip (RAF) [48]. The Random Attack-Flip approach randomly flips edges in the input graph to fool the targeted GNN model.
- Random Attack-Remove (RAR) [48]. The Random Attack-Remove approach randomly removes edges from the input graph to fool the targeted GNN model.
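The structural perturbations behind the random attacks and DICE can be sketched on a plain edge set. This is a simplified illustration under stated assumptions, not the attack implementations used in the experiments: the graph is undirected, each edge is stored as a pair `(u, v)` with `u < v`, the budget is a number of edges, and the DICE sketch splits the budget evenly between removals and insertions, which is an assumption made here for concreteness.

```python
import random
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Edge = Tuple[int, int]  # undirected edge stored as (u, v) with u < v


def absent_edges(edges: Set[Edge], n_nodes: int) -> List[Edge]:
    """All node pairs that are not yet connected."""
    return [e for e in combinations(range(n_nodes), 2) if e not in edges]


def random_attack_add(edges: Set[Edge], n_nodes: int, budget: int,
                      rng: random.Random) -> Set[Edge]:
    """Add `budget` randomly chosen new edges."""
    return edges | set(rng.sample(absent_edges(edges, n_nodes), budget))


def random_attack_remove(edges: Set[Edge], budget: int,
                         rng: random.Random) -> Set[Edge]:
    """Remove `budget` randomly chosen existing edges."""
    return edges - set(rng.sample(sorted(edges), budget))


def random_attack_flip(edges: Set[Edge], n_nodes: int, budget: int,
                       rng: random.Random) -> Set[Edge]:
    """Flip `budget` random node pairs: present edges vanish, absent ones appear."""
    pairs = list(combinations(range(n_nodes), 2))
    return edges ^ set(rng.sample(pairs, budget))


def dice_attack(edges: Set[Edge], labels: Sequence[int], budget: int,
                rng: random.Random) -> Set[Edge]:
    """Remove same-label edges and add different-label edges.
    The even split of the budget is an assumption of this sketch."""
    same = [(u, v) for u, v in sorted(edges) if labels[u] == labels[v]]
    diff = [(u, v) for u, v in absent_edges(edges, len(labels))
            if labels[u] != labels[v]]
    n_remove = min(budget // 2, len(same))
    n_add = min(budget - n_remove, len(diff))
    removed = set(rng.sample(same, n_remove))
    added = set(rng.sample(diff, n_add))
    return (edges - removed) | added


rng = random.Random(0)
graph = {(0, 1), (1, 2), (2, 3)}
node_labels = [0, 0, 1, 1, 0]
with_added = random_attack_add(graph, 5, 2, rng)
with_removed = random_attack_remove(graph, 1, rng)
with_flipped = random_attack_flip(graph, 5, 3, rng)
after_dice = dice_attack(graph, node_labels, 2, rng)
```

The random attacks raise `ValueError` if the budget exceeds the number of available edges or node pairs, because `random.Random.sample` refuses to oversample.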

Evaluation of Mutation Rules (RQ5)
In RQ5, we investigated the contribution of different mutation rules in generating top contributing mutated models. First, for each GNN model, we use the cover metric in XGBoost [15] to evaluate the importance of its mutation features and rank them in descending order of importance score. The cover metric evaluates the importance of mutation features by quantifying the average coverage of each instance by the leaf nodes in a decision tree. Specifically, it calculates the number of times a particular feature is used to split the data across all trees in the ensemble and then sums up the coverage values for each feature over all trees. This coverage value is then normalized by the total number of instances to obtain the average coverage of each instance by the leaf nodes. The importance of a feature is then calculated based on its coverage value, and features with higher coverage values are considered more important.
Upon obtaining the importance of each mutation feature, which corresponds to a specific mutated model, we match each feature to its mutated model and thereby determine the importance of the respective mutated models. Subsequently, we select the top N critical mutated models and identify the specific mutation rules employed in their generation. This enables a comparative analysis of the contributions of the various mutation rules.
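The matching step described above (importance score, then mutated model, then mutation rule) can be sketched as follows. The importance scores are taken as given; in practice they could be read from a trained XGBoost booster, for example via `get_score(importance_type="cover")`. The function name, the mutant identifiers, and the toy scores are hypothetical; only the rule abbreviations (HC, BIA, NOR, SL) come from the study.

```python
from collections import Counter
from typing import Dict, Sequence


def rule_contributions(importance: Dict[str, float],
                       mutant_rules: Dict[str, Sequence[str]],
                       top_n: int) -> Dict[str, float]:
    """Fraction of the top-N mutated models that each mutation rule helped generate."""
    top = sorted(importance, key=importance.get, reverse=True)[:top_n]
    # A mutated model may be generated by more than one rule; count each rule once per model.
    counts = Counter(rule for mutant in top for rule in set(mutant_rules[mutant]))
    return {rule: count / len(top) for rule, count in counts.items()}


# Hypothetical importance scores, one per mutation feature (i.e., per mutated model).
importance = {"m1": 0.50, "m2": 0.30, "m3": 0.10, "m4": 0.05}
# Mutation rules used to generate each mutated model (toy assignment).
mutant_rules = {
    "m1": ["HC", "BIA"],
    "m2": ["HC"],
    "m3": ["NOR"],
    "m4": ["SL"],
}
top2 = rule_contributions(importance, mutant_rules, top_n=2)
```

With this toy input, HC contributes to both of the top-2 mutated models and BIA to one of them, which mirrors how the per-rule percentages in the contribution tables are read.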

Implementation and Configuration
We implemented GraphPrior in Python based on the PyTorch 1.11.0 framework [65]. We also integrated the available implementations of the compared approaches [26,57,80,82] into our experimental pipeline to adapt them to the GNN prioritization problem. Regarding our mutation rules, we set the number of mutated models to 80~240 across different subjects. Balancing the tradeoff between execution time and the effectiveness of GraphPrior is a critical consideration in determining the number of mutants. Building on relevant literature [81], we identified a suitable range of mutants. Our preliminary investigations on multiple subjects demonstrate that these settings maintain the effectiveness of GraphPrior while keeping the runtime within a reasonable range. For subjects with longer mutant generation times, we generate fewer mutants than for other subjects. Additionally, the range was obtained through the full execution of all pre-defined mutation rules. It is worth noting that the total number of mutation rules was predetermined and fixed. Thus, even with the addition of new mutants, the impact on the performance of GraphPrior is minor, as the new mutants are created based on the existing mutation rules.
With regard to the specific mutation rules that change the integer/float training parameters, we define a parameter range close to the original parameter values to achieve slight mutations. We conducted a preliminary study using multiple subjects, demonstrating the effectiveness of such settings. Moreover, to obtain parameter values from the specified range, we adopt uniform sampling [56]. This technique gives each value within the parameter range an equal probability of being selected and has been widely adopted across the ML testing field [56,60,96].
More specifically, we set the hidden channel parameter in the range of [15, 20], the epochs parameter as <= 50, the heads parameter as <= 5, and the negative slope parameter as <= 0.2. For the mutation rules that change Boolean parameters, if the parameter value of the original model is true, then we set it to false. If the original value is false, then we set it to true. The parameter ranges for our mutation rules are carefully selected to ensure the change to the original GNN model is slight.
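The parameter mutation described above can be sketched as a function that copies a training configuration and resamples only the parameters named by the selected mutation rules. The upper bounds and the hidden-channel range follow the text; the lower bounds (1 for epochs and heads, 0.0 for the negative slope), the parameter names, and the `mutate` helper itself are assumptions made for illustration.

```python
import random
from typing import Any, Dict, Sequence

# Upper bounds follow the configuration above; lower bounds are assumptions.
INT_RANGES = {"hidden_channels": (15, 20), "epochs": (1, 50), "heads": (1, 5)}
FLOAT_RANGES = {"negative_slope": (0.0, 0.2)}


def mutate(config: Dict[str, Any], params: Sequence[str],
           rng: random.Random) -> Dict[str, Any]:
    """Return a copy of `config` with the named parameters mutated."""
    mutated = dict(config)
    for name in params:
        value = config[name]
        if isinstance(value, bool):
            # Boolean rule: flip true to false and false to true.
            mutated[name] = not value
        elif name in INT_RANGES:
            low, high = INT_RANGES[name]
            mutated[name] = rng.randint(low, high)   # uniform over integers, inclusive
        elif name in FLOAT_RANGES:
            low, high = FLOAT_RANGES[name]
            mutated[name] = rng.uniform(low, high)   # uniform over the interval
        else:
            raise KeyError(f"no mutation range defined for {name!r}")
    return mutated


original = {"hidden_channels": 16, "epochs": 50, "heads": 4,
            "negative_slope": 0.2, "bias": True}
mutant = mutate(original, ["hidden_channels", "negative_slope", "bias"],
                random.Random(0))
```

Each mutated configuration would then be used to retrain the GNN and obtain one mutated model. Because sampling is uniform over the whole range, a sampled value can coincide with the original one; a stricter variant could resample in that case.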
With respect to the configuration of the ranking models used in GraphPrior, we made several parameter selections: For the random forest, XGBoost, and LightGBM ranking algorithms, we set the n_estimators parameter to 100. For the DNN ranking model, we set the learning_rate parameter to 0.01. Finally, for the logistic regression ranking algorithm, we set the max_iter parameter to 100.
We conducted the following experiments on a high-performance computer cluster, in which each cluster node runs a 2.6 GHz Intel Xeon Gold 6132 CPU with an NVIDIA Tesla V100 16 GB SXM2 GPU. For data processing, we conducted the corresponding experiments on a MacBook Pro laptop with macOS Big Sur 11.6, an Intel Core i9 CPU, and 64 GB RAM.

Measurements
Following the existing study [26], we leverage the Average Percentage of Fault-Detection (APFD) [92] to evaluate the prioritization effectiveness of GraphPrior and the compared approaches. APFD is a standard metric for prioritization evaluation. Typically, higher APFD values indicate faster misclassification detection rates. We calculate the APFD values by Equation (5):

$$APFD = 1 - \frac{\sum_{i=1}^{k} o_i}{kn} + \frac{1}{2n} \tag{5}$$

where $n$ is the number of test inputs in the prioritized test set, $k$ is the number of misclassified test inputs, and $o_i$ is the index of the $i$-th misclassified test input in the prioritized list. When $\sum_{i=1}^{k} o_i$ is small (i.e., the total index sum of the misclassified tests within the prioritized list is small), indicating that the misclassified tests are prioritized higher, the APFD will be large according to Equation (5). Therefore, a large APFD indicates better prioritization effectiveness. Following the existing study [26], we normalize the APFD values to [0,1]. We consider a prioritization approach better when its APFD value is closer to 1. We present the comparison results in tables.
For more detailed analysis, we use PFD (Percentage of Fault Detected) [26] to evaluate the fault detection rate of each approach on different ratios of prioritized test inputs. High PFD values indicate high effectiveness in detecting misclassified test inputs.

$$PFD = \frac{F_c}{F_t}$$

where $F_c$ is the number of faults (i.e., misclassified test inputs) correctly detected and $F_t$ is the total number of faults. More specifically, we evaluate the fault detection rate of GraphPrior against different ratios of prioritized tests. We use PFD-n to denote the PFD value on the first n% of prioritized test inputs.
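Both measurements can be computed from a single list of Boolean flags that states, for each position of the prioritized test set, whether the test input at that position is misclassified. The sketch below uses the standard APFD definition and the ratio of detected to total faults for PFD; the function names are hypothetical, and the normalization of APFD to [0,1] mentioned above is not reproduced here.

```python
from typing import Sequence


def apfd(flags: Sequence[bool]) -> float:
    """APFD of a prioritized list; flags[i] is True if the test at rank i+1 is misclassified."""
    n = len(flags)
    positions = [rank for rank, is_fault in enumerate(flags, start=1) if is_fault]
    k = len(positions)
    if n == 0 or k == 0:
        raise ValueError("APFD needs a non-empty list with at least one fault")
    return 1.0 - sum(positions) / (k * n) + 1.0 / (2 * n)


def pfd(flags: Sequence[bool], percent: int) -> float:
    """PFD-n: share of all faults found within the first `percent`% of the list."""
    total_faults = sum(flags)
    if total_faults == 0:
        raise ValueError("PFD needs at least one fault")
    cutoff = len(flags) * percent // 100
    return sum(flags[:cutoff]) / total_faults


best_order = [True, True, False, False]    # both faults ranked first
worst_order = [False, False, True, True]   # both faults ranked last

apfd_best = apfd(best_order)    # 1 - 3/8 + 1/8 = 0.75
apfd_worst = apfd(worst_order)  # 1 - 7/8 + 1/8 = 0.25
pfd_best_50 = pfd(best_order, 50)
pfd_worst_50 = pfd(worst_order, 50)
```

The two toy orderings contain the same tests, so the difference in APFD (0.75 versus 0.25) comes purely from where the misclassified inputs are placed.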

RESULTS AND ANALYSIS

RQ1: Effectiveness of the Killing-based GraphPrior Approach (KMGP)
Objectives: We investigate the effectiveness of the killing-based GraphPrior approach, KMGP (cf. Section 3.3), comparing it with existing approaches that can be used to identify possibly misclassified test inputs.
Experimental design: We used 16 pairs of datasets and GNN models as subjects to evaluate the effectiveness of GraphPrior. Table 1 exhibits their basic information. We carefully selected seven compared approaches (i.e., DeepGini, least confidence, margin, Vanilla SM, PCS, entropy, and random selection), which can be adapted for GNN test prioritization. Random selection is considered the baseline. We adopt two metrics to measure the effectiveness of GraphPrior and the compared approaches: Average Percentage of Fault-Detection (APFD) and PFD, which are explained in Section 4.7.
Due to the randomness of the training process of a GNN model, we conduct a statistical analysis by repeating all the experiments 10 times. More specifically, for each subject (a dataset with a GNN model), 10 different GNN models are generated through separate training processes.
Results: The GraphPrior approach KMGP outperforms all the compared approaches (i.e., DeepGini, Least Confidence, Margin, Vanilla SM, PCS, Entropy, and Random) in GNN test prioritization. Table 2 presents the comparison results of KMGP and the compared approaches using the APFD metric. We highlight the approach with the highest effectiveness for each case in grey. The results demonstrate that KMGP outperforms the other approaches in the majority of cases, specifically, in 87.5% (14 out of 16) of the subjects. Vanilla SM, in contrast, performs the best in only 12.5% of cases. Additionally, the average APFD value achieved by KMGP is 0.748, which is higher than that of the compared techniques, with improvements of 4.76%~49.6%. These results suggest that KMGP offers a promising solution for prioritizing GNN test inputs.
Table 3 exhibits the comparison results among the test prioritization techniques with respect to PFD. We highlight the approach with the highest effectiveness for each case in grey. The findings indicate that, for 68.75% (11 out of 16) of the subjects, KMGP performs best when prioritizing less than 50% of tests. Furthermore, for a majority of the subjects, specifically, 87.5% (14 out of 16), KMGP exhibits the best performance when prioritizing less than 30% of tests. Table 4 exhibits the overall comparison results in terms of PFD. We can see that when prioritizing 10%~30% of the test inputs, the average effectiveness of KMGP exceeds that of the compared approaches in 100% of cases. When prioritizing 10%~50% of the test inputs, the average effectiveness of KMGP exceeds that of the compared approaches in 90% of cases. Figure 2 plots the ratio of detected misclassified tests against the prioritized tests. We see that GraphPrior achieves a higher APFD value in comparison to DeepGini, entropy, least confidence, margin, Vanilla SM, PCS, and random. These results confirm the effectiveness of KMGP in GNN test input prioritization.
To demonstrate the stability of our findings, a statistical analysis is performed. Specifically, all the experiments are repeated 10 times for each subject, resulting in 10 distinct GNN model instances obtained through separate training processes for a given original GNN model. Based on the statistical analysis of the resulting data, the p-value was found to be lower than $10^{-5}$, indicating that the KMGP approach consistently outperforms the compared approaches in terms of test prioritization.

RQ2: Effectiveness of the Feature-based GraphPrior Approaches
Objectives: We investigate the effectiveness of the feature-based approaches in GraphPrior, including XGGP, LRGP, RFGP, LGGP, and DNGP, compared with the killing-based approach KMGP.
Experimental design: We compared the feature-based GraphPrior approaches with the killing-based approach KMGP on 16 subjects (four graph datasets × four GNN models). Due to the randomness of the training process of a GNN model, we repeat all the experiments 10 times and calculate the average results. For each subject (a dataset with a GNN model), 10 different GNN models are generated through separate training processes. For evaluation, we calculated the APFD values of all the approaches on each subject, which reflect the misclassification detection rate. Moreover, we calculated the PFD values of all the approaches on different ratios of prioritized tests to further investigate the effectiveness of feature-based approaches.

Results:
The experimental results of this research question are exhibited in Tables 5, 6, and 7. Among all the GraphPrior approaches, RFGP demonstrates the highest level of effectiveness in most cases. Table 5 exhibits the comparison results among KMGP (i.e., the killing-based GraphPrior approach) and the feature-based GraphPrior approaches in terms of APFD. The results demonstrate that RFGP outperforms the other GraphPrior approaches on average. Moreover, the average APFD value of RFGP exceeds that of KMGP by around 0.02. Additionally, across different subjects, RFGP outperforms the other GraphPrior approaches in the majority of cases. To provide a more detailed analysis, Tables 6 and 7 exhibit the comparison results of all GraphPrior approaches in terms of PFD. The findings also confirm that RFGP is the most effective GraphPrior approach. Furthermore, Table 7 indicates that, on average, RFGP is consistently more effective than the other GraphPrior approaches across different test prioritization ratios. Figure 3 presents some examples that give a more visually intuitive understanding of the performance of the various GraphPrior approaches. Collectively, these results suggest that RFGP is the most effective GraphPrior approach for the evaluated datasets.
Additionally, although the killing-based GraphPrior approach, KMGP, shows good effectiveness on some specific datasets, its average effectiveness is lower than that of several feature-based GraphPrior approaches, such as RFGP, LGGP, and XGGP. This result suggests that KMGP is less stable than some feature-based approaches. For example, in Figure 3(b), we can see that KMGP (represented by the red line) is less effective than the other GraphPrior approaches. In fact, the main difference between KMGP and the feature-based GraphPrior approaches lies in their strategy for using mutation results. Specifically, KMGP treats all mutated models as having equal importance, whereas feature-based GraphPrior approaches, such as RFGP, employ ranking models to assign higher weights to the more important mutated models, thereby making better use of mutation results for test prioritization. The superior performance of RFGP indicates that the random forest algorithm it uses can effectively identify important mutated models and assign them high weights.

The efficiency of GraphPrior (all six approaches) is acceptable. Table 8 illustrates the efficiency of GraphPrior in comparison with other approaches. The time cost of GraphPrior can be decomposed into three phases, namely, mutant generation, training, and execution. Mutant generation involves the production of mutated models by retraining the original GNN model. The training time represents the average duration needed for training a ranking model. Finally, execution time denotes the average duration expended on test prioritization. By decomposing the time cost into these distinct phases, we provide a more detailed understanding of the efficiency of GraphPrior in contrast to other approaches. As evident from Table 8, the average execution time of GraphPrior for test prioritization is 40 seconds, with the most time-consuming phase being mutant generation, which takes around 35 minutes. In contrast, the average execution time of the compared approaches is less than one second. Although GraphPrior is not as efficient as the compared approaches, it provides a viable alternative to costly and time-consuming manual labeling, and its total time cost remains acceptable in real-world scenarios.

Answer to RQ2: Among all the GraphPrior approaches, RFGP demonstrates the highest level of effectiveness in most cases. The efficiency of GraphPrior (all six approaches) is acceptable.

inputs in the dataset. For example, 0.4 means that 40% of the tests in the dataset are adversarial tests. We select these attack levels because a high attack level (e.g., 80%) would produce a substantial proportion of adversarial test inputs. Under such circumstances, any prioritization method could select a greater number of bug cases, thereby affecting the evaluation of GraphPrior. Therefore, we carefully selected a range of attack levels that are not unduly high for the evaluation of GraphPrior. In this research question, we evaluate GraphPrior and the compared approaches on 432 subjects in total.
Results: GraphPrior outperforms all the compared approaches on the adversarial test inputs generated from different attack levels. More specifically, Table 12 presents the effectiveness of GraphPrior and the compared approaches under the attacks DICE, MMA, RAA, and RAR, with the attack level ranging from 0.1 to 0.4. In this research question, we apply eight adversarial attacks in total. The remaining experimental results (i.e., results of the other four adversarial attacks) are presented on our GitHub. The experimental results presented in Table 12 demonstrate that GraphPrior, consisting of DNGP, KMGP, LGGP, LRGP, RFGP, and XGGP, outperforms all the compared approaches across different levels of the adversarial attacks.
Table 13 presents the overall comparison results among GraphPrior and the compared approaches across eight adversarial attacks with different attack levels. Specifically, we evaluate the effectiveness of each test prioritization approach in terms of the number of cases where it performed the best, as well as its average PFD values across different attack levels. For example, "All-0.1" refers to the overall results of each approach under all the adversarial attacks with an attack level of 0.1. Table 13 demonstrates that GraphPrior outperforms all compared approaches, achieving the best effectiveness in 99.94% of the tested cases. Only one best case is achieved by the compared approach margin. Furthermore, GraphPrior approaches such as RFGP and KMGP consistently exhibit the largest average PFD values across different attack levels.
Among all the GraphPrior approaches, RFGP and KMGP exhibit superior performance across different attack levels in comparison to other GraphPrior approaches. In Table 12, we see that, across the attack levels from 0.1 to 0.4, RFGP performs the best in the largest number of cases, followed by KMGP. For example, when the attack level is 0.1, RFGP performs the

Additionally, our experimental results, as illustrated in Table 13, reveal that RFGP exhibits the largest average PFD values when compared to the other evaluated approaches across varying attack levels. Specifically, when 40% of the test inputs are prioritized, RFGP achieves a PFD value ranging from 0.832 to 0.836, which indicates the ability to detect more than 80% of misclassified tests.

Answer to RQ4:
GraphPrior outperforms all the compared approaches on the adversarial test inputs generated from different attack levels. Among all the GraphPrior approaches, RFGP and KMGP exhibit superior performance across different attack levels in comparison to other GraphPrior approaches.

RQ5: Contribution Analysis of Different Mutation Rules
Objectives: For each evaluated GNN model, we investigate which mutation rules generate more top contributing mutated models for test prioritization.
Experimental design: In our study, we employed one or more mutation rules to generate a mutated model. Each mutated model corresponds to one mutation feature. Thus, to evaluate the importance of different mutation rules, we first evaluate the importance of the mutation features. We adopted the cover metric of the XGBoost algorithm to identify the importance of each mutation feature for ranking models. A detailed account of this approach is presented in Section 4.5. After computing the importance scores of all the mutation features, we selected the top-N important features for each subject and subsequently identified the top-N mutated models. We then identified the mutation rules used to generate each mutated model and compared the contributions of the mutation rules accordingly. Additionally, for different subjects in this research question, we generate 80~240 mutated models.
Results: The mutation rule HC made high contributions to the effectiveness of GraphPrior on all four types of GNN models. Tables 14 to 17 illustrate the contributions of different mutation rules to the effectiveness of GraphPrior on different GNN models (i.e., GCN, GAT, GraphSAGE, and TAGCN). For each GNN model, we identify the top-N mutated models that made top contributions to the effectiveness of GraphPrior. The corresponding mutation rules applied to generate each mutated model are highlighted in grey. Table 14 presents the contributions of the top-N mutated models to the effectiveness of GraphPrior for the GCN model. Notably, the mutation rules BIA and HC contributed to 100% of the top contributing mutated models, while SL, NOR, CA, and IMP contributed to a lower percentage of them. We conclude that, for the GCN model, the mutation rules BIA and HC were the most effective in generating the top important mutated models. Moving to GAT, GraphSAGE, and TAGCN, whose results are presented in Tables 15, 16, and 17, the mutation rule HC also generates a large ratio (i.e., 100%, 90%, and 90%, respectively) of top contributing mutated models. We can conclude that, across the four different types of GNN models, HC consistently makes top contributions to the effectiveness of GraphPrior.
Some mutation rules, such as NOR and BIA, made high contributions to the effectiveness of GraphPrior on some specific GNN models, generating a considerable ratio (i.e., from 50% to 100%) of top-critical mutated models. For example, on GCN and GraphSAGE, BIA contributed to 100% of the top-N mutated models. On TAGCN, NOR contributed to 100% of the top-N mutated models.
Answer to RQ5: The mutation rule HC made high contributions to the effectiveness of GraphPrior on all four types of GNN models. Some mutation rules, such as NOR and BIA, made high contributions to the effectiveness of GraphPrior on some specific GNN models.

RQ6: Enhancing GNNs with GraphPrior
Objectives: We investigate whether GraphPrior and the uncertainty-based metrics can select informative retraining subsets to improve the performance of a GNN model.
Experimental design: Following the prior research by Ma et al. [57], our retraining experiments are structured as follows: First, we randomly partitioned the dataset into three sets: an initial training set, a candidate set, and a test set, with a ratio of 4:4:2. The candidate set was reserved exclusively for retraining purposes, while the test set was kept untouched for the purpose of evaluation. In the first round, we trained a GNN model using only the initial training set and computed its accuracy on the test set. We employed the best model obtained over the training epochs for the subsequent retraining process. In the second round, we incorporate an additional 10% of new inputs from the candidate set into the existing training set without replacement. The inputs selected for inclusion are those that are prioritized in the first 10% by the test prioritization approaches, namely, GraphPrior and the compared techniques. Following Ma et al. [57], we retrain the GNN models on the complete augmented training set. This approach ensures that the old and new training data are treated equally. We repeat the retraining process for multiple rounds until the candidate set is empty. We kept the test data untouched during the retraining process. Moreover, to account for the randomness involved in the model training process, we repeat all the experiments 10 times and report the average results.
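The retraining protocol above can be outlined as a short loop. This is a sketch under stated assumptions rather than the experimental pipeline itself: `fit`, `evaluate`, and `prioritize` are hypothetical callbacks standing in for GNN training, test-set accuracy, and a prioritization approach; each round moves 10% of the original candidate set; and the remaining candidates are re-ranked with the current model in every round, which the description above does not specify.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_dataset(ids: Sequence[int], rng: random.Random,
                  ratios: Tuple[int, int, int] = (4, 4, 2)
                  ) -> Tuple[List[int], List[int], List[int]]:
    """Randomly partition ids into training, candidate, and test sets (4:4:2 by default)."""
    shuffled = list(ids)
    rng.shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_cand = len(shuffled) * ratios[1] // total
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_cand],
            shuffled[n_train + n_cand:])


def retraining_rounds(train: Sequence[int], candidates: Sequence[int],
                      test: Sequence[int],
                      fit: Callable, evaluate: Callable, prioritize: Callable,
                      step: float = 0.10) -> List[float]:
    """Retrain round by round until the candidate set is empty; return test accuracies."""
    train, remaining = list(train), list(candidates)
    batch = max(1, round(len(remaining) * step))
    model = fit(train)                        # first round: initial training set only
    accuracies = [evaluate(model, test)]      # the test set is never used for training
    while remaining:
        chosen = prioritize(model, remaining)[:batch]
        if not chosen:
            break
        chosen_set = set(chosen)
        train.extend(chosen)                  # old and new data are treated equally
        remaining = [c for c in remaining if c not in chosen_set]
        model = fit(train)                    # retrain on the full augmented set
        accuracies.append(evaluate(model, test))
    return accuracies


# Dummy callbacks so the sketch runs end to end: the "model" is just the training-set size.
train_ids, cand_ids, test_ids = split_dataset(range(100), random.Random(0))
history = retraining_rounds(
    train_ids, cand_ids, test_ids,
    fit=lambda data: len(data),
    evaluate=lambda model, data: float(model),
    prioritize=lambda model, pool: sorted(pool),
)
```

With 100 inputs the split is 40/40/20, so the loop runs ten retraining rounds after the initial one; replacing the dummy callbacks with real training and scoring functions would reproduce the shape of the experiment.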
Results: Table 18 presents the average accuracy of GNN models after retraining with 10% to 100% prioritized test inputs. For each case, we highlight the approach with the highest effectiveness in grey to facilitate quick and easy interpretation of the results.
GraphPrior and the uncertainty-based test prioritization approaches outperform the random selection approach. However, the observed improvement is relatively small, indicating that GNN test prioritization approaches can guide the retraining of GNN models but with limited effect. In Table 18, we observe that test prioritization methods, including GraphPrior and the compared approaches, consistently perform better than random selection across varying ratios of added data. Furthermore, when more than 10% of the prioritized tests are incorporated, a large majority of the test prioritization methods, specifically 83.3% (10 out of 12), outperform random selection in each case. However, the improvements achieved by these test prioritization methods over random selection are relatively small, with the highest increase being only 0.014. Additionally, Figure 4 depicts an example outcome of the retraining experiments conducted on the Cora dataset using the GCN model, comparing the performance of the test prioritization approaches against random selection (indicated by the black line). As observed from the results, the test prioritization approaches perform better than random selection, but the improvement is visually slight.
One reason for the limited effectiveness of GraphPrior and the uncertainty-based test prioritization approaches lies in their inadequate consideration of node importance (i.e., impact on other nodes in the dataset). In a GNN dataset, the complex interdependence among test inputs and their neighbors can lead to them having different importance. For example, nodes with greater connectivity can affect more of the other nodes, making them relatively more critical. However, the current test prioritization approaches focus only on the ability of test inputs to reveal system bugs, without regard to the importance of nodes. Although the test inputs they select can have a higher likelihood of misclassification, their importance within the dataset can be minor if they have a very small number of neighbors. Retraining with such inputs would have less effect. Consequently, it is crucial to consider node importance in the selection of retraining data to achieve more effective outcomes.
GraphPrior achieved better effectiveness than the uncertainty-based test prioritization methods. In Table 18, we see that, when 20% or more of the test cases are added for retraining, the GraphPrior approaches perform the best in 100% of cases. Figure 4 visually demonstrates that the GraphPrior approaches (solid line) perform better than the compared approaches (dotted line) in most cases.
Answer to RQ6: GraphPrior and the uncertainty-based test prioritization approaches outperform the random selection approach. However, the observed improvement is relatively small, indicating that GNN test prioritization approaches can guide the retraining of GNN models but with limited effect. GraphPrior achieved better effectiveness than the uncertainty-based test prioritization methods.

Generality of GraphPrior
Although the confidence-based test prioritization approaches demonstrate excellent effectiveness on traditional DNNs, they do not consider the interdependencies between test inputs, which are particularly crucial in GNN test prioritization. Our proposed GraphPrior leverages the mutation analysis of GNN models to perform GNN test input prioritization, which has been demonstrated effective on node classification tasks through 604 carefully designed subjects. In fact, the scheme of GraphPrior (i.e., modifying training parameters to mutate the GNN model for test prioritization) can also be generalized to other levels of GNN tasks, including graph-level and edge-level tasks. In the future, we will further verify the extension of GraphPrior from this perspective.
[The applicability of GraphPrior on regression tasks]. We also discuss the potential applicability of GraphPrior to regression tasks. Currently, the mutation rules and ranking models of GraphPrior are specifically designed for classification tasks. To extend GraphPrior to regression tasks, modifications to the mutation rules and ranking models would be required. If appropriate mutation rules can be identified for regression tasks and suitable ranking models can be designed, then GraphPrior could also be applied to regression tasks.

Limitations of GraphPrior
[Diversity of the prioritized data]. One limitation of GraphPrior is that it does not guarantee the diversity of the selected bug-revealing data. This limitation is also noted in prior work on the uncertainty-based test prioritization approaches [26], which did not consider the diversity of bugs when prioritizing test inputs. Similarly, GraphPrior does not aim for diversity in the prioritized tests. However, GraphPrior has demonstrated the ability to identify a significant majority of misclassified test inputs using a small ratio of prioritized test cases. Specifically, RFGP (i.e., the most effective GraphPrior approach) has been shown to detect over 80% of the misclassified tests by prioritizing only 40% of the test inputs. While prioritizing diverse bugs can improve the overall quality of testing, efficiently identifying a large proportion of bugs with a small set of prioritized tests, even without explicitly ensuring bug diversity, remains a practical strategy in scenarios where time and resources are constrained.
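A figure such as "over 80% of the misclassified tests within the first 40% of the ranking" is a PFD-style measurement. The sketch below shows how such a number, and the APFD metric used throughout the evaluation, can be computed from a ranked list. It assumes the usual definitions (PFD as the fraction of all misclassified tests found in the prioritized prefix, and APFD = 1 - (sum of misclassified-test positions)/(k·n) + 1/(2n)); the function names and the example ranking are illustrative only.

```python
def pfd(ranked_is_misclassified, ratio):
    """Fraction of all misclassified tests that appear in the top `ratio` of the ranking."""
    n = len(ranked_is_misclassified)
    total = sum(ranked_is_misclassified)
    if total == 0:
        return 0.0
    cutoff = round(n * ratio)
    return sum(ranked_is_misclassified[:cutoff]) / total

def apfd(ranked_is_misclassified):
    """Average Percentage of Fault Detection for a ranking (1 = misclassified, 0 = correct)."""
    n = len(ranked_is_misclassified)
    # 1-based positions of the misclassified tests in the prioritized order
    positions = [i + 1 for i, bad in enumerate(ranked_is_misclassified) if bad]
    k = len(positions)
    if k == 0:
        raise ValueError("APFD is undefined when no test is misclassified")
    return 1 - sum(positions) / (k * n) + 1 / (2 * n)

# Illustrative ranking of 10 tests, 4 of which are misclassified.
example = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(pfd(example, 0.4), apfd(example))
```

Neither metric rewards diversity: two rankings that surface the same number of misclassified tests in the prefix score identically, regardless of how similar those tests are.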
[GraphPrior in active learning scenarios]. Active learning [68] operates under the assumption that samples within a dataset have varying contributions to the improvement of the current model and aims to select the most informative samples for inclusion in the training set. Our investigation in RQ6 has demonstrated that GraphPrior and uncertainty-based metrics can be utilized to select informative retraining tests. However, the effectiveness of these approaches is limited. Specifically, despite the demonstrated success of uncertainty-based metrics such as DeepGini and margin in previous studies [26, 36] on DNNs, their effectiveness in the context of GNNs is slight. We explore potential reasons for this phenomenon.
One crucial reason for their limited effectiveness lies in their inadequate consideration of node importance, i.e., the impact that a node has on other nodes in the graph dataset. In a GNN dataset, the complex interdependence among test inputs and their neighbors can result in differing levels of importance for different nodes. For instance, nodes with higher connectivity can be more influential and hence more critical. However, current test prioritization approaches focus only on the ability of test inputs to expose system bugs without taking node importance into account. Although these approaches may identify inputs with a higher likelihood of misclassification, the importance of these inputs within the dataset may be negligible if they have only a few neighbors. Retraining with such inputs is, therefore, less effective.
Furthermore, we elaborate on the difference between GraphPrior and the existing active learning methods evaluated in our study. The active learning methods used for comparison in our article are primarily uncertainty-based and are aimed at datasets where each sample is independent of the others. For graph datasets, however, these methods select retraining data without considering the interdependencies between nodes and also neglect the importance of nodes, merely selecting possibly misclassified nodes. In contrast, GraphPrior employs mutation analysis to identify test inputs that are more likely to be misclassified while considering the interdependencies between nodes during the mutation process. Despite this added consideration, GraphPrior's goal remains to select misclassified test inputs, and it does not explicitly consider node importance, leading to slight effectiveness similar to that of the uncertainty-based methods.
[Generating mutants for large-scale GNN models]. In our experiments, which are based on our current models and datasets, the time cost of retraining (for generating mutants) is within an acceptable range. When dealing with large-scale GNN models, GraphPrior can require substantial computational resources, but it remains feasible in situations where the cost of manual labeling outweighs the computational cost.

Threats to Validity
Threats to Internal Validity. The internal threats to validity mainly lie in the implementation of our proposed GraphPrior and the compared approaches. To reduce the threat, we implemented GraphPrior based on the widely used library PyTorch and adopted the implementations of the compared approaches published by their authors. Another internal threat lies in the randomness of the model training. To mitigate this threat and ensure the stability of our experimental results, we conducted a statistical analysis. Specifically, we repeated the training process 10 times for both the original model and the mutated model and calculated the statistical significance of the experiments.
The selection of mutation rules in our study presents another internal threat to validity. Despite our best efforts to collect a comprehensive set of mutation rules, it is possible that other training parameters beyond our current knowledge could serve as mutation rules. To mitigate this threat, we selected mutation rules that can directly or indirectly affect node interdependence in the prediction process. The selection of parameter ranges for mutation rules is another internal threat that could affect the effectiveness of the rules. To mitigate this threat, we adopted a strategy in which we inverted the values of Boolean parameters, setting true to false and false to true. For integer and float parameters, we selected a range that introduces only slight changes to the original GNN model. Our experimental results demonstrated the effectiveness of GraphPrior, indicating that the mutation rules and selected parameter ranges are suitable for GNN test prioritization.
Threats to External Validity. The external threats to validity mainly lie in the GNN models under test and the testing datasets we used in our study. To mitigate this threat, we adopted a large number of subjects (pairs of model and dataset) in our study and leveraged different types of test inputs. We applied eight graph adversarial attacks from public studies to generate adversarial test inputs and varied the attack level for a more detailed evaluation. In the future, we will apply GraphPrior to more GNN models and more diverse test datasets.
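The parameter-range strategy described under internal validity (invert Boolean parameters, perturb integer and float parameters only slightly) can be sketched as follows. This is a hedged illustration, not the authors' code: the parameter names in the usage note, the one-parameter-per-mutant choice, the unit step for integers, and the 10% relative range for floats are all assumptions made for the example.

```python
import random

def mutate_param(params, name, rel_change=0.1, rng=None):
    """Return a copy of `params` in which only `name` is mutated.

    Booleans are inverted; integers move by one step; floats are scaled by a
    small random factor within +/- rel_change (an assumed "slight change").
    """
    rng = rng or random.Random(0)
    value = params[name]
    mutated = dict(params)  # the original configuration is left untouched
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        mutated[name] = not value
    elif isinstance(value, int):
        # keep counts such as layer widths or attention heads at >= 1
        step = rng.choice([-1, 1]) if value > 1 else 1
        mutated[name] = value + step
    elif isinstance(value, float):
        mutated[name] = value * (1 + rng.uniform(-rel_change, rel_change))
    else:
        raise TypeError(f"unsupported parameter type for {name!r}")
    return mutated
```

For example, `mutate_param({"add_self_loops": True, "dropout": 0.5}, "add_self_loops")` yields a configuration with `add_self_loops` set to `False`; a mutated model would then be obtained by retraining the GNN with that configuration.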

RELATED WORK
We present the related work in four aspects: test prioritization techniques, deep neural network testing, mutation testing for DNNs, and mutation-based test prioritization for traditional software.

Test Prioritization Techniques
In traditional software testing, test prioritization [11-13, 19, 20, 33, 69, 92] aims to find the ideal order of test cases to reveal system bugs earlier. Prioritizing test cases addresses two critical constraints of software testing, time and budget, by detecting more fault-revealing test cases in a limited time. Di Nardo et al. [19] conducted a case study of coverage-based prioritization strategies on real-world regression faults, evaluating the effectiveness of several test case prioritization techniques in bug detection. Rothermel et al. [69] presented and compared three types of test case prioritization techniques for regression testing that are based on test execution information. They demonstrated that each of the studied prioritization techniques increased the fault detection rate of the test suite. Henard et al. [33] conducted a comprehensive study to compare existing test prioritization approaches, finding that the differences between white-box [23, 24, 49, 92] and black-box strategies [32, 34, 46] are small. Chen et al. [13] proposed LET to prioritize test programs for compiler testing acceleration and demonstrated its effectiveness. LET works through two processes: the learning process, which identifies program features and predicts the bug-revealing probability of a new test program, and the scheduling process, which prioritizes test programs based on their bug-revealing probabilities. Chen et al. [11] proposed to prioritize test programs based on the prediction information of the test coverage for compilers.
In terms of test prioritization for DNNs, Feng et al. [26] proposed the state-of-the-art approach, DeepGini, which identifies possibly misclassified tests based on model uncertainty. DeepGini assumes a test is more likely to be mispredicted if the DNN outputs similar probabilities for each class. Byun et al. [6] evaluated several metrics that prioritize bug-revealing inputs based on the white-box measures of DNN's sentiment, including softmax confidence (i.e., predicted probability for output categories in DNNs that use softmax output layers), Bayesian uncertainty (i.e., the uncertainty of the prediction probability distributions for Bayesian Neural Networks), and input surprise (i.e., the distance of the neuron activation pattern between a test input and the training data). Wang et al. [81] proposed PRIMA to prioritize test inputs for DNNs via intelligent mutation analysis. PRIMA further improves DNN test prioritization in two main aspects. First, PRIMA can be applied not only to classification models but also to regression models. Second, PRIMA can deal with the case in which test inputs are generated from adversarial input generation approaches [8] that can make the probability of the wrong class larger. Furthermore, some data selection approaches [80] have also been proposed to detect possibly misclassified tests for DNNs. Despite its effectiveness in DNN test prioritization, the PRIMA approach cannot be directly applied to GNNs. This is because PRIMA's mutation operators are not adapted to graph-structured data and GNN models.
More specifically, GNN models operate on graph-structured data, where nodes and edges represent entities and their relationships. Conversely, the input mutation rules of PRIMA were designed for independent test samples, rendering them unsuitable for GNNs. Moreover, GNNs incorporate unique graph operations and aggregation mechanisms, including graph convolution operations and message passing mechanisms. PRIMA's model mutation rules are not applicable to the graph-level mechanisms employed by GNNs. As such, GNNs require specialized test prioritization techniques, such as GraphPrior, which leverages the properties of GNN models in its mutation analysis for test prioritization. To address the limitations of PRIMA, GraphPrior introduces mutation rules that are designed based on the graph operations and aggregation mechanisms of GNNs. These rules can directly or indirectly impact message passing. Consequently, GraphPrior enables prioritizing tests for graph-structured data.
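To make the uncertainty-based baseline concrete, the following sketch scores each test by the Gini impurity of its predicted class probabilities, which is the idea attributed to DeepGini above: the more similar the probabilities, the higher the score. The function names and the example probability vectors are illustrative assumptions, and each input is scored independently, which is exactly the property the text identifies as a poor fit for graph data.

```python
def deepgini(probs):
    """Gini impurity of a softmax output: 1 - sum(p_i^2). Higher = more uncertain."""
    return 1.0 - sum(p * p for p in probs)

def prioritize(prob_matrix, score=deepgini):
    """Return test indices ordered from most to least likely to be misclassified."""
    return sorted(range(len(prob_matrix)),
                  key=lambda i: score(prob_matrix[i]),
                  reverse=True)

# Three illustrative test inputs over three classes.
probs = [
    [0.90, 0.05, 0.05],  # confident prediction  -> low impurity
    [0.34, 0.33, 0.33],  # near-uniform output   -> high impurity
    [0.60, 0.30, 0.10],
]
print(prioritize(probs))
```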

Deep Neural Network Testing
Besides test input prioritization, some test selection approaches have also been proposed to improve the efficiency of DNN testing. Test selection aims to precisely estimate the accuracy of the whole set by labeling only the set of selected test inputs. In this way, the labeling cost for DNN testing is reduced. Li et al. [50] proposed CES (Cross Entropy-based Sampling) and CSS (Confidence-based Stratified Sampling) to select a small group of representative test inputs to estimate the accuracy of the whole testing set. CES minimizes the cross-entropy between the selected set and the entire test set to ensure that the distribution of the selected test set is similar to that of the original test set. CSS leverages the confidence features of test inputs to guarantee the similarity between the selected test set and the entire test set. Chen et al. [14] proposed PACE (Practical Accuracy Estimation), which selects test inputs practically based on clustering, prototype selection, and adaptive random testing. PACE first clusters all the test inputs into different groups based on their testing capabilities. Then, PACE utilizes the MMD-critic algorithm [43] to select prototypes from each group. For test inputs not in any group, PACE leverages adaptive random testing to select tests from them. Compared to the aforementioned research, our work focuses on test prioritization, which ranks all the test inputs without discarding any test input. In this way, testers or developers can find the test inputs that reveal bugs earlier.
In addition to improving the efficiency of DNN testing, several existing studies [37, 44, 54-56, 66] have focused on measuring the adequacy of DNNs. Pei et al. [66] proposed the metric of neuron coverage to evaluate how adequately a test set covers the logic of a DNN model. Based on this metric, they proposed a white-box framework for testing DNNs. In a follow-up study, Ma et al. [55] proposed DeepGauge, a set of DNN testing coverage criteria to measure the test adequacy of DNNs. DeepGauge also considers neuron coverage to be a good indicator of the effectiveness of a test input. Based on the basic neuron coverage metric, they proposed new metrics with different granularities to differentiate adversarial attacks from legitimate test data. Kim et al. [44] proposed surprise adequacy for testing DL models, which identifies how effective a test input is by measuring its surprise with respect to the training set. More specifically, the surprise of a test input refers to the difference in the activation values of neurons in the face of this new test.
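As a rough illustration of the basic neuron coverage idea, the sketch below computes the fraction of neurons whose activation exceeds a threshold for at least one test input. It is a simplification under stated assumptions: activations are given as plain lists already collected from the model (one list per test input), and the per-layer scaling that a real coverage tool would apply before thresholding is omitted.

```python
def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons activated above `threshold` by at least one test input.

    activations: list over test inputs; each entry is a list with one
    activation value per neuron (assumed to be pre-collected from the model).
    """
    if not activations:
        return 0.0
    n_neurons = len(activations[0])
    covered = [False] * n_neurons
    for per_input in activations:
        for j, value in enumerate(per_input):
            if value > threshold:
                covered[j] = True
    return sum(covered) / n_neurons

# Two test inputs, four neurons.
acts = [[0.2, 0.0, 0.0, 0.9],
        [0.0, 0.0, 0.1, 0.3]]
print(neuron_coverage(acts), neuron_coverage(acts, threshold=0.5))
```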

Mutation Testing for DNNs
Several existing studies have explored the use of mutation testing for DNNs and developed different mutation operators and frameworks. Ma et al. [56] proposed DeepMutation to measure the quality of test data for DL systems based on mutation testing. To this end, they designed a set of source-level and model-level mutation operators to inject faults into the training data, training programs, and DL models. The quality of test data is evaluated by analyzing the extent to which the injected faults can be detected. The work by Ma et al. was later extended into a mutation testing tool for DL systems named DeepMutation++ [37], which proposed a set of new mutation operators for Feed-forward Neural Networks (FNNs) and Recurrent Neural Networks (RNNs) and can dynamically mutate runtime states of an RNN. Humbatova et al. [39] proposed DeepCrime, which is the first mutation testing tool that implements a set of DL mutation operators based on real DL faults. Shen et al. [72] proposed MuNN, a mutation analysis method for neural networks. MuNN defined five mutation operators based on the characteristics of neural networks. The results reveal that mutation analysis has strong domain characteristics, indicating the need for domain mutation operators to enhance the analysis, and that new mutation mechanisms are required for deep neural networks.
The above studies in mutation testing have focused on traditional DNNs, which are typically evaluated on datasets with independent samples. However, the mutation rules employed in these studies do not account for the interdependence among test inputs, which is a crucial factor to consider in the context of GNN testing. In contrast, the mutation rules of GraphPrior are designed to impact the message passing mechanism in the GNN prediction process. In the mutated GNN model, the way nodes acquire information from their neighboring nodes differs slightly from that of the original GNN model. The mutation features generated based on these mutation rules are fed into ranking models to predict the likelihood of a test input being misclassified by the GNN model.

Mutation-based Test Prioritization for Traditional Software
In traditional software testing, mutation testing is a well-established technique to evaluate the quality of test sets. Mutation-based test prioritization focuses on prioritizing test cases based on their ability to detect mutants. The key idea is that test cases that can detect mutants are likely to be more effective at finding real faults in the code and, therefore, should be given higher priority. Several mutation-based approaches [52, 74] have been proposed. Lou et al. [52] proposed a test-case prioritization approach based on the fault detection capability of test cases. They introduced two models to calculate the fault detection capability: the statistics-based model and the probability-based model. Based on the experimental study, they found that the statistics-based model outperforms all the approaches. Shin et al. [74] proposed a test case prioritization technique guided by the diversity-aware mutation adequacy criterion and empirically evaluated the effectiveness of mutation-based prioritization techniques with large-scale developer-written test cases. Papadakis et al. [63] proposed mutating Combinatorial Interaction Testing models and using them to prioritize tests based on their ability to kill mutants, and showed that the number of model-based mutants that are killed yields a strong correlation to code-level faults revealed by the test cases.
The aforementioned DNN-oriented approaches consider each test input independent of the others, while in a graph dataset, there are usually complex connections between test inputs. Our proposed GraphPrior specifically targets GNNs and utilizes several mutation rules to generate GNN mutants for test prioritization. Moreover, to better leverage the mutation results, we adopt several ranking models [5, 42, 83] that can learn to predict the probability of a test input being misclassified.

CONCLUSION
To improve the efficiency of GNN testing, we aim to prioritize possibly misclassified test inputs to reveal GNN bugs earlier. However, a crucial limitation of existing test prioritization approaches is that, when applied to GNNs, they do not take into account the interdependence between test inputs (nodes). In this article, we propose GraphPrior, a set of test prioritization approaches specifically for GNN testing. GraphPrior assumes that a test input is more likely to be misclassified if it can kill many mutated models. Based on this assumption, GraphPrior leverages carefully designed mutation rules to generate mutated models for GNNs. Subsequently, GraphPrior obtains the mutation results of test inputs based on the execution of the mutated models. GraphPrior utilizes the mutation results in two ways, namely, killing-based and feature-based methods. When scoring a test, the killing-based method considers each mutated model equally important, while feature-based methods learn different importance for each mutated model through ranking models. Finally, GraphPrior ranks all the test inputs based on their scores. We conducted an extensive study to evaluate the effectiveness of GraphPrior approaches on 604 subjects, comparing them with existing approaches that could detect possibly misclassified test inputs. The experimental results demonstrate the effectiveness of GraphPrior. In terms of APFD, the killing-based GraphPrior approach, KMGP, exceeds the compared approaches (i.e., DeepGini, margin, Vanilla Softmax, PCS, Entropy, least confidence, and random selection) by 0.034~0.248 on average. Furthermore, RFGP (i.e., a feature-based GraphPrior approach) exhibits better performance than the other GraphPrior approaches. Specifically, RFGP outperforms the uncertainty-based test prioritization approaches against different adversarial attacks, with an average improvement of 2.95%~46.69%.

- Cora [88]. The Cora dataset is a citation graph composed of 2,708 scientific publications (nodes) and 5,429 links (edges) between them. Nodes represent ML papers, and edges represent citations between pairs of papers. Each paper is classified into one of seven classes, such as reinforcement learning and neural networks.
- CiteSeer [88]. The CiteSeer dataset consists of 3,327 scientific publications (nodes) and 4,732 links (edges). Each paper belongs to one of six categories, such as AI and ML.
- PubMed [88]. The PubMed dataset contains 19,717 diabetes-related scientific publications (nodes) and 44,338 links (edges). Publications are classified into three classes according to the type of diabetes they address (e.g., Type 1 and Type 2 diabetes mellitus).
- LastFM Asia Social Network [70]. The LastFM Asia Social Network dataset was collected from the social network of users on the Last.fm music platform in Asia. Nodes are LastFM users, and edges are mutual follower relationships between them. LastFM contains 7,624 nodes and 27,806 edges. The classification task of the LastFM dataset is to predict the home country of a user (e.g., Philippines, Malaysia, Singapore).

Fig. 4. Enhancing the accuracy of the GNN with prioritized tests (Cora with GCN).

❶ Mutated model generation. Given a GNN model $G$ and a test set $T$, during the first stage, the feature-based approaches generate a group of mutated models (denoted as $\{G_1, G_2, \ldots, G_N\}$) of the GNN model $G$ based on the mutation rules specified in Section 3.2.

❷ Mutation feature generation. Subsequently, the feature-based approaches associate a feature vector $V_t$ of size $N$ with each test input $t$, where $N$ represents the number of mutated models, and $v_k$ ($= V_t[k]$) maps to the execution output for the mutated model $G_k$. If $t$ kills the mutated model $G_k$ (i.e., the prediction results for $t$ via the mutated model $G_k$ and the original model

the training set for ranking models.
Based on the training set $R$, we aim to build a training set $R'$ for training the ranking models. First, we generate a group of mutated models for each input $r_i \in R$. Then, we obtain the mutation feature vector $V_i$ of $r_i$ (i.e., a one-dimensional vector in which the $k$th element denotes whether the $k$th mutated model is killed by this input). The mutation feature vector of $r_i$ is used to build the training set $R'$ (i.e., the training set of the ranking models). Second, we let the original GNN model $G$ classify each input $r_i \in R$ and compare its prediction with the ground truth of $r_i$. In this way, we can identify whether $r_i$ is misclassified by the GNN model $G$. If $r_i$ is misclassified by $G$, then we label it as 1. Otherwise, we label it as 0. In this way, we have built the ranking model training set $R'$.
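The construction of mutation features and labels described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: predictions are assumed to be available as plain lists of class ids (`original_preds[i]` from the original model and `mutant_preds[k][i]` from the $k$th mutated model), and all function names are hypothetical. The last helper shows the killing-based view, in which every mutated model counts equally.

```python
def mutation_features(original_preds, mutant_preds):
    """V[i][k] = 1 if input i kills mutant k (the predictions differ), else 0."""
    return [[int(preds_k[i] != original_preds[i]) for preds_k in mutant_preds]
            for i in range(len(original_preds))]

def misclassification_labels(original_preds, ground_truth):
    """Label 1 if the original model misclassifies the input, else 0."""
    return [int(p != y) for p, y in zip(original_preds, ground_truth)]

def killing_scores(features):
    """Killing-based score: the number of mutated models an input kills."""
    return [sum(v) for v in features]

# Four inputs, three mutated models (illustrative values).
original = [0, 1, 2, 1]
mutants = [[0, 1, 1, 1],
           [0, 2, 1, 1],
           [1, 1, 2, 1]]
truth = [0, 1, 0, 1]

X = mutation_features(original, mutants)      # rows of the ranking-model training set
y = misclassification_labels(original, truth)
print(X, y, killing_scores(X))
```

In the feature-based setting, the pairs `(X[i], y[i])` would be passed to a ranking model so that it can learn a different weight for each mutated model instead of summing the kills.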

Table 2. Effectiveness Comparison among KMGP and the Compared Approaches in Terms of APFD

Table 5 presents the comparison results in terms of APFD, while Table 6 and Table 7 present the comparison results in terms of PFD.

Table 3. Effectiveness Comparison among KMGP and the Compared Approaches in Terms of PFD

Table 4. Average Comparison Results among KMGP and the Compared Approaches in Terms of PFD

Table 5. Effectiveness Comparison among KMGP and the Feature-based GraphPrior Approaches in Terms of APFD

Table 6. Effectiveness Comparison among KMGP and the Feature-based GraphPrior Approaches in Terms of PFD

Table 7. Average Effectiveness Comparison among KMGP and the Feature-based GraphPrior Approaches in Terms of PFD

Table 8. Time Comparison between GraphPrior and Compared Approaches

Table 10. Effectiveness Comparison of GraphPrior and the Compared Approaches on Adversarial Test Inputs in Terms of PFD

RFGP consistently outperforms all other GraphPrior approaches in terms of average effectiveness. Moreover, Table 11 presents the overall comparison results in terms of PFD, further indicating that RFGP outperforms all other approaches in terms of average effectiveness. Notably, when prioritizing 20% to 40% of the test inputs, RFGP consistently exhibits the highest number of best cases across a variety of subjects.

Effectiveness of GraphPrior against Adversarial Attacks at Varying Attack Levels
Objectives: We investigate the effectiveness of GraphPrior on adversarial test inputs with different attack levels.
Experimental design: To investigate the effectiveness of GraphPrior on test inputs generated via different levels of graph adversarial attacks, we set different attack levels (i.e., 0.1, 0.2, 0.3, and 0.4) on eight graph adversarial techniques (i.e., DICE, Min-max attack, NEAA, NEAR, PGD attack, RAA, RAF, and RAR). As mentioned in RQ3, the attack level indicates the ratio of adversarial

Table 11. Average Effectiveness Comparison among GraphPrior and the Compared Approaches on Adversarial Test Inputs in Terms of PFD

Table 12. Comparison Results of GraphPrior and the Compared Approaches against Different Levels of the Attacks DICE, MMA, RAA, and RAR in Terms of PFD

best in 46.47% of cases. KMGP performs the best in 35.33% of cases. Notably, when prioritizing 10% of the test inputs, KMGP takes the largest number of best cases. When the attack level is 0.2~0.4, RFGP takes the largest number of best cases.

Table 13. Overall Comparison Results among GraphPrior and the Compared Approaches on Adversarial Tests with Different Attack Levels

Table 14. The Contributions of Different Mutation Rules (GCN)

Table 15. The Contributions of Different Mutation Rules (GAT)