LaF: Labeling-free Model Selection for Automated Deep Neural Network Reusing

Applying deep learning (DL) to science has become a notable trend in recent years, which makes DL engineering an important problem. Although training data preparation, model architecture design, and model training are the normal processes to build DL models, all of them are complex and costly. Therefore, reusing open-source pre-trained models is a practical way for developers to bypass this hurdle. Given a specific task, developers can collect numerous pre-trained deep neural networks from public sources for reuse. However, testing the performance (e.g., accuracy and robustness) of multiple deep neural networks (DNNs) and recommending which model should be used is challenging given the scarcity of labeled data and the demand for domain expertise. In this article, we propose a labeling-free (LaF) model selection approach to overcome the limitations of labeling efforts for automated model reusing. The main idea is to statistically learn a Bayesian model to infer the models' specialty based only on predicted labels. We evaluate LaF using nine benchmark datasets, including image, text, and source code, and 165 DNNs, considering both the accuracy and robustness of models. The experimental results demonstrate that LaF outperforms the baseline methods by up to 0.74 and 0.53 on Spearman's correlation and Kendall's τ, respectively.


Introduction
Deep learning (DL) is helping solve all sorts of real-world problems in various domains, such as computer vision [38], natural language processing (NLP) [34], code understanding [40], and autonomous driving [21]. Due to the outstanding performance of deep neural networks (DNNs), software engineering (SE) researchers have attempted to apply DNNs to solve various SE tasks, such as source code processing [8,40], automatic software testing [52,53], and GUI design [51]. Building machine learning (or deep learning) systems generally requires training data preparation, model architecture design, and model training, all of which are complex and costly.
The rest of this paper is organized as follows. Section 2 introduces the preliminary knowledge behind this work. Section 3 presents the problem statement and our proposed solution. Section 4 explains our experimental design. We present the empirical results and corresponding discussions in Section 5. Section 6 details the threats that may affect the validity of our conclusions. Section 7 reviews related work. The last section concludes our work and points out future research directions.

MLOps and Model Reusing
Generally, building machine learning systems needs a set of operations [5] that are depicted in the first row of Figure 1. Roughly speaking, first, developers collect the data of interest and design a model architecture that is suitable for training on these data. Then, developers feed the model and data into a high-performance server to tune the parameters of the model. Once a model with the expected performance has been trained, it is deployed (embedded) into the application and functions in the wild. Finally, similar to conventional software systems, machine learning systems also need to be evolved and maintained from time to time. For the first two steps, i.e., data preparation and model design, huge human effort and expert domain knowledge are needed to label the data and design the model. For the model training process, expensive computing resources are necessary to handle the complex parameter tuning procedure. In conclusion, the first three steps make the whole process heavyweight before the model is deployed for real usage.
In practice, model reusing is a commonly adopted way to lighten machine learning operations (MLOps). As shown in the second row of Figure 1, the original three operations, data preparation, model design, and model training, are replaced by two: test preparation and model reusing. The reason for this replacement is that, nowadays, given a specific task, e.g., face recognition, we can access many well-established pre-trained models from online resources, e.g., GitHub. Thus, the straightforward way to build a machine learning system is to reuse such models. To do so, we only need to prepare some test data (which can be much less than the training data) for our task, then collect potential pre-trained models, and finally select the best model using the test data to build our system. In this way, we can reduce the human effort of training data labeling and avoid the difficult procedure of model architecture design. In this work, we focus on how to efficiently select the best model from the massive number of available models.

Comparison Testing and Test Selection in Deep Learning
In conventional software engineering, comparison testing [23,42,43] aims at figuring out the strengths and weaknesses of a newly developed software product compared with existing products. The end goal is to facilitate the deployment of a product with high functionality and reliability. Recently, Meng et al. [33] re-framed "comparison testing" as testing methods that aim to compare alternative software artifacts, especially DNNs [22]. Concretely, the problem turns into how to find the most discriminative data that can amply distinguish the differences. In their proposed sample-discrimination-based selection method, majority voting [41] is first applied to produce pseudo labels, based on which DNNs are classified into top, middle, and bottom groups following the item discrimination [15]. Via the prediction difference between the top and bottom DNNs, each sample receives a discrimination score, and those with high scores are selected for the final ranking.
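As an illustration, the discrimination-based selection just described might be sketched as follows; the grouping threshold (thirds) and the exact score definition are our assumptions based on the description, not Meng et al.'s implementation:

```python
import numpy as np

def discrimination_scores(preds):
    """preds: (n_models, n_samples) matrix of predicted labels.
    Returns a per-sample discrimination score based on how differently
    the (pseudo-)top and (pseudo-)bottom model groups predict."""
    n_models, n_samples = preds.shape
    # Step 1: pseudo labels by majority voting over all models.
    pseudo = np.array([np.bincount(preds[:, j]).argmax() for j in range(n_samples)])
    # Step 2: rank models by agreement with the pseudo labels and
    # split them into top / bottom thirds (the middle group is ignored).
    pseudo_acc = (preds == pseudo).mean(axis=1)
    order = np.argsort(-pseudo_acc)
    k = max(1, n_models // 3)
    top, bottom = order[:k], order[-k:]
    # Step 3: a sample discriminates well when the top models agree with
    # the pseudo label but the bottom models do not (cf. item discrimination).
    return (preds[top] == pseudo).mean(axis=0) - (preds[bottom] == pseudo).mean(axis=0)
```

Samples with the highest scores would then be annotated and used to rank the models.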
A topic close to comparison testing is test selection for DNN model performance estimation. The key idea of this kind of test selection is, given massive unlabeled test data and a DNN model, to estimate the performance of the model on these data by using a subset selected by test selection metrics. In this way, the labeling effort can be reduced and the budget can be saved. For instance, Li et al. [28] proposed a cross-entropy-based sampling method to identify the most representative data of a test set. Similarly, Chen et al. [11] developed a practical accuracy estimation method. The difference is that in test selection, the target is a single DNN, while in comparison testing, the target is multiple DNNs. Undoubtedly, one can first approximate the performance of each DNN by selecting its corresponding representative set and then undertake the comparison. However, this largely increases the labeling effort and is less practical than selecting once.
Both comparison testing and test selection can be used in the model reusing process to identify the model with the best performance.

Distribution Shift
Distribution shift is a crucial problem in machine learning: the training data and the test data follow different data distributions. Generally, compared to in-distribution (ID) data, data with distribution shift are more difficult for the model to handle, which makes the performance of the model reported on ID data unreliable.
Roughly speaking, there are two types of distribution shift, artificial and natural. Artificial distribution shift mainly comes from adding artificial perturbations (corruptions) to raw data. Hendrycks and Dietterich [19] proposed adding 15 types of algorithmically generated corruptions with 5 levels of severity to image data to mimic realistic situations, such as noise, blur, snow, and zoom. Based on these corruptions, different benchmark datasets, such as CIFAR-10-C [19] and MNIST-C [35], have been developed for testing the robustness of DNN models. The first two figures in Figure 2 show two examples of artificial distribution shift: the original bird picture with two types of corruption, white noise and brightness. On the other hand, natural distribution shift is usually induced by a change of environment or population and exists in raw data, such as the change of camera traps [9], new customers [36], and new repositories [24]. A recent benchmark [24] provides in-the-wild distribution shifts covering diverse data domains and applications. The last two figures in Figure 2 are examples of natural distribution shift.

Methodology

Problem Formulation
In this paper, we are interested in the classification task. Consider a k-class task over a sample space Z = X × Y, where x ∈ X is an input and y ∈ Y is its class label. Let m : X → Y be a deep neural network (DNN) that maps an input to a label of the problem domain. Given n models, m1, m2, ..., mn, extracted from public sources, and a set of unlabeled test data X' = {x1, x2, ..., xs}, the problem we study is to estimate the rank of the models regarding their performance on X'. We assume that manual labeling is expensive, especially when domain knowledge is required. To this end, we propose to tackle the ranking problem by querying only the predictions, which is highly applicable in practical scenarios. Remarkably, in this paper, we consider performance in terms of both accuracy and robustness: accuracy is the ratio of correct predictions on ID data, and robustness is the ratio of correct predictions on OOD data.

Motivating Example
Figure 4 gives an example of LaF. In this simple 3-class example, there are 3 DNN models (m1, m2, m3) given 6 unlabeled samples (x1, x2, ..., x6). The goal is to rank the 3 models concerning their accuracy on these samples in the absence of true labels. First, we compute the predicted label of each sample by each model and remove x6, on which all the models have the same prediction. Second, we initialize the two parameters of our approach, β_j and α_i. β_j refers to how difficult sample x_j is for all models to predict correctly. α_i indicates how good model m_i is at outputting the correct labels of all samples. Initially, we use the simplest and most commonly used majority-voting heuristic [41] to give a pseudo label to each sample. For instance, the pseudo label of x1 is 0 because 2 of the 3 models (m1 and m2) predict the label as 0. β_j is defined as the ratio of mismatched models that output a label different from the pseudo one. α_i is calculated as the ratio of correctly predicted samples over the entire set. Third, since the pseudo labels are not the true labels, β_j and α_i cannot truly reflect the data difficulty and model ability. We therefore optimize these two parameters by maximum likelihood estimation, treating the unknown true labels as latent variables. Finally, based on the optimized α = (4/5, 1/5, 3/5), we obtain the rankings 1, 3, and 2 for m1, m2, and m3, respectively.
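The initialization in this example can be reproduced numerically. The prediction matrix below is hypothetical (Figure 4's exact predictions are not reproduced here); it only illustrates how the pseudo labels, α, and β are initialized:

```python
import numpy as np

# Hypothetical prediction matrix for the 3-model, 5-sample example
# (x6 already removed, since all models agreed on it);
# rows = models m1..m3, columns = samples x1..x5.
preds = np.array([
    [0, 1, 2, 0, 1],   # m1
    [0, 2, 2, 1, 0],   # m2
    [1, 1, 0, 0, 1],   # m3
])
# Pseudo labels by majority voting over the models.
pseudo = np.array([np.bincount(col).argmax() for col in preds.T])
# beta_j: ratio of models disagreeing with the pseudo label (data difficulty).
beta = (preds != pseudo).mean(axis=0)
# alpha_i: ratio of samples each model predicts "correctly" w.r.t. pseudo labels.
alpha = (preds == pseudo).mean(axis=1)
```

With this hypothetical matrix, the initial abilities already order the models as m1 > m3 > m2, consistent with the final ranking in the example; the subsequent optimization step refines these estimates.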

Our Approach: LaF
Given that no label is available in the test data, the main idea of our approach is to infer the specialties of the DNNs by approximately maximizing the likelihood between the predictions and the true labels via the expectation-maximization (EM) algorithm [13].

Step 1: Pruning. Inevitably, some data receive the same prediction from all models, which is useless for discriminating the performance and incurs computational cost. For this reason, we filter these data without losing any information for ranking and obtain a smaller set X' (Lines 1-6 in Algorithm 1).

Step 2: Initializing. First, for each sample x_j, a pseudo label is voted by the majority of DNNs, namely, y'_j = mode({l_ij}_{1≤i≤n}) (Lines 7-9). Next, β_j is initialized from the number of DNNs that give a label different from the pseudo label, and α_i is the accuracy based on the pseudo labels (Lines 10-15).

Step 3: Optimizing. The EM algorithm solves the optimization problem by iteratively performing an expectation (E) step and a maximization (M) step (Lines 16-26). In the E-step, it estimates the expected value of the log-likelihood Q(θ, θ_old), where θ = (α, β) and θ_old is the estimate from the previous iteration. For the computation, we use the definition from [47], p(l_ij = y_j | α_i, β_j) = 1/(1 + e^(−α_i β_j)). Besides, as α_i and β_j are independent given l_ij, p(y_j) = p(y_j | l_ij, θ). Remarkably, y_j represents the true label of a sample; in the ranking problem, y_j is absent, but the probability of a candidate label being the true one can be inferred from p(y_j | l_ij, θ).

In the M-step, gradient ascent is applied to search for the α_i and β_j that maximize Q, where θ_new denotes the updated parameters for the next iteration.

Step 4: Ranking. Finally, as α well estimates the ability of each DNN given the observed predictions, we use this vector to rank the DNNs (Line 27). A high specialty indicates good performance on the data.
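Putting the four steps together, a minimal sketch of the pipeline might look like the following. It is an illustration under assumptions, not the original implementation: we assume the GLAD-style likelihood p(l_ij = y_j | α_i, β_j) = 1/(1 + e^(−α_i β_j)) from [47], parameterize each sample by an "easiness" value in place of the difficulty, and take a single gradient-ascent step per M-step:

```python
import numpy as np

def laf_rank(preds, n_classes, iters=50, lr=0.1):
    """Rank models given only their predicted labels (no ground truth).
    preds: (n_models, n_samples) matrix of integer labels.
    Returns model indices, best first.
    Simplified GLAD-style EM: p(l_ij = y_j | a_i, b_j) = sigmoid(a_i * b_j)."""
    # Step 1 (pruning): drop samples on which every model agrees.
    keep = np.array([len(set(col)) > 1 for col in preds.T])
    preds = preds[:, keep]
    n_models, n_samples = preds.shape
    # Step 2 (initializing): majority-voting pseudo labels.
    pseudo = np.array([np.bincount(col, minlength=n_classes).argmax() for col in preds.T])
    a = (preds == pseudo).mean(axis=1)   # model ability
    b = (preds == pseudo).mean(axis=0)   # sample "easiness" (inverse difficulty)
    for _ in range(iters):
        # E-step: posterior over each sample's latent true label.
        sig = 1.0 / (1.0 + np.exp(-np.outer(a, b)))    # (n_models, n_samples)
        logp = np.zeros((n_samples, n_classes))
        for c in range(n_classes):
            hit = preds == c
            logp[:, c] = np.where(hit, np.log(sig), np.log((1 - sig) / (n_classes - 1))).sum(axis=0)
        post = np.exp(logp - logp.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: one gradient-ascent step on the expected log-likelihood.
        p_own = post[np.arange(n_samples), preds]      # posterior mass on each model's own label
        grad = p_own - sig
        a = a + lr * (grad * b).sum(axis=1) / n_samples
        b = b + lr * (grad * a[:, None]).sum(axis=0) / n_samples
    # Step 4 (ranking): sort models by their estimated ability.
    return np.argsort(-a)
```

The gradient expressions follow from the sigmoid likelihood: the derivative of the expected log-likelihood with respect to a_i sums b_j (p_own − σ_ij) over samples, and symmetrically for b_j.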

Implementation
All experiments were conducted on a high-performance computing cluster; each node runs a 2.6 GHz Intel Xeon Gold 6132 CPU with an NVIDIA Tesla V100 16G SXM2 GPU. We implement the proposed approach and the baseline methods on top of the state-of-the-art frameworks TensorFlow 2.3.0 and PyTorch 1.6.0. For the artificial data distribution shift, we consider 2 benchmark datasets, each of which includes 15 types of corruption with 5 levels of severity. In total, we test on 2 × 15 × 5 = 150 datasets with artificial distribution shift. Due to the space limitation, we only report the average results on corrupted data for the baseline methods. The remaining results corroborate our findings and are available on our companion project website [3].

Algorithm 1: LaF: Labeling-free comparison testing

Research Questions
In this study, we focus on the following three research questions:
• RQ1 (effectiveness given ID test data): How does LaF rank multiple DNNs given ID test data?
• RQ2 (effectiveness under distribution shift): How does LaF rank multiple DNNs given OOD test data (including artificial and natural distribution shifts)?
• RQ3 (impact factors of LaF): What is the impact of model quality and diversity on ranking DNNs?
The first two research questions evaluate the effectiveness of our proposed solution given test data without and with distribution shift, respectively. The second one also shows how flexible and practical LaF is in real-world applications, especially on test data with natural distribution shifts. The last one investigates the factors that may affect the ranking performance.

Datasets and DNNs
Datasets. We choose 7 datasets, MNIST [26], Fashion-MNIST [48], CIFAR-10 [25], iWildCam [9], Amazon [36], Java250, and C++1000 [40], that are widely studied in previous work. These datasets cover the image (first 4), text (Amazon), and source code (Java250 and C++1000) domains. Test data that follow the same distribution as the training set are so-called in-distribution (ID) data; test data with a distribution shift are out-of-distribution (OOD) data. In our work, we consider two types of distribution shift: artificial and natural. For the artificial distribution shift, we use two benchmark datasets, MNIST-C [35] and CIFAR-10-C [19], for MNIST and CIFAR-10, respectively. Each benchmark includes 75 datasets covering 15 types of corruption (Gaussian noise, shot noise, impulse noise, defocus blur, frosted glass blur, motion blur, zoom blur, snow, frost, fog, brightness, contrast, elastic, pixelate, and JPEG), each at 5 levels of severity. For the natural distribution shift, we use two datasets, for iWildCam and Amazon, respectively, from a recently published benchmark, WILDS [24]. The distribution shift comes from new camera traps in iWildCam and new users in Amazon. For Java250, we manually collect an OOD dataset based on the definition in WILDS that the distribution shift of source code comes from new repositories. For each class in Java250, we extract Java files from [40] under the constraint that the corresponding users do not exist in the ID data. Table 1 lists the details of the datasets.
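As a rough illustration of how such corrupted variants are produced, Gaussian noise at increasing severity can be sketched as below; the severity-to-noise-scale mapping is our assumption, whereas the actual benchmarks use calibrated parameters per corruption type:

```python
import numpy as np

def gaussian_noise(image, severity):
    """Corrupt an image (float array, values in [0, 1]) with Gaussian noise.
    Severity 1..5 maps to an assumed noise scale; MNIST-C / CIFAR-10-C
    use per-corruption calibrated parameters instead."""
    scales = {1: 0.04, 2: 0.08, 3: 0.12, 4: 0.18, 5: 0.26}
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    noisy = image + rng.normal(0.0, scales[severity], size=image.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep pixel values valid
```

Applying each corruption type at each severity to a base test set yields the 15 × 5 corrupted datasets per benchmark described above.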

Baseline Methods
In our study, we compare LaF to 3 baseline methods: random sampling, SDS, and CES. All baseline methods are sample-selection-based. Following [33], the labeling budget of the baseline methods ranges from the number of DNNs (e.g., 30 for MNIST) to 180 at intervals of 5.
Random sampling is a basic and model-independent method for data selection where each sample has an equal probability of being selected. A subset of data is randomly selected and annotated to rank the DNNs.
Sample discrimination-based selection (SDS) [33] is the state-of-the-art approach for ranking multiple DNNs with respect to accuracy. Following [33], among the data in the top 25% with high discrimination scores, we randomly select a given budget of samples to annotate and use them to perform the ranking task.
Cross-entropy-based sampling (CES) [28] is designed to select a set of representative data to approximate the actual performance of a single DNN. We follow the same procedure as [33] to adapt CES for multi-DNN comparison.
Due to the randomness of sampling, each experiment with the baseline methods is repeated 50 times, and we report the average result.
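To make the sampling randomness of these baselines concrete, the random-sampling baseline might be sketched as follows (an illustrative reconstruction, not the original implementation):

```python
import numpy as np

def random_baseline_ranks(preds, labels, budget, repeats=50, seed=0):
    """Random-sampling baseline: label `budget` randomly chosen samples and
    rank the models by accuracy on that subset; repeated to expose the
    run-to-run sampling randomness that a labeling-free method avoids.
    preds: (n_models, n_samples) predicted labels; labels: (n_samples,)."""
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(repeats):
        idx = rng.choice(preds.shape[1], size=budget, replace=False)
        sub_acc = (preds[:, idx] == labels[idx]).mean(axis=1)
        ranks.append(np.argsort(-sub_acc))  # best model first
    return ranks
```

The spread of the resulting rankings across repeats corresponds to the standard deviation bands reported for the baselines in the figures.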

Evaluation Measures
To evaluate the effectiveness of each method, we follow the baseline work [33] and apply two statistical measures: Spearman's rank-order correlation [12] and the Jaccard similarity [33]. The first evaluates the general ranking of all models, while the second specifically estimates the ranking of the top-k DNNs. In addition, we add an evaluation based on Kendall's τ rank correlation [12]. Similar to Spearman's rank-order correlation, Kendall's τ measures non-parametric rank correlation. However, Kendall's τ is computed from concordant and discordant pairs and is insensitive to errors (if any) in the data, whereas Spearman's rank-order correlation is computed from rank deviations and is more sensitive to such errors.
Kendall's τ is defined as

τ = (P − Q) / √((P + Q + T)(P + Q + U)),

where P and Q are the numbers of concordant and discordant pairs in {r(m_i), r'(m_i)}, respectively, and T and U are the numbers of ties in {r(m_i)} and {r'(m_i)}, respectively. A large τ indicates a strong agreement between the ground truth and the estimation. Meng et al. proposed to apply the Jaccard similarity for measuring the similarity between the top-k models. The similarity coefficient is defined as

J_k = |top_k(r) ∩ top_k(r')| / |top_k(r) ∪ top_k(r')|.

A large J_k implies a high success in identifying the top-k models.
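These measures can be sketched in plain NumPy as follows; in practice, scipy.stats.spearmanr and scipy.stats.kendalltau provide production implementations, and the Kendall variant below is the tie-free form (tau-a), a simplification for brevity:

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def kendall_tau(x, y):
    """Kendall's tau without tie correction: (concordant - discordant) / pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return float(s / (n * (n - 1) / 2))

def jaccard_top_k(true_scores, est_scores, k):
    """Jaccard similarity between the top-k model sets of two scorings."""
    top_true = set(np.argsort(-np.asarray(true_scores))[:k])
    top_est = set(np.argsort(-np.asarray(est_scores))[:k])
    return len(top_true & top_est) / len(top_true | top_est)
```

For example, two identical rankings yield a correlation of 1.0 under both rank measures and a top-k Jaccard similarity of 1.0.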

RQ1: Effectiveness Given ID Test Data
First, we compare the effectiveness of the four methods in ranking multiple DNNs based on the accuracy on ID data. Figure 5 shows the results measured by Spearman's rank-order correlation. The first conclusion we can draw is that, over the seven datasets, all methods succeed in outputting positively correlated rankings. By comparison, LaF consistently outperforms the baseline methods (by up to 0.74) regardless of the labeling budget. Namely, the ranking by LaF is strongly correlated with the ground truth. In general, for the three sample-selection-based baseline methods, the correlation between the estimated rank and the ground truth increases when more data are labeled. However, for some datasets, the performance is still far from LaF. For example, on Amazon, LaF obtains a correlation coefficient of 0.80, while the best baseline, SDS, only achieves 0.48 using the maximum labeling budget of 180. Besides, due to the sampling randomness, each baseline method obtains different ranking results over the 50 experiments, which is indicated by the large standard deviation (up to 0.36, shaded area in the figure) at each labeling budget. As a result, the ranking from a single experiment is unreliable, being occasionally good and occasionally poor. In particular, the standard deviation becomes smaller when more data are labeled, which means the ranking quality highly relies on the labeling budget. By contrast, since LaF is labeling-free, there is no sampling randomness; in other words, its ranking is deterministic. Additionally, Figure 6 presents the effectiveness of all ranking methods based on Kendall's τ rank correlation. The result confirms the conclusion drawn from the analysis based on Spearman's rank-order correlation, namely, that our approach stands out concerning effectiveness without sampling randomness.
Besides, to demonstrate the significance of the two statistical analyses, we calculate the corresponding p-value for all methods. A p-value lower than the common significance level of 0.05 indicates that the correlation between the ranking and the ground truth is statistically significant. Except for the iWildCam dataset, the ranking results by LaF are all strongly correlated. However, due to their limited effectiveness and the sampling randomness, the baseline methods always achieve insignificant rankings. For the iWildCam dataset, we believe the reason is that the difference between the multiple DNNs is too slight given 182 classes. For instance, the accuracy difference between the best and worst models is only 1.54% (Table 1). The impact of accuracy/robustness on the ranking is investigated in Section 5.3.
On the other hand, we evaluate the different methods concerning identifying the top-k DNNs (k = 1, 3, 5, 10). Table 4 lists the results of the Jaccard similarity. On average, LaF achieves the best result regardless of the dataset. It is better than the worst performance by up to 0.33, 0.32, 0.33, and 0.27 in the top-1, top-3, top-5, and top-10 rankings, respectively. Concretely, in the top-1 ranking, for the datasets MNIST, Fashion-MNIST, and iWildCam, all methods (Random, SDS, and ours) are not effective (under 0.08). Remark that CES takes the best result over all models for each labeling budget when knowing the ground truth; specifically, it requires n (the number of DNNs) times the labeling budget. Therefore, it sometimes outperforms the others but is not applicable in practice.
Answer to RQ1: Based on the accuracy on ID test data, LaF outperforms all 3 selection-based baseline methods in outputting strongly correlated rankings. In addition, statistical analysis demonstrates that this advantage is significant.

RQ2: Effectiveness Under Distribution Shift
For the synthetic distribution shift, Tables 5 and 6 summarize the results of Spearman's rank-order correlation and Kendall's τ on MNIST-C and CIFAR-10-C, respectively. We observe that our approach achieves the best performance in most cases, for instance, 291 of 300 cases in MNIST-C and 289 of 300 cases in CIFAR-10-C concerning Spearman's correlation, and 287 of 300 and 288 of 300 cases, respectively, concerning Kendall's τ. Furthermore, as shown in RQ1, SDS performs the second best among the four ranking approaches. However, compared to random sampling and CES, SDS tends to lose its performance in these two tables (highlighted in yellow). For example, on MNIST-C with Defocus Blur at severity 2, SDS ranks the models wrongly with a correlation of -0.03 (Table 5), while both random sampling and CES achieve ranking performance comparable to LaF. In short, this existing state-of-the-art approach is sensitive to artificial distribution shift, which calls for testing existing approaches under distribution shift. Concerning the Jaccard similarity, in the 75 corruptions of MNIST-C, both LaF and CES outperform random sampling and SDS in precisely identifying the top DNNs. On CIFAR-10-C, LaF achieves the best performance (similarity of 1) in most cases (173 of 300).
For the natural distribution shift, the results are shown in Figure 7. LaF can better distinguish the performance of DNNs than the baseline methods. In particular, on Amazon, LaF is significantly better, by up to 0.70, based on Spearman's correlation. In addition, concerning the Jaccard similarity in Table 7, LaF is consistently the best at identifying the top DNNs for iWildCam and Amazon.
In addition, compared to the effectiveness given ID test data, the rankings by all methods change, since the performance of the DNNs changes given OOD test data. However, we notice that opposite phenomena happen. Given ID test data, LaF achieves 0.39, 0.80, and 0.96 concerning Spearman's coefficient for iWildCam, Amazon, and Java250, respectively, while given OOD test data, the results are 0.91, 0.71, and 0.85, respectively. In other words, the effectiveness improves on the OOD test data for iWildCam but degrades for Amazon and Java250. To clarify the reason behind this, we analyze the accuracy and robustness of the multiple DNNs on ID and OOD test data (Table 1), respectively. In iWildCam, the performance difference of its 20 DNNs becomes larger on OOD test data, from 1.54% to 11.52%. In Amazon, the performance of all 20 DNNs degrades, e.g., from 74.84% to 72.35%. Besides, the performance difference in Amazon becomes smaller. Therefore, we believe that the models' ability and the performance difference among DNNs have an impact on the ranking effectiveness, which leads to the investigation in RQ3.

RQ3: Impact Factors of LaF
We study two factors: the quality and the diversity of DNNs. The quality refers to the average accuracy (or robustness) of all DNNs on the ID (or OOD) data. The diversity indicates the performance difference among DNNs and is the standard deviation of accuracy or robustness over all DNNs on each dataset. Figure 8 plots the distribution of ranking performance concerning quality and diversity. Most good rankings happen with a high model quality (greater than 50%). The reason is that, in our scenario, we only have access to the predictions of the multiple DNNs on the test data, which set up the initial inference of data difficulty and model specialty. Therefore, the learned Bayesian model can be more precise when the qualities of the DNNs are high. Furthermore, this also explains why LaF outperforms the sampling-based methods. For example, SDS selects a few discriminative data to annotate to rank the DNNs, and the selection of data highly relies on the predicted labels. As a result, since low-quality DNNs always give a wrong estimation of the discrimination ability of the data, the ranking performance is poorer. For instance, on Java250 and C++1000, SDS only reaches 0.82 and 0.79 on Spearman's correlation with 20 labeled data, respectively, whereas LaF achieves 0.96 and 0.95 on the two datasets, respectively, with no labeling effort. On the other hand, concerning the diversity, Figure 8 reveals that there is a high chance of a good ranking when the DNNs are diverse (greater than 5%). Additionally, a poor ranking mostly happens when the DNNs are too close to each other.

Threats to Validity
The internal threat is mainly due to the implementation of the baseline methods, our proposed approach, and the evaluation metrics. For SDS, we use the original implementation on GitHub provided by Meng et al. [33]. For random sampling and CES, we implement them based on the description in [33] and carefully check that the results are consistent with those in [33]. Regarding the evaluation metrics, we adopt popular libraries, such as SciPy [6].
The external threat comes from the evaluated tasks, datasets, DNNs, and baseline methods. Regarding the classification tasks, we consider three different domains: image, text, and source code. For the datasets, we select publicly available datasets. In particular, for the datasets with artificial distribution shift (15 types of corruption) and natural distribution shift, we employ four public benchmarks. Concerning the DNNs, we collect them (either off-the-shelf models or models trained with the provided scripts) from different repositories on GitHub. These models have different architectures and parameters. For the comparison, we consider three sample-selection-based baseline methods and apply different labeling budgets to evaluate their performance.
The construct threat mainly lies in the sampling randomness in the baseline methods and in the evaluation measures. To reduce the impact of randomness, for each baseline method, we repeat each experiment 50 times for each labeling budget and dataset and report both the average and the standard deviation. Since our proposed approach does not rely on sampling data to annotate, there is no sampling randomness. Considering the randomness (gradient-ascent search) in the EM algorithm, we repeated LaF 50 times and found that the randomness was negligible (less than 1.84E-03). Regarding the evaluation measures, we consider three popular statistical analyses. Kendall's τ rank correlation and Spearman's rank-order correlation infer the effectiveness of the methods concerning the general ranking, while the Jaccard similarity specifically checks the performance concerning the top-k ranking. Besides, for the statistical analyses, we report the p-value to demonstrate significance.

Related Work
We review the related work from two aspects, deep learning testing and test selection for deep learning.

Deep Learning Testing
Deep learning (DL) testing refers to evaluating the quality of developed deep neural networks (DNNs) for further deployment [50]. A simple and local testing strategy is to split a dataset into training, validation, and test sets. The training and validation sets contribute to the training process to tune parameters. The test set is untouched by the training process to provide an unbiased evaluation of the accuracy. Typically, this testing is built on the assumption that the training and test sets are independent and identically distributed.
Instead of simple performance testing, multiple advanced testing techniques have been proposed in recent years. Pei et al. [39] proposed neuron-coverage-guided testing for deep learning systems, which borrows the idea of code-coverage-based testing from traditional software engineering. Here, the coverage is calculated based on the outputs of neurons in the DNN. After that, based on the basic neuron coverage criterion, Ma et al. [31] designed different types of coverage criteria to further explore coverage-guided testing. Based on these coverage criteria, DeepTest [45] and TACTIC [30] aim to test DNN-based self-driving systems. Both of them utilize the coverage information to guide a search algorithm that generates error-prone test sets to challenge the target systems. Besides, the well-known technique of fuzz testing has also been applied to testing deep learning models. Odena et al. [37] proposed the first fuzzing framework, which randomly injects noise into images to generate new tests that find error inputs against the DNN model. Xie et al. [49] used neuron coverage as the fitness function to fuzz the data and generate tests for DNN testing. More practically, Guo et al. [18] provided a tool to support fuzz testing of DL models. Different from the above works, which focus on a single DNN model and utilize test data to measure the quality of the model, our work studies multiple models and provides a new technique to rank multiple models without label information.

Test Selection for Deep Learning
The purpose of test selection for deep learning is to reduce the labeling effort during DL testing. Generally, test selection methods can be divided into two types: test selection for fault identification and test selection for performance estimation.
Test selection for fault identification aims to find the test data that are most likely to be mispredicted by the model. Multiple methods have been proposed in the last few years. Feng et al. [16] and Ma et al. [32] proposed metrics based on the uncertainty of output probabilities and also demonstrated that these metrics can be used to select data and retrain the pre-trained model to further enhance its performance. Chen et al. [46] utilized the technique of mutation testing to mutate input data and models and select the error-prone test data based on the killing score. More recently, Li et al. [27] proposed a learning-based method that uses graph neural networks to learn the difference between fault data and normal data and then predicts new faults. Gao et al. [17] considered the diversity of faults and selected faults from different fault patterns.
Test selection for performance estimation aims to select a subset of data that can represent the whole test set. In this way, we can label and test only this subset to learn the performance of the model on the entire test data. Li et al. [29] proposed CES, which selects samples that have the minimum cross entropy with the entire test set. Chen et al. [11] proposed a clustering-based method, PACE, that selects data from the center of each cluster.
Even though test selection for performance estimation can also be used for selecting the best model during the model reusing process, we consider it as a comparison baseline. The major difference between test selection and our proposed method is that LaF is labeling-free, which means the model reusing process can be fully automated.

Conclusion
Observing the limitations (labeling effort, sampling randomness, and performance degradation on out-of-distribution data) of existing selection-based methods, we proposed a labeling-free approach to rank multiple deep neural networks (DNNs) without the need for domain expertise, thereby lightening MLOps. The main idea is to build a Bayesian model over the predicted labels of the data, which eliminates both the labeling effort and the sampling randomness. The experimental results across various domains (image, text, and source code) and different performance criteria (accuracy and robustness against artificial and natural distribution shifts) demonstrate that LaF significantly outperforms the three baseline methods concerning both Spearman's correlation and Kendall's τ. In addition, the results of the Jaccard similarity show the efficiency of LaF in identifying the top-k (k = 1, 3, 5, 10) DNNs.
This work currently focuses only on the classification task; we will explore other tasks, such as regression, in future work. Moreover, observing the ranking difference on in-distribution (ID) and out-of-distribution (OOD) test data, our approach might be useful for detecting the existence of distribution shifts. We will also consider this in future work.
Fig. 2 shows examples of natural distribution shift. The two pictures have the same label, cow, but the cows are captured from different positions.

Fig. 5 .
Fig. 5. Spearman's correlation coefficient of ranking results based on ID test data. The higher the better. The shaded area represents the standard deviation. "Budget" is the number of labeled data.

Fig. 6 .
Fig. 6. Kendall's τ of ranking results based on ID test data. The higher the better. The shaded area represents the standard deviation. "Budget" is the number of labeled data (only applies to Random, SDS, and CES).

Fig. 7 .
Fig. 7. Spearman's correlation coefficient of ranking results based on OOD test data. The higher the better. The shaded area represents the standard deviation. "Budget" is the number of labeled data.
Following [47], let $Y = \{y_{i,j}\}_{1 \le i \le N, 1 \le j \le M}$ be the predicted labels of $X$ and $Z = \{z_i\}_{1 \le i \le N}$ be the true labels, where $y_{i,j}$ is the label predicted for input $x_i$ by model $m_j$. Given the observed $Y$ and the latent $Z$ governed by unknown parameters $\theta$, the likelihood function is defined as $L(\theta; Y) = p(Y \mid \theta) = \sum_{z} p(Y \mid Z = z, \theta)\, p(Z = z \mid \theta)$. The goal is to search for the best $\theta$ that maximizes this likelihood, in other words, the probability of observing $Y$. As for $\theta$, inspired by [47], we consider two factors that influence the performance of DNNs: the data difficulty $d = \{d_i\}_{1 \le i \le N}$ and the model specialty $s = \{s_j\}_{1 \le j \le M}$. Namely, $\theta = (d, s)$. Algorithm 1 presents the pseudo-code of our approach.
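For intuition, the sketch below runs EM on a drastically simplified one-coin variant of such a latent-truth model: it estimates a single accuracy per model and omits the per-input difficulty $d$ that LaF models. It is not the LaF algorithm itself; the function name and interface are illustrative assumptions.

```python
import numpy as np

def em_rank_models(preds, n_classes, iters=50):
    """Estimate each model's accuracy from predicted labels only.

    `preds` is an (N, M) integer array: labels predicted by M models
    on N inputs. No true labels are used.
    """
    N, M = preds.shape
    # Initialize the posterior over true labels by majority voting.
    post = np.zeros((N, n_classes))
    for c in range(n_classes):
        post[:, c] = (preds == c).sum(axis=1)
    post /= post.sum(axis=1, keepdims=True)
    acc = np.full(M, 0.5)
    for _ in range(iters):
        # M-step: a model's accuracy = expected agreement with the latent truth.
        for j in range(M):
            acc[j] = post[np.arange(N), preds[:, j]].mean()
        acc = np.clip(acc, 1e-6, 1 - 1e-6)
        # E-step: posterior over true labels under a one-coin noise model,
        # where a wrong model picks uniformly among the other classes.
        log_post = np.zeros((N, n_classes))
        for j in range(M):
            hit, miss = np.log(acc[j]), np.log((1 - acc[j]) / (n_classes - 1))
            for c in range(n_classes):
                log_post[:, c] += np.where(preds[:, j] == c, hit, miss)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return acc  # rank the M models by this estimate
```

The key property shared with LaF is that the ranking emerges purely from the agreement structure of the predicted labels, with no ground truth involved.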

Table 1 .
Summary of datasets."#ID" is the number of in-distribution test data."#OOD" is the number of out-of-distribution test data with artificial or natural distribution shifts.

Table 2 .
Summary of models. "#DNNs" is the number of DNNs collected for each dataset. "#Parameters" shows the minimum and maximum number of parameters of the collected DNNs. "Accuracy" and "Robustness" list the lowest and highest accuracy and robustness on test data without and with distribution shift, respectively. MNIST-C and CIFAR-10-C are two benchmark datasets whose corresponding robustness is summarized in Table 3.

Table 4 .
Jaccard similarity of ranking the top-k DNNs based on the clean accuracy. For baseline methods, we report the average results over all labeling budgets. The best performance is highlighted in gray. The higher the better.

Table 5 .
Spearman's correlation coefficient of ranking results based on MNIST-C and CIFAR-10-C. For baseline methods, we compute the average and standard deviation over all labeling budgets and 50-repetition experiments. The best performance is highlighted in gray. Values highlighted in yellow indicate where CES or Random outperforms SDS. The higher the better.

Table 6 .
Kendall's τ of ranking results based on MNIST-C and CIFAR-10-C. For Random, SDS, and CES, we compute the average and standard deviation over all labeling budgets and 50-repetition experiments. The best performance is highlighted in gray. Values highlighted in yellow are where CES or Random outperforms SDS. The higher the better.