KAPE: kNN-based Performance Testing for Deep Code Search

Code search is a common yet important activity of software developers. An efficient code search model can largely facilitate the development process and improve programming quality. Given their superb performance in learning contextual representations, deep learning models, especially pre-trained language models, have been widely explored for the code search task. However, existing studies mainly focus on proposing new architectures for ever-better performance on designed test sets but ignore the performance on unseen test data, where only natural language queries are available. The same problem in other domains, e.g., CV and NLP, is usually solved by test input selection, which uses a subset of the unseen set to reduce the labeling effort. However, approaches from other domains are not directly applicable and still require labeling effort. In this article, we propose kNN-based performance testing (KAPE) to efficiently solve the problem without manually matching code snippets to test queries. The main idea is to use semantically similar training data to perform the evaluation. Extensive experiments on six programming language datasets, three state-of-the-art pre-trained models, and seven baseline methods demonstrate that KAPE can effectively assess the model performance (e.g., CodeBERT achieves an MRR of 0.5795 on JavaScript) with only a slight difference (e.g., 0.0261).


INTRODUCTION
Code search aims to retrieve semantically relevant code snippets from a large code corpus that best match a natural language query, which is an essential practice of software developers to avoid "reinventing the wheel." An early case study conducted at Google shows that a developer, on average, makes 5 search sessions with 12 queries every workday [40]. Beyond serving as a critical development activity, code search can support other software engineering tasks, such as defect localization [49], program repair [2], and code synthesis [38]. Generally, given a functionality, developers seek to reuse previously written code examples by searching over popular platforms, such as Stack Overflow [51], GitHub [10], and Google [30]. For example, developers have sought coding help from Stack Overflow over 45.1 billion times, and more than 21 million queries were made. Due to the constantly growing demand, researchers have leveraged the data from these platforms to power code search engines. Studies have shown that deep learning (DL) is the most popular modeling technique for code search [27], given its ability to embed the code representation [4,14,31].
Although deep code search has attracted wide attention from researchers devoted to the development of ever-better deep neural networks (DNNs) [4,14,31], the testing of such models for secure and reliable deployment lags behind. For instance, the common scenario of testing the model performance on unknown queries has not been studied. Differing from traditional software systems, DL-based systems have a fundamentally different nature and computing logic. In conventional programming, developers design the computing logic to obtain the executable code for solving a given task; a change of data will not influence the functionality. In DL, by contrast, developers design the architecture of the DNN, which learns the computing logic from the input data and expected results. Specifically, the logic is defined by the weights and biases parameterizing the connections inside a DNN. Consequently, the DNN's behavior may evolve in response to new data. Namely, given a trained deep code search model, the reported performance on the original test data cannot reflect the actual performance on unseen data, leading to the demand for dedicated testing before deployment. For example, the mean reciprocal rank (MRR) performance of a pre-trained CodeBERT is 0.8048 on the original Ruby test data but changes to 0.7301 on new test data (see Table 2; more details of the dataset and MRR can be found in Section 4).
Due to the supervised learning nature of DL [11], testing a trained deep code search model requires query-code pairs to calculate the correctness of identifying the best code snippets. In practice, it is easy to collect a large number of natural language queries from public platforms, such as Stack Overflow. However, the corresponding code snippets are usually missing, which makes the testing challenging. Assigning the matched code snippet for each unseen query is straightforward but impractical and almost impossible for four main reasons. ❶ Time cost: Collecting queries from online platforms is easy and free, but manually checking the code snippets is time-consuming, especially with the continuous surge of new queries every day. ❷ Domain knowledge: Generally, code data of a specific programming language is used to test the DL model. Given the various languages, e.g., Java, Python, and Go, domain experts are required to give reliable matchings. ❸ Correctness: Even with domain knowledge, errors are inevitable in human work. ❹ Financial cost: For instance, labeling 1,000 units (50 words per unit) of text for classification costs 129 dollars per human labeler on the Google Cloud AI Platform data labeling service, and each unit needs at least three labelers to guarantee correctness [13]. Remarkably, the cost increases significantly when the task requires excellent query understanding and strong development experience in specific programming languages. To the best of our knowledge, there is no work in the literature that focuses on this specific testing scenario.
The same issue also happens to DL models in other domains, such as computer vision (CV) [5,26] and natural language processing (NLP) [19]. Researchers tend to apply the test input selection technique [5] to solve this issue in CV and NLP, especially for classification tasks, e.g., image classification and text classification. In this technique, a subset of data is selected based on a specific selection metric to represent the entire set. Many selection metrics have been proposed in these two domains [5,17,26]. However, they are not directly applicable to deep code search models. For instance, in CV and NLP, studies have shown the success of selecting data based on the prediction probability [26]. However, deep code search is not a classification task, and it aims to, given a query, find the best-match code snippet from a large codebase. Thus, there is no prediction probability to use but only the contextual representation of natural language queries.
This article focuses on testing a trained deep code search model on unseen data without extra human effort, namely, manual query-snippet matching. We propose the kNN-based performance estimation (KAPE), which takes advantage of training queries that are similar to each unseen query to undertake the performance estimation. Concretely, KAPE first feeds both the training and test queries into the trained model to obtain the contextual representations for the subsequent similarity calculation. Then, we leverage the widely used cosine similarity [50] to quantify the semantic similarity between representations. Next, to locate the corresponding queries from the training set for each test query, we utilize the simple yet efficient non-parametric k-nearest neighbors algorithm [3]. Finally, concerning the difference in data, we propose to adaptively determine the relevant nearest neighbors and their weights contributing to the performance based on the Z-Score [53]. The evaluation on six programming languages and three pre-trained models demonstrates that KAPE is capable of estimating the model performance on unseen data. Meanwhile, our investigation of the parameter sensitivity and the influence of the data distribution shows that KAPE is stable and flexible to possible changes. Additionally, our ablation study demonstrates the usefulness of the adaptive weight calculation. To summarize, the main contributions of our work are as follows:
(1) To the best of our knowledge, this is the first work that undertakes performance estimation testing for deep code search models.
(2) We propose KAPE, which requires no manpower to manually match code snippets to unseen test queries. KAPE is automatic and practical in real-world applications.
(3) We conduct comprehensive experiments on six popular programming languages, three state-of-the-art pre-trained models, and seven baseline methods to evaluate the effectiveness of KAPE.
The rest of this article is organized as follows: Section 2 introduces the background and related work of this article. Section 3 details the methodology of our proposed KAPE. Sections 4 and 5 cover the experimental setup and the results analysis. Section 6 discusses the strengths and limitations of KAPE and lists the potential threats that influence our conclusions. In Section 7, we conclude our work and point out future work.

Code Search and Pre-trained Models
Code search is a daily activity of software developers during software development. Given its importance, many code search tools/models have been developed, which can be divided into two categories: traditional and DL-based methods [27]. The traditional manner mainly utilizes information retrieval techniques, such as the Boolean model [23,32,55], the vector space model [23,34], and the PageRank algorithm [35]. The DL-based models take advantage of DL given its powerful ability in learning language representations [4,7,14]. In particular, pre-trained models (PTMs) have gained remarkable attention. Figure 1 illustrates the general process of using deep learning models for code search, where the DL model learns the mapping between queries and snippets from a large set.
PTMs were proposed initially for NLP tasks [37]. A PTM is first fed with a huge text corpus to learn the representations. Then the PTM is transferred to a downstream task by fine-tuning on a specific dataset. The performance of a PTM is usually better than that of a task-focused model trained on a specific dataset directly. Typical PTMs are Google's bidirectional encoder representations from transformers (BERT) [6] and Facebook's robustly optimized version, RoBERTa [29]. Given their remarkable capability, researchers have been seeking to apply PTMs to source code analysis. Built on top of BERT and RoBERTa, Microsoft proposed CodeBERT [9], which learns the representations of both programming and natural languages. Further concerning the inherent structure of code, GraphCodeBERT was developed [15].

Deep Learning Testing
DL is a machine learning technique that learns complex patterns in data through multiple layers of neurons that mathematically transform the data, where the connections between neurons form the data flow. Due to this complex computing logic, DL models lack interpretability [59] and require comprehensive testing before being deployed in real-world applications. DL testing refers to evaluating the quality of DL systems for further deployment [20,57]. The testing-related works in the literature mainly focus on the domains of CV and NLP, while very few consider deep code search systems [46].
As a new type of data-driven software, DL models have advantages in learning features from a large input space. However, the input space is expected to cover all possible cases in the real world to ensure performance, which is infeasible. In practice, only a fixed training set is applied to approximate the input space, and, undoubtedly, this approximated input space is much smaller than the real one. Hence, generally, a DL model performs well on designed test data during development but exhibits performance degradation on unseen test data when deployed in the real world [18,24].
Test input selection is essential for developers to estimate a DL model's performance after deployment and has been well studied for classification tasks in the domains of computer vision and natural language processing [5,26]. This technique decides which test data should be used from the available unlabeled tests to cut the cost associated with the labeling effort. Instead of a manual and ad hoc way, these tests can be selected strategically based on the behavior of DL models. For instance, Chen et al. proposed the practical accuracy estimation (PACE) [5] to approximate the accuracy of classifiers and regression models. PACE utilizes the model output at different levels (e.g., first layer, last hidden layer, confidence output) as data features. Based on the extracted features, PACE performs clustering to group data into different clusters, from which representative data are selected proportionally and separately. For the same tasks, Li et al. proposed cross-entropy-based sampling (CES) [26]. The main idea is to select data that have the minimum cross-entropy with the entire test set. CES also uses the confidence output (prediction probability of belonging to a certain class) as the data feature. However, concerning deep code search, the prediction probability is not available from the model; only the contextual representations of the text queries are, with the absence of the corresponding code snippets. Therefore, these probability-based SOTA approaches are not directly applicable. Some other test input selection approaches have been proposed to locate error inputs on which the model has not learned sufficiently. Note that these approaches are also called test input prioritization in the literature [5]. Pei et al. proposed the first white-box testing approach, DeepXplore [36], to find test inputs that trigger the error behavior of DL models with a high neuron coverage. Similarly, References [25,33,54] also select data based on the neuron coverage. Some other approaches utilize the prediction probability to select data. For instance, Shen et al. select data with a small probability ratio between the first two predicted classes [47]. Feng et al. identify test inputs with the largest Gini impurity [8]. However, these types of approaches mainly select test inputs to enhance the model performance, which differs from this article's focus.
Another technique relevant to test input selection is active learning from the machine learning community [44]. In active learning, a DL model is trained iteratively through multiple steps. At each step, a small set of training data is selected based on a specific acquisition function to update the pre-trained model from the previous step. The goal of both test selection metrics and acquisition functions in active learning is to reduce the labeling cost when given massive unlabeled data. The main difference is that selection metrics target the test data, while acquisition functions select data from the training set. In the literature, studies have shown that these acquisition functions perform effectively as test selection metrics [17-19]. For example, Sener and Savarese proposed the core-set selection [43] for the image classification task. The core-set approach tends to select representative data from the training set for the current training step. Nevertheless, this approach also requires the prediction probability to select data. Additionally, due to the greedy algorithm in its selection procedure, its execution cost is prohibitive and impractical.

KAPE
We first describe the problem that is targeted in this work. Next, before the detailed explanation of KAPE, we present a motivating example, through which one gets a preliminary insight into the key idea of our proposed KAPE. Finally, we explicitly present KAPE.

Problem Definition
Given a deep code search model f and its training set D, where D consists of query-code pairs, this article focuses on the problem of estimating the model performance P_T on a set of unseen data T. Specifically, T only includes a number of queries, and their corresponding code snippets are missing. However, the code snippets are required to precisely calculate the model performance.
To solve this problem, one could resort to the test input selection technique that is well studied in the CV and NLP fields, especially for classification tasks (e.g., image classification, sentiment classification). The core idea is to select a subset of data from the test set and manually assign the corresponding labels (code snippets in code search). This subset is assumed to be representative of the entire set, and, thus, the performance on this subset is considered as the estimated performance on the entire set. Formally, a subset S ⊂ T with |S| ≪ |T| is selected and labeled, and its performance P_S is used as an estimate of P_T. However, manually assigning the best-match code snippet to a test query is far more challenging than annotating the label (e.g., cat) of an animal in image classification or the review sentiment (e.g., positive, negative) in sentiment classification. Concerning this, we are interested in solving this problem without any manpower.

Motivating Example
Our intuition is that a model outputs the same code snippet if two queries are similar enough. Figure 2 presents a motivating example of our proposed kNN-based performance estimation (KAPE). In this example, the training set includes seven query-code pairs (d_1, d_2, . . ., d_7), and the test set has three queries. We first find the two nearest neighbors for each test query from the training set. Next, we obtain the individual model performance of each neighbor group. Concretely, we create a neighbor group of query-code pairs whose queries from the training set are the nearest neighbors of the test queries. The model outputs the performance on each individual pair (P_2, P_4, P_7). Finally, for each test query, given its two nearest neighbors, we assign the weights (e.g., ω_{1,2} is the weight of the second nearest neighbor of the first test query) and obtain the final result P.

Methodology
Our proposed KAPE consists of four main steps. Figure 3 presents an overview of KAPE, and Algorithm 1 gives the details.
Step 1: Feature extraction. Since KAPE relies on the semantic similarity between the unseen test set and the training set to conduct performance estimation, the first step is to determine the query feature for the following similarity measure. In this article, we utilize the contextual representation of natural language queries obtained from the trained model. The reason is that, in general, deep code search models utilize the similarity between the representation vectors of queries and code snippets to find the best match. Namely, if a training query has the same representation vector as a test query, then the model will identify the same code snippet. Line 1 extracts the natural language representation vectors of the training and test queries, respectively.
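To make Step 1 concrete, the sketch below extracts query representations with a generic HuggingFace encoder. It is a minimal sketch, not our exact pipeline: the checkpoint name, the [CLS]-token pooling, and the embed_queries helper are assumptions for illustration; a fine-tuned code search model may pool differently.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint for illustration; any CodeBERT-style encoder works.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base").eval()

@torch.no_grad()
def embed_queries(queries, batch_size=32):
    """Return an (n, t) matrix of contextual query representations."""
    chunks = []
    for i in range(0, len(queries), batch_size):
        batch = tokenizer(queries[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        out = encoder(**batch)
        # Pooling choice is an assumption: we take the [CLS] vector.
        chunks.append(out.last_hidden_state[:, 0, :])
    return torch.cat(chunks, dim=0)
```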
Step 2: Similarity calculation. We utilize the cosine similarity to capture the similarity between two queries, which is defined as:

$$\text{sim}(u_i, v_j) = \frac{\sum_{l=1}^{t} u_{i,l} \, v_{j,l}}{\sqrt{\sum_{l=1}^{t} u_{i,l}^2}\,\sqrt{\sum_{l=1}^{t} v_{j,l}^2}} \qquad (2)$$

where u_i and v_j are the contextual representations (vectors) of the ith query in the training set and the jth query in the test set, respectively, and t is the length of the representation vector. For each test query, we compare its similarity with all the queries in the training set (Lines 2-7).
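In a vectorized form, Equation (2) over all pairs of test and training queries (Lines 2-7) reduces to a normalized matrix product; a minimal sketch (the helper name similarity_matrix is ours):

```python
import torch
import torch.nn.functional as F

def similarity_matrix(test_reprs, train_reprs):
    """Pairwise cosine similarity between every test query (rows)
    and every training query (columns), as in Equation (2)."""
    u = F.normalize(test_reprs, dim=1)   # unit-length test vectors
    v = F.normalize(train_reprs, dim=1)  # unit-length training vectors
    return u @ v.T                       # (n_test, n_train) matrix
```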
Step 3: kNN locating. Given the similarity matrix calculated in Step 2, the k-nearest neighbors (kNN) algorithm is applied, for each test query, to locate the first k most similar queries from the training set (Lines 8-11). kNN is a simple and easy-to-implement machine learning algorithm that is widely employed in recommendation systems [1], classification [58], and regression [45]. We assume that similar queries, respectively from the training and test sets, will trigger a similar model performance. Table 1 lists 10 pairs of queries and the corresponding similarities. The examples are from the Ruby dataset and the GraphCodeBERT model (more details can be found in Section 4.2). The semantic similarity between two queries is measured by the cosine similarity [50] widely used in NLP. The last two columns (reciprocal rank; more details in Section 4.2) show the model performance on each test query and its corresponding matched training query. In most cases (7 of 10), the test query shares the same performance as its matched one.
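Given the similarity matrix, locating the k nearest neighbors (Lines 8-11) is a plain top-k selection; a minimal sketch in PyTorch:

```python
import torch

def k_nearest_neighbors(sim, k):
    """For each test query (a row of sim), return the indices N of its
    k most similar training queries and the similarity matrix NS."""
    NS, N = torch.topk(sim, k, dim=1)  # similarities and indices, descending
    return N, NS
```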
Step 4: Performance estimation. Finally, we estimate the model performance based on the kNN of the test data and the corresponding similarities. As shown in Table 1, similar queries do not always have the same performance. For example, the fifth test query has a similarity of 0.8649 to its most similar query from the training set, but the model performance is 0.5 and 0.125, respectively. This is reasonable, because if a test query is very similar to several training queries, then the output code snippets can vary. As a result, a single selected training query-snippet pair cannot precisely approximate the model performance on this test query. Concerning this, we propose to calculate the weight of each nearest neighbor to approximate the model performance. We undertake this step in two sub-steps (individual training subset evaluation and weight calculation). Concretely, first, for each i = 1 → k (Line 12), we extract a subset of training data including the query-snippet pairs of the ith nearest neighbors of the test queries (Line 13). The subset has the same size as the test set, and we obtain the performance on each query-snippet pair by simply evaluating the model (Line 14).
In weight calculation, we utilize the similarity between queries to obtain the weight. Given the similarity matrix NS, for each test query, we first determine the neighbors/similarities to use. This is to avoid the impact of low similarities. For example, suppose the similarities between a test query and its 5 nearest neighbors from the training set are 1, 0.8743, 0.8718, 0.8472, and 0.8443, respectively. The actual performance on this test query is 0.2, while the performance on its 5 neighbors is 0.2, 0.2, 1, 1, and 1, respectively. Using the first nearest neighbor can precisely estimate the performance. However, if we take the average of the 5 results or use the similarity as the weight to calculate the final result, then there will be an inevitable difference. To solve this problem, we use the Z-Score [53] to adaptively identify the to-be-used neighbors (Lines 16-20). In statistics, the Z-Score tells how far a data point is from the mean, which can be used to identify outliers in a set of data [39]. A Z-Score of 1.0 indicates that the value is one standard deviation from the mean. A value with a high Z-Score is usually considered an outlier in the group. In this article, we experimentally take the neighbors that have an absolute Z-Score of less than 1 into consideration. Thus, given the Z-Scores z_i of the ith test query and its similarity row NS_i, the weight of each neighbor is:

$$\omega_{i,j} = \begin{cases} \dfrac{NS_{i,j}}{\sum_{l:\,|z_{i,l}| \leq 1} NS_{i,l}}, & |z_{i,j}| \leq 1 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

Finally, the model performance is calculated given the weights (Line 21) and the performance on each individual training subset (Lines 22-23).
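Putting Step 4 together, the following sketch implements the weighting rule of Equation (3): neighbors with an absolute Z-Score above 1 are dropped, the remaining similarities are normalized into weights, and the weighted neighbor performances are averaged over the test set. Degenerate cases (k = 1, or all neighbors of a query filtered out) would need extra guarding in practice.

```python
import torch

def kape_estimate(NS, neighbor_perf):
    """NS:            (n, k) similarities to the k nearest neighbors.
    neighbor_perf: (n, k) performance (reciprocal rank) on the training
                   pair of each neighbor, i.e., the P_i vectors."""
    mu = NS.mean(dim=1, keepdim=True)
    sigma = NS.std(dim=1, keepdim=True)
    z = (NS - mu) / sigma                 # Z-Score of each neighbor (Line 18)
    keep = (z.abs() <= 1).float()         # drop outlier neighbors
    w = NS * keep
    w = w / w.sum(dim=1, keepdim=True)    # Equation (3): normalized weights
    per_query = (w * neighbor_perf).sum(dim=1)
    return per_query.mean().item()        # estimated performance on T
```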

EXPERIMENTAL SETUP
Our experiments aim to address four research questions:
RQ1 Effectiveness. How effective is KAPE in estimating the model performance given an unseen test set?
RQ2 k sensitivity. How sensitive is KAPE to the setting of k in the kNN algorithm?
RQ3 Impact of data distribution. Does the data distribution w.r.t. the similarity affect KAPE's effectiveness?
RQ4 Impact of nearest neighbors' weights. What is the impact of the weight calculation on KAPE?
RQ1 gives an insight into KAPE's effectiveness in assessing the model performance using different datasets and models. With RQ2, we analyze whether KAPE performs consistently given different settings of its only parameter k, which demonstrates how flexible KAPE is. Since KAPE takes advantage of the similarity between the test queries and the training set to approximate the model performance, we conduct RQ3 to explore whether KAPE is stable under different data distributions w.r.t. the similarity. Finally, as an important component of KAPE, the weight calculation adapts both the number of used neighbors for each test data point and the weight of each nearest neighbor, which makes KAPE practical for real-world applications. We undertake an ablation study for RQ4 to verify the importance of this component.

Implementation Details
All experiments were conducted on a high-performance computing cluster, where each node runs a 2.6 GHz Intel Xeon Gold 6132 CPU with an NVIDIA Tesla V100 16 GB SXM2 GPU. We implement KAPE and the baseline methods using the PyTorch 1.6.0 framework. We repeat each experiment three times to reduce the influence of randomness. Additionally, for reproducing the results, we use fixed random seeds of 0, 1, and 2 in the experiments. Due to the space limitation, we only report the results on the largest dataset, PHP, for RQ3; the remaining results, which corroborate our findings, are available on our companion project website [16].

Datasets, Models, and Performance Measure
Datasets and models. We use the six benchmark datasets provided by the CodeSearchNet challenge [21], covering different programming languages, namely, JavaScript, Java, Python, Ruby, PHP, and Go. For all the datasets, we utilize three state-of-the-art pre-trained models for deep code search, RoBERTa [29], CodeBERT [9], and GraphCodeBERT [15], which are superb in learning the contextual representations of both natural and programming language data. The pre-trained models are obtained by the implementation provided by the CodeXGLUE project [31] (the epoch number is 5, and the other parameters take their defaults). Each model is fine-tuned using the training set and tested on the validation set. The test set is regarded as unseen data and untouched in the fine-tuning procedure. Table 2 lists more details of each dataset.

MRR. We adopt the widely used mean reciprocal rank (MRR) [14,31] in our experiments to measure the model performance. MRR is calculated by:

$$\text{MRR} = \frac{1}{|T|} \sum_{i=1}^{|T|} \frac{1}{rank_i}$$

where rank_i is the position of the matched code snippet in the returned results of the ith test query. The higher the MRR, the better the search performance. Note that P_{i,j} in Algorithm 1 is equal to 1/rank_i instead of the MRR. The reciprocal rank of each query in Table 1 is also 1/rank_i.
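For reference, MRR can be computed directly from the 1-based ranks of the matched snippets; a minimal sketch:

```python
def mean_reciprocal_rank(ranks):
    """MRR over the 1-based ranks of the matched snippet per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three queries whose matched snippets ranked 1st, 4th, and 2nd:
# mean_reciprocal_rank([1, 4, 2]) == (1 + 0.25 + 0.5) / 3 ≈ 0.5833
```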

Baseline Methods
Concerning that this is the first performance estimation work for deep code search, we take three test selection metrics (random sampling, PACE, and DeepGini) and three acquisition functions (LC, Margin sampling, and MaxEntropy) from active learning that are widely studied in the CV and NLP domains [5,19] as the baseline methods. Note that References [17-19] have demonstrated that the acquisition functions can act as test selection metrics. In addition, as KAPE selects data from the training set, we propose the baseline method of randomly selecting data from the training set. Without loss of generality, we use random sampling (test) to refer to the sampling from the test set and random sampling (train) for the sampling from the training set.
-Random sampling (test) A fixed number of test data is randomly selected and the corresponding code snippets are manually matched.
-PACE [5] The Practical ACcuracy Estimation (PACE) method first divides the test queries into different clusters using the HDBSCAN (hierarchical density-based spatial clustering of applications with noise) clustering algorithm. Then PACE utilizes the MMD-critic algorithm [22] to select the most representative data from each cluster proportionally concerning the cluster size. PACE was initially proposed for image classification and regression tasks.
-DeepGini [8] For a classification task (e.g., image classification), DeepGini selects the most informative data that have the highest Gini impurity. The Gini impurity measures how likely a sample is wrongly classified based on the prediction probabilities of all classes.
-LC [17] The least confidence (LC) metric selects data where the model has the least confidence (probability) in the most likely class label.
-Margin sampling [41] Similar to LC, margin sampling considers the prediction confidence. Instead of using the most likely class, it selects data that have the smallest difference between the first and second most probable class labels.
-MaxEntropy [17] This metric selects the most uncertain data where the Shannon entropy of the prediction probability is the highest. The only difference with DeepGini is the measure (Gini impurity vs. Shannon entropy) used to calculate the uncertainty.
-Random sampling (train) A set of training data is randomly selected from the training set.
Note that PACE, DeepGini, LC, Margin sampling, and MaxEntropy require the prediction probability to perform the clustering procedure or the uncertainty calculation. Since, in the deep code search task, the prediction probability is unavailable, these metrics are not directly applicable. To solve this issue, we calculate the similarity between each test query and its predicted 10 best-match code snippets from the training set and apply the softmax function [12] to obtain a probability for each best match; the resulting 10 probabilities are treated as the prediction probabilities of 10 classes (a sketch follows below).
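The sketch below illustrates this workaround under the assumption that the query and code representations have already been extracted; the helper name pseudo_probabilities is ours:

```python
import torch

def pseudo_probabilities(query_repr, code_reprs, top=10):
    """Softmax over the similarities of the 10 best-match snippets,
    treated as a 10-class prediction probability."""
    sims = torch.nn.functional.cosine_similarity(
        query_repr.unsqueeze(0), code_reprs)  # similarity to every snippet
    best, _ = torch.topk(sims, top)           # the 10 best matches
    return torch.softmax(best, dim=0)         # pseudo class probabilities
```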
Table 3 presents the four main differences between KAPE and the baseline methods. First, compared to the test selection metrics, where a subset of test data approximates the model performance on the entire test set, KAPE selects data from the training set to achieve this goal. Second, in test selection metrics, the selected data size depends on the given budget of manpower and is usually much smaller than the given test set size. By contrast, KAPE selects the same size of training data. Third, since KAPE relies on the training set to undertake the performance estimation, no manpower is required, which is more practical. Finally, due to the sampling randomness, the random manner (from test or training data) has low stability of performance estimation. Namely, the estimated performance by random sampling varies among several repetitions.
For the six test selection metrics, we use different labeling percentages, i.e., 1%, 3%, 5%, . . ., and 50%, where 1% means that 1% of the test data is selected. The sampling size of random sampling (train) is the same as the size of the unseen test set.

RESULTS AND DISCUSSION
In this section, we first compare the performance estimation effectiveness of KAPE and the baseline methods. Next, by assigning different ks, we explore KAPE's sensitivity to k. Third, we investigate whether the type of data influences KAPE's effectiveness, where the type of data refers to whether the test queries are very similar or very different to the training set. Finally, we discuss the necessity of the adaptive weight calculation in Section 3.3 via an ablation study.

RQ1: Effectiveness
Figure 4 shows the comparison between KAPE and the seven baseline methods based on the CodeBERT model. Compared to random sampling (train), KAPE always estimates the model performance more accurately for all datasets and models. We can first conclude that selecting data from the training set based on the semantic similarity between training and test data is more reasonable than simple test-independent sampling. However, the effectiveness of random sampling (test) and Margin sampling improves along with increasing the labeling percentage. The other test selection metrics perform inconsistently across different datasets. For instance, LC, MaxEntropy, and DeepGini perform well on JavaScript and Ruby but extremely badly on Java, PHP, and Go. In particular, when the labeling percentage is less than 10%, these three metrics estimate the model performance as around 0, which is far from the ground truth. In addition, PACE improves its effectiveness along with the increment of the labeling budget on JavaScript, Python, and Ruby but degrades on the other datasets. By contrast, in most cases, KAPE outperforms the baselines on the six datasets regardless of the labeling percentage. For instance, on PHP, random sampling (test), LC, MaxEntropy, Margin sampling, and DeepGini can only reach a competitive performance when manually matching more than 50% (14,196) code snippets for the unseen test queries. In addition, due to the sampling randomness, there is a performance deviation in random sampling (train) and random sampling (test) over the three repetitions, which is avoided by KAPE. Table 4 lists the results of KAPE on the PHP dataset based on the RoBERTa and GraphCodeBERT models. KAPE always outperforms random sampling (test), random sampling (train), DeepGini, Margin sampling, and MaxEntropy regardless of the labeling percentage and model.

RQ2: k Sensitivity

Given the GraphCodeBERT model, Table 5 shows that KAPE has a greater deviation on Ruby (0.009) and Python (0.0053) than on the other datasets. We conjecture the reason is that the unseen test sets of these two datasets include many data that have very low and different similarities to their nearest neighbors. Thus, more neighbors and their corresponding similarities are considered in the weight calculation. To verify this, we conduct the next experiment to explore the impact of data similarity on KAPE.
Answer to RQ2: KAPE's effectiveness is stable under different k settings, with a slight deviation of less than 0.01.

RQ3: Impact of Data Distribution
Concerning that KAPE is based on the semantic similarity between test queries and the training set, we explore, in this research question, the impact of data distribution. Namely, we assess KAPE's performance in challenging scenarios where the test set closely resembles or significantly differs from the training set. Here, the data distribution refers to the distribution based on semantic similarity. We evenly split each test set into two subsets, i.e., a similar set and a different set. Concretely, the similarity between each test query and its nearest neighbor is first calculated. Then, a test query is grouped into the similar set if its similarity is greater than the median over the test set, and into the different set otherwise. Figure 5 shows the density distribution of similarities of the two sets for each dataset.
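A minimal sketch of this split, assuming nn_sim holds each test query's similarity to its nearest training neighbor:

```python
import torch

def split_by_similarity(nn_sim):
    """Evenly split test indices into a 'similar' and a 'different' set
    by the median nearest-neighbor similarity."""
    order = torch.argsort(nn_sim, descending=True)
    half = len(nn_sim) // 2
    return order[:half], order[half:]  # similar set, different set
```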
In PHP and Go, the difference between the similar and different test sets is smaller than in the other datasets, since most data have a similarity between 0.8 and 1.0. By contrast, in Ruby, the lower bound of the data similarity reaches 0.5.
Figure 6 shows the results on the similar and different sets of the PHP dataset. In general, KAPE still outperforms the baseline methods in most cases and performs better on the similar test sets. Concerning the similar set, KAPE performs the best regardless of the model, achieving differences from the ground truth of only 0.0118, 0.0079, and 0.0112 given RoBERTa, CodeBERT, and GraphCodeBERT, respectively. Random sampling (train) still performs worse than KAPE, as in Section 5.1. Random sampling (test) improves with a greater labeling percentage but consistently performs worse than KAPE, even with a labeling percentage of 50%. LC, MaxEntropy, and DeepGini estimate the model performance as around 0 when the labeling percentage is less than 20% (5,678 test data). The effectiveness of Margin sampling and PACE varies considerably when increasing the labeling percentage. For instance, Margin sampling performs the worst when 15% of the test queries are manually matched with code snippets and improves when the percentage is increased or decreased. Concerning the different sets, in general, all test selection metrics improve their effectiveness when increasing the labeling percentage. For all models, when the labeling percentage is less than 30%, KAPE still outperforms all these metrics. By comparison, the estimation error increases on the different sets and is up to 0.0660 (RoBERTa). The reason is that when the nearest neighbors from the training set are similar to the test data, the performance on the training data can be transferred to the test set more reliably.

Answer to RQ3: Although KAPE is based on the similarity between the test and training sets, its effectiveness remains stable across various (similar or different) data distributions. In addition, KAPE benefits more from a similar data distribution where most test data are close to the training set.

RQ4: Impact of Nearest Neighbors' Weights
Recall that in the last step of KAPE, given k, the number (≤ k) of used nearest neighbors of each test query and the weight of each nearest neighbor are assigned adaptively based on the Z-Score and the similarity (Section 3.3). To verify the importance of this adaptive calculation, we conduct an ablation study. We compare KAPE to three other manners of obtaining the estimated performance (Equation (3), Line 21 in Algorithm 1):

-Fixed k, equal weight: For each test query, we assign equal weights to its k nearest neighbors. The variant of Equation (3) is defined as:

$$\omega_{i,j} = \frac{1}{k}$$

-Fixed k, adaptive weight: We utilize the similarity to determine the weight:

$$\omega_{i,j} = \frac{NS_{i,j}}{\sum_{l=1}^{k} NS_{i,l}}$$

-Adaptive k, equal weight: We utilize a flexible number of neighbors for each test query based on the Z-Score and re-define Equation (3) as:

$$\omega_{i,j} = \begin{cases} \dfrac{1}{|\{l : |z_{i,l}| \leq 1\}|}, & |z_{i,j}| \leq 1 \\ 0, & \text{otherwise} \end{cases}$$

Table 6 presents the statistical result of the adaptive k calculated by the Z-Score method. It shows that, with the adaptive manner, test queries have different numbers of nearest neighbors contributing to the weight calculation. For example, in Go with GraphCodeBERT, given the pre-set maximum k = 10, the 14,291 test queries use 8.97 nearest neighbors on average, and the minimum number drops to 5. Concretely, only 43.96% (6,282) of the test data fully take all 10 neighbors, and only 1 test query uses 5 neighbors. 20.26% (2,896), 25.21% (3,604), 9.94% (1,406), and 0.71% (102) of the test data use 9, 8, 7, and 6 neighbors for the weight calculation, respectively. This is reasonable given that, for each test query, the similarity between it and its nearest neighbors can be significantly different from the others.
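For comparison, the four weighting schemes can be sketched in one helper (a sketch following the equations above; the scheme names are ours):

```python
import torch

def ablation_weights(NS, z, scheme):
    """Weights under the four schemes compared in the ablation study."""
    k = NS.shape[1]
    keep = (z.abs() <= 1).float()                 # Z-Score neighbor filter
    if scheme == "fixed_k_equal":
        w = torch.ones_like(NS) / k
    elif scheme == "fixed_k_adaptive":
        w = NS / NS.sum(dim=1, keepdim=True)      # similarity-only weights
    elif scheme == "adaptive_k_equal":
        w = keep / keep.sum(dim=1, keepdim=True)  # equal over kept neighbors
    else:  # adaptive k, adaptive weight: KAPE's default, Equation (3)
        w = NS * keep
        w = w / w.sum(dim=1, keepdim=True)
    return w
```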
Table 7 presents the results of the ablation study on the JavaScript dataset. Regardless of the model, the adaptive setting of both k and the weight achieves the best result in most cases. With a very small k (k = 1 and k = 2), the four ways of weight calculation obtain the same results, but the estimation precision can be lower than with greater ks. For example, on RoBERTa, using seven nearest neighbors reaches a difference of 0.0005, while only using the first neighbor yields 0.0011. When increasing k, the advantage of the adaptive setting stands out. Additionally, given a fixed k, using the adaptive weight is better than the equal manner in most cases, which also holds given an adaptive k. Furthermore, given the adaptive weight, using an adaptive k is always better.
Answer to RQ4: Our proposed adaptive strategy to determine the number of nearest neighbors and the corresponding weight contributes positively to KAPE's effectiveness.

Human Evaluation
We conducted a human study to evaluate the effectiveness of KAPE. Five experienced software developers, familiar with Java, were invited to participate in the study. First, we obtained the answers of the RoBERTa and CodeBERT models (see more details in Section 4.2) to 23 Java-related queries collected from Stack Overflow. Subsequently, the developers were asked to manually inspect and verify the correctness of these answers. Specifically, each developer was given the answers generated by the two models along with the corresponding queries, all while being unaware of the specific details (e.g., name, architecture, or any other distinguishing characteristics) of the underlying model. The results in Figure 7 reveal that, on average, RoBERTa provided 16.4 out of 23 correct answers, whereas CodeBERT yielded 17.8 out of 23 correct answers. This suggests that the developers considered CodeBERT to be the superior search model for these particular queries. In parallel, we employed KAPE to estimate the MRR for both RoBERTa and CodeBERT. Notably, CodeBERT was estimated to have a higher MRR (0.9783) than RoBERTa (0.9565), which aligns with the conclusion from the human inspection.

Strengths and Limitations
Strengths. First, unlike the test selection metrics, where a subset of test data is selected and manually labeled, KAPE does not require manpower for the test data. Second, although based on the similarity between training and test data, KAPE is flexible to different data distributions, e.g., unseen test data that are very similar or very different to the training set.
Limitations. Since KAPE performs the performance estimation based on the semantic similarity between training and test queries, the training set is required to be accessible. In addition, as demonstrated by RQ3 in Section 5.3, KAPE benefits more when the test data are similar to the training set than when they are different. For highly dissimilar test data, a new method is in demand for accurate performance estimation.

Threats to Validity
The internal threat mainly comes from the implementations of the baseline methods, KAPE, and the model preparation and testing. For random sampling (test) and random sampling (train), we apply the random module in Python. We use the original implementation of PACE [5] with the same parameters, e.g., the maximum number of clusters and the MMD-critic-related settings. The definitions of DeepGini [8], LC [17], Margin sampling [41], and MaxEntropy [17] are simple, and these metrics are easy to implement. For model preparation and evaluation, we use the original implementation on GitHub [31] provided by Lu et al. For KAPE, the cosine similarity calculation is implemented using the public library SciPy [42].
The external threat is due to the selected datasets, models, baseline methods, and evaluation measures. Regarding the datasets, we test on all six benchmark datasets provided by the CodeSearchNet challenge [21]. For the models, we employ three popular and state-of-the-art pre-trained models for deep code search. For comparison, concerning that our work is the first of its kind and that the test input selection metrics from other fields are not directly applicable, we consider the most widely used baseline method, random sampling (test), and apply different labeling percentages. In addition, since KAPE utilizes training data to estimate the model performance, we also implement the baseline method random sampling (train), which selects training data for comparison. Regarding the performance measure, we consider the widely used MRR. There are other measures, such as Answered@k [4,52], which can be considered in a further study.
The construct threat mainly lies in the sampling randomness of the baseline methods. To reduce the impact of randomness, we repeat each experiment three times and report both the average and the standard deviation. Additionally, to allow for reproducibility, we use fixed random seeds (0, 1, and 2) for all the environment settings (e.g., the random module in Python, NumPy, Torch, and CUDA's manual seed setting).
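For completeness, the seed fixing described above can be reproduced with a helper like the following (a sketch; the function name is ours):

```python
import random
import numpy as np
import torch

def set_seed(seed):
    """Fix all randomness sources we rely on (seeds 0, 1, and 2
    across the three repetitions)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```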

CONCLUSION
In this article, we introduce KAPE, a manpower-free testing approach that takes advantage of the training set to efficiently estimate the performance of deep code search models on unseen test data. Via the kNN algorithm, we map the unseen test data to the training space and assign adaptive weights to the neighbors based on the semantic similarity between training and test data. Experimental results on six programming languages and three pre-trained models demonstrate that KAPE is effective in estimating the model performance and outperforms the test selection metrics. In addition, we show that KAPE is randomness-free, stable to its parameter, and flexible to data distributions.
For future work, more advanced deep code search models and evaluation measures will be investigated.

APPENDIX A: BEYOND PRE-TRAINED MODELS
In addition to pre-trained models, we evaluate KAPE on other deep learning models that do not benefit from a large code corpus in advance and are trained from scratch using the training data.

DeepCS [14]. Gu et al. proposed the CODEnn (Code-Description Embedding Neural Network) model and integrated it into the DeepCS tool. CODEnn embeds code snippets into code vectors via a code embedding network (CoNN) and embeds queries into description vectors via a description embedding network (DeNN). Specifically, CoNN takes the method name, API invocation sequence, and tokens contained in the source code as features to embed code.

UNIF [4]. Cambronero et al. built the UNIF (Embedding Unification) model that simply uses a bag-of-words-based network. In particular, given a bag of code embedding vectors, UNIF designs an attention-based weighting scheme to calculate the weight of each vector.

CARLCS-CNN [48]. Unlike DeepCS and UNIF, which learn individual embeddings for code snippets and queries, respectively, CARLCS-CNN (co-attentive representation learning code search-CNN) learns interdependent representations with a co-attention mechanism. Similar to DeepCS, CARLCS-CNN also considers the method name, API invocation sequence, and tokens as features of code.

Tok-Att [56]. Introduced by Chen et al., Tok-Att only exploits the token feature of code to generate code embeddings.

GraphSearchNet [28]. Proposed by Liu et al., GraphSearchNet consists of a program encoder and a summary encoder that learn the vector representations of code and queries, respectively. Concretely, GraphSearchNet constructs graphs for code snippets and queries to capture the structural information. In the corresponding encoder (e.g., the program encoder), the graphs (e.g., graphs of code) are fed into a bidirectional gated graph neural network (BiGGNN) with a multi-head attention layer that learns the node and word embeddings.
Dataset and implementation. For DeepCS, UNIF, and CARLCS-CNN, since they share similar data features, we use two datasets (Example and GitHub, with Java code files) provided by DeepCS to perform the evaluation under the TensorFlow 2.0.0 framework. For GraphSearchNet, we use its provided dataset named Python. Each model is trained for 500 epochs, and the best model with respect to the validation set is saved. Table 8 lists the details of the provided datasets and the model performance. Figure 8 shows the comparison results. In general, KAPE outperforms the baseline methods regardless of the dataset and the labeling budget.

Fig. 2. An example of performance estimation by KAPE on three test queries with seven training query-code pairs.

Fig. 3. Overview of KAPE.

ALGORITHM 1: KAPE: kNN-based performance estimation

Input: f: trained model; D: training set of query-snippet pairs; T: unseen test set of queries; k: hyperparameter of kNN
Output: P_T: model performance on T

/* Step 1: Obtain natural language contextual representations */
1: extract the representation vectors of the training and test queries with f
/* Step 2: Calculate query similarities */
2-7: for each test query, compute its cosine similarity to every training query by Equation (2)
/* Step 3: Locate k nearest neighbors */
8: N, NS = ∅, ∅  // matrix of the first k similar queries and corresponding similarity matrix
9-11: for each test query, store its k most similar training queries in N and their similarities in NS
/* Step 4: Performance estimation */
12: for i = 1 → k do
13:   D_i = the subset of training data whose queries are the ith nearest neighbors of the test queries
14:   P_i = Evaluate(f, D_i)  // P_i is a vector of model performance on each data
15: end
16: for i = 1 → n do
17:   for j = 1 → m do
18:     z_{i,j} = (NS_{i,j} − μ) / σ  // μ and σ are the mean and standard deviation of NS_i, respectively
19:   end
20: end
21: W = WeightCalculate(NS, z)  // calculate the weights of each nearest neighbor by Equation (3)
22-23: aggregate the weighted neighbor performances into P_T and return P_T

Fig. 4. Effectiveness comparison between KAPE and baseline methods given the CodeBERT model. Groundtruth: the actual MRR of the model on the test set. Shaded area illustrates the standard deviation of three experiment repetitions.

Fig. 5. Similarity density distribution of test sets where data is "similar" to or "different" from the training set. Model: CodeBERT.

Fig. 6. Effectiveness of KAPE and baseline methods given the similar (first row) and different (second row) PHP test set. Groundtruth: the actual MRR of the model on the test set. Shaded area illustrates the standard deviation of three experiment repetitions.

Fig. 7. Results of the user study to evaluate the effectiveness of KAPE. The x-axis represents the five independent participants, and the y-axis corresponds to the number of correct answers identified by participants.


Fig. 8. Effectiveness comparison between KAPE and baseline methods given different models. (a)-(d): results on the Example dataset. (e)-(h): results on the GitHub dataset. (i): result on the Python dataset. Groundtruth: the actual MRR of the model on the test set. Shaded area illustrates the standard deviation of three experiment repetitions.

Table 1. Examples of the Most Similar Queries from the Training Set for Queries from the Test Set. The reciprocal rank of the ground truth in the result list is the model performance we consider in this article. Dataset: Ruby. Model: GraphCodeBERT. For more details of the dataset, model, and performance measure, please refer to Section 4.2.

Table 2. Summary of Datasets and the MRR on the Validation (Left) and Test Sets (Right)

Table 3. Differences between KAPE and Baseline Methods

Table 4. Effectiveness Comparison between KAPE and Baseline Methods on PHP Given the RoBERTa and GraphCodeBERT Models

Table 7. Ablation Study on the k Nearest Neighbors' Weights of KAPE. Values highlighted in grey indicate the best performance. Groundtruth: the actual MRR of the model on the test set. Dataset: JavaScript.

Table 8. Summary of Three Datasets Provided by DeepCS and the MRR on the Validation and Test Sets by Different Models