Classifying Source Code: How Far Can Compressor-based Classifiers Go?

Pre-trained language models of code, which are built upon large-scale datasets, millions of trainable parameters, and high computational cost, have achieved phenomenal success. Recently, researchers have proposed a compressor-based classifier (Cbc) that trains no parameters yet is reported to outperform BERT. We conduct the first empirical study to explore whether this lightweight alternative can accurately classify source code. Our study goes beyond applying Cbc to code-related tasks. We first identify an issue in the original implementation that overestimates Cbc. After correction, Cbc's accuracy on defect prediction drops from 80.7% to 63.0%, which is still comparable to CodeBERT (63.7%). We also find that hyperparameter settings affect its performance. Furthermore, results show that Cbc can outperform CodeBERT when the training data is small, making it a good alternative in low-resource settings.


INTRODUCTION
As investigated by a recent survey [8], pre-trained language models (e.g., CodeBERT [5]) have demonstrated promising results in a variety of code-related tasks, including defect prediction. Behind this success are large-scale datasets, millions of trainable parameters, and high computational cost. For example, CodeBERT is a model with 125 million parameters trained on 8.5 million data points with 16 NVIDIA Tesla V100 GPUs. Even fine-tuning it on a small dataset (e.g., Devign [19] with 21,854 examples) takes 5 hours on a GTX 2080 GPU. Researchers [15] point out that such a large model is not suitable for deployment in modern IDEs, which encourages us to explore lightweight alternatives.
Recently, Jiang et al. [9] propose a compressor-based classifier (Cbc), which requires no trainable parameters. Jiang et al. [9] report promising results: Cbc outperforms or is comparable to BERT [3] on a variety of NLP datasets. Cbc is also faster than pre-trained models: we find it takes only 10 minutes to finish the evaluation on the Devign [19] dataset using a desktop CPU. We conduct the first empirical study to evaluate Cbc on code-related tasks.
In this study, we first point out a potential issue in the Cbc implementation that overestimates the model's performance. After correction, the accuracy of Cbc decreases from 80.7% to 63.0% on a defect prediction task, but it is still comparable to CodeBERT (63.7%) and even higher than other models such as CodeTrans (63.03%) [4] and CoTexT (61.48%) [13]. Second, we show that hyperparameter settings affect the performance. We evaluate two strategies to break ties in the $k$NN classifier: (1) random selection and (2) decrementing $k$ until the tie is broken; the former is found to be more effective via a statistical test. Besides, the accuracy tends to increase as $k$ increases. Third, we find that Cbc can outperform CodeBERT when the training data is small, making it a good alternative in low-resource settings, in terms of both computational resources and training data.

BACKGROUND AND RELATED WORK

Compressor-based Classifier
In a nutshell, Cbc consists of three main components: (1) a lossless compressor, gzip [14], (2) a compressor-based distance metric, and (3) a $k$-Nearest-Neighbor ($k$NN) classifier. The compressor first compresses the inputs, aiming to represent them with as few bits as possible. Consider three inputs $x_1$, $x_2$, and $x_3$, where $x_1$ and $x_2$ share the same label (e.g., both are defective) and $x_3$ has a different label (e.g., non-defective). Let $C(x_1)$ denote the length of the compressed $x_1$ and $C(x_1 x_2)$ denote the length of the compressed concatenation of $x_1$ and $x_2$. Then $C(x_1 x_2) - C(x_1)$ is the number of additional bits required to encode $x_2$ given $x_1$. Intuitively, $C(x_1 x_2) - C(x_1)$ is expected to be smaller than $C(x_1 x_3) - C(x_1)$, since $x_1$ and $x_2$ share more common information than $x_1$ and $x_3$. Jiang et al. [9] adopt a normalized variant of this idea, the normalized compression distance (NCD), as a more precise approximation of the 'information distance' between two inputs, which is formally defined as:

$$\mathrm{NCD}(x_1, x_2) = \frac{C(x_1 x_2) - \min\{C(x_1), C(x_2)\}}{\max\{C(x_1), C(x_2)\}}$$

The smaller the NCD, the more likely $x_1$ and $x_2$ share the same label.
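For concreteness, below is a minimal sketch of how such a classifier can be assembled, using Python's built-in gzip module as the compressor and a simple majority vote over the $k$ nearest neighbors; the function names and the tie handling are illustrative and are not taken from the original implementation.

```python
# A minimal sketch of a compressor-based classifier, assuming gzip as the
# compressor and a plain k-nearest-neighbor vote over NCD distances.
import gzip
from collections import Counter

def clen(s: str) -> int:
    """Length in bytes of the gzip-compressed string."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two inputs."""
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def predict(query: str, train: list[tuple[str, int]], k: int = 2) -> int:
    """Label the query by majority vote over its k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: ncd(query, ex[0]))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]  # ties resolved arbitrarily in this sketch
```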

Related Work
A series of empirical studies [10] have shown the strong performance of pre-trained language models of code on a variety of code-related tasks, including code search, code summarization, defect prediction, etc. For example, one of the most commonly evaluated models is CodeBERT [5], an encoder-only model that demonstrates strong performance in classifying source code. Given an input (i.e., a code snippet), it first converts the input into a vector called a code embedding and then uses a classifier to predict the label. Other relevant models include GraphCodeBERT [7], PLBART [1], CodeT5 [18], etc. We choose an important and popular task, defect prediction [19], to evaluate Cbc. We refer to a recent survey [8] for a comprehensive review of pre-trained language models of code.
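To make this pipeline concrete, the following is a minimal sketch of using CodeBERT as a binary classifier via the HuggingFace transformers library; the checkpoint name microsoft/codebert-base, the two-label head, and the example snippet are assumptions for illustration, not the exact fine-tuning recipe used in the cited works.

```python
# A minimal sketch: tokenize a code snippet, encode it with CodeBERT, and
# apply a sequence-classification head (defective vs. non-defective).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2  # binary defect prediction
)

snippet = "int div(int a, int b) { return a / b; }"  # hypothetical input
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits  # classification head on top of the embedding
print(logits.argmax(dim=-1).item())  # predicted label (head is untrained here)
```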
The compressor-based classifier we evaluate falls into a category of work that uses a compressor to approximate the distance between two inputs [2, 9, 12]. Another line of work [6, 11] uses a compressor to estimate entropy based on Shannon information theory for text classification. The classifier evaluated in our study is the most recent work and demonstrates promising results.

EMPIRICAL STUDY AND RESULTS
This paper presents the first empirical study to understand whether Cbc can generalize to software engineering tasks. We use a defect prediction dataset called Devign [19], consisting of 21,854 training and 2,732 testing examples. Beyond applying Cbc to this dataset, we conduct analyses not considered in the original study [9].
Correcting the Evaluation Metric. We find that the results reported in the original paper [9] might be overestimated and not achievable in practice. Specifically, the authors choose $k = 2$ in the $k$NN classifier and assume that the classifier can always choose the correct label when a tie happens (i.e., one neighbor has label 1 and the other has label 0). This issue is also confirmed in the discussion between the authors and interested users of Cbc [17]. To evaluate Cbc in a more realistic setting, we implement the $k$NN classifier with multiple $k$ values and two tie-breaking strategies, sketched below: (1) random selection and (2) decrementing $k$ until the tie is broken.
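The two strategies can be sketched as follows, assuming the training labels are already ordered by increasing NCD from the query; the helper names are illustrative rather than our exact implementation.

```python
# Sketch of the two tie-breaking strategies for the kNN vote.
import random
from collections import Counter

def vote_random(labels: list[int], k: int) -> int:
    """Strategy 1: pick one of the tied majority labels uniformly at random."""
    counts = Counter(labels[:k])
    top = max(counts.values())
    return random.choice([lab for lab, c in counts.items() if c == top])

def vote_decrement(labels: list[int], k: int) -> int:
    """Strategy 2: shrink k until a single label has a strict majority."""
    while k > 1:
        counts = Counter(labels[:k])
        (lab1, c1), *rest = counts.most_common(2)
        if not rest or c1 > rest[0][1]:
            return lab1
        k -= 1
    return labels[0]  # k == 1: the nearest neighbor decides
```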
We first run Cbc with the original implementation [16] and obtain a high test accuracy of 80.7%, much higher than the accuracy achieved by CodeBERT (63.7%). We then vary the $k$ value and apply the two tie-breaking strategies to simulate a more realistic evaluation setting.
The results are shown in Figure 1. We observe that the accuracy of Cbc drops significantly. When $k = 2$ and random selection is used to break ties, the accuracy is 59.4%, lower than CodeBERT. As $k$ increases, which generally means that the classification is based on a broader set of data points, the accuracy tends to increase. We also observe that the random selection strategy appears to be more effective than the decrement strategy. To validate this hypothesis, we conduct a paired t-test and find that the difference is statistically significant ($p < 0.05$). When $k = 20$ and random selection is used, the accuracy is 63.0%, comparable to CodeBERT.
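For reference, a paired t-test of this kind can be computed with SciPy as sketched below; the accuracy values shown are placeholders for illustration, not the measurements from our experiments.

```python
# Paired t-test comparing the two tie-breaking strategies across matched runs.
from scipy import stats

acc_random = [0.612, 0.625, 0.630, 0.618, 0.622]     # hypothetical accuracies
acc_decrement = [0.601, 0.608, 0.615, 0.603, 0.610]  # hypothetical accuracies

t_stat, p_value = stats.ttest_rel(acc_random, acc_decrement)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.05
```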
Data Size Analysis. Pre-trained models usually require a large amount of training data to achieve good performance, which also consumes considerable computational resources on specialized hardware, i.e., GPUs. We analyze how the performance of the two models changes as the size of the training data varies. Figure 2 shows the results. The data points for Cbc are obtained using the optimal hyperparameter setting found in the previous analysis (i.e., $k = 20$ and random selection). We observe for both models that the accuracy increases when more data is used. However, Cbc shows a more stable increase than CodeBERT. Due to the random nature of deep learning, the accuracy of CodeBERT fluctuates more than that of Cbc. When the training data is small (fewer than 6,000 examples), Cbc achieves better performance, indicating that Cbc can be a good alternative in low-resource settings, in terms of both computational resources and training data.
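As a rough sketch, the training-size analysis can be organized as below, where classify stands for any classifier (e.g., the Cbc predictor sketched earlier) and the subsample sizes are illustrative rather than the exact ones we plot.

```python
# Sketch of a training-size sweep: subsample the training set at increasing
# sizes and measure test accuracy at each size.
import random

def accuracy(train, test, classify):
    """classify(query, train) -> predicted label."""
    correct = sum(classify(code, train) == label for code, label in test)
    return correct / len(test)

def size_sweep(train, test, classify, sizes=(1000, 3000, 6000, 12000), seed=0):
    random.seed(seed)
    results = {}
    for n in sizes:
        subset = random.sample(train, min(n, len(train)))
        results[n] = accuracy(subset, test, classify)
    return results
```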

CONCLUSION AND FUTURE WORK
This paper presents the first empirical study on applying compressor-based classifiers to the defect prediction task. After correcting the evaluation and comparing it with CodeBERT, we find that Cbc can achieve performance comparable to pre-trained models of code. Further results show that Cbc can be a good alternative in low-resource settings, where training data is small and computational resources are limited. We hope our study can inspire more research on exploring lightweight alternatives to pre-trained models of code. In the future, we plan to extend the study, including evaluating generalizability on more datasets and analyzing how different compressors affect performance.

Figure 1. The classification accuracy of Cbc under different $k$ values and tie-breaking strategies.

Figure 2. How the performance of Cbc and CodeBERT changes when the training data size varies.