The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification

The use of modern Natural Language Processing (NLP) techniques has been shown to be beneficial for software engineering tasks, such as vulnerability detection and type inference. However, training deep NLP models requires significant computational resources. This paper explores techniques that aim at achieving the best usage of resources and available information in these models. We propose a generic approach, EarlyBIRD, to build composite representations of code from the early layers of a pre-trained transformer model. We empirically investigate the viability of this approach on the CodeBERT model by comparing the performance of 12 strategies for creating composite representations with the standard practice of only using the last encoder layer. Our evaluation on four datasets shows that several early layer combinations yield better performance on defect detection, and some combinations improve multi-class classification. More specifically, we obtain an average improvement of +2 in detection accuracy on Devign with only 3 out of 12 layers of CodeBERT and a 3.3x speed-up of fine-tuning. These findings show that early layers can be used to obtain better results using the same resources, as well as to reduce resource usage during fine-tuning and inference.


INTRODUCTION
Automation of software engineering (SE) tasks supports developers in the creation and maintenance of source code. Recently, deep learning (DL) models have been trained on large open-source code corpora and used to perform code analysis tasks [3, 8, 27, 38]. Motivated by the naturalness hypothesis stating that code and natural language share statistical similarities, researchers and tool vendors have started training deep NLP models on code and fine-tuning them on SE tasks [11]. Amongst others, such models have been applied to type inference [17], code clone detection [50], program repair [9, 15, 47, 48], and defect prediction [7, 30, 35, 44]. In NLP-based approaches, SE tasks are frequently translated to code classification problems. For example, detection of software vulnerabilities is a binary classification problem, bug type inference is a multi-class classification setting, and type inference is a multi-label multi-class classification task in case a type is predicted for each variable in the program.
Most modern NLP models build on the transformer architecture [42]. This architecture uses an attention mechanism and consists of an encoder that converts an input sequence to a representation through a series of layers, followed by decoder layers that convert this representation to an output sequence. Although effective in terms of learning capabilities, the transformer design results in multi-layer models that need large amounts of data for training from scratch. A well-known disadvantage of these models is the high resource usage required for training due to both model and data sizes. While a number of pre-trained models have been published recently, fine-tuning these models for specific tasks still requires additional computational resources [27].
This paper explores techniques that aim at optimizing the use of resources and information available in models during fine-tuning. In particular, we consider open white-box models, for which the weights from each layer can be extracted. We focus on encoder-only models, as they are commonly used for SE classification tasks, in particular transformer-based encoders. The standard practice in encoder models is to obtain the representation of the input sequence from the last layer of the model [14], while information from earlier layers is usually discarded [21]. That is, while the early layers are used to compute the values of the last layer, they are generally not considered as individual representations of the input in the way that the last layer is. To exemplify the amount of discarded information at inference: when fine-tuning a 12-layer encoder, such as CodeBERT [14], for bug detection, 92% of the code embeddings are ignored. However, it has been shown for natural language that the early layers of an encoder capture lower-level syntactical features better than the later layers [6, 24, 32, 40], which can benefit downstream tasks.
Inspired by the line of research that exploits early layers of models, we propose EarlyBIRD, a novel and generic approach for building composite representations from the early layers of a pre-trained encoder model. EarlyBIRD aims to leverage all information available in existing pre-trained encoder models during fine-tuning, to either improve results or achieve competitive results at reduced resource usage during code classification. We empirically evaluate EarlyBIRD on CodeBERT [14], a popular pre-trained encoder model for code, and four benchmark datasets that cover three common SE tasks: defect detection with the Devign and ReVeal datasets [20, 51], bug type inference with the data from Yasunaga et al. [47], and exception type classification [7]. The evaluation compares the baseline representation, which uses the last encoder layer, with results obtained via EarlyBIRD. We fine-tune both the full-size encoder and a pruned version of it with only several early layers present in the model. The latter scenario analyzes the trade-off between using only a partial model and the performance impact on SE tasks.

Contributions: In this paper, we make the following contributions: (1) We propose EarlyBIRD, an approach for creating composite representations of code using the early layers of a transformer-based encoder model. The goal is to achieve better code classification performance at equal resource usage, or comparable performance at lower resource usage.
(2) We conduct a thorough empirical evaluation of the proposed approach. We show the effect of using composite EarlyBIRD representations while fine-tuning the original-size CodeBERT model on four real-world code classification datasets. We run EarlyBIRD with 10 different random initializations of the non-fixed trainable parameters and mark the EarlyBIRD representations that yield statistically significant improvement over the baseline.
(3) We investigate the resource usage and performance of pruned models. We analyze the trade-off between removing the later layers of a model and the impact this has on classification performance.

Main findings: With EarlyBIRD, we achieve performance improvements over the baseline code representation with the majority of representations obtained from single early layers on the defect detection task, and with selected combinations on bug type and exception type classification. Moreover, among the reduced-size models with pruned later layers, we obtain a +2 average accuracy improvement on Devign with a 3.3x speed-up of fine-tuning, as well as a +0.4 accuracy improvement with a 3.7x speed-up on average for ReVeal.
The remainder of the paper is organized as follows. We present related work in Section 2 and provide background details of the study in Section 3. The methodology is described in Section 4, which is followed by the experimental setup in Section 5. We present and discuss results in Section 6 and conclude with Section 7.

RELATED WORK
Here, we give an overview of language models for SE tasks, and of recent encoder models specifically, as well as different approaches to using the early layers of encoder models.

Transformers in Software Engineering
The availability of open-source code and increased hardware capabilities popularized the training and usage of deep learning, including NLP and Large Language Models (LLMs), for SE tasks. To date, deep NLP models have already been applied in at least 18 SE tasks [28]. Pre-trained language models available for fine-tuning on SE tasks largely build on the transformer architecture, sequence-to-sequence models, and the attention mechanism [8, 9, 42]. One widely used benchmark to test different deep learning architectures on SE tasks is CodeXGLUE [27]. The benchmark provides data, source code for model evaluation, and a leaderboard ranking model performance on different tasks [27].
SE tasks can be translated to input sequence classification and to generation of code or text. Examples of generative tasks in SE are code completion, code repair, generation of documentation from code and vice versa, and translation between different programming languages. Such tasks are frequently approached with neural machine translation models. Full transformer models for translation from a programming language (PL) to a natural language (NL) or for PL-PL tasks include PLBART [1], PYMT5 [10], TFix [4], CodeT5 [43], and Break-It-Fix-It [47]. Alternatively, generative models can include only the decoder part of the transformer, as in GPT-type models. In this case, the decoder both represents the input sequence and transforms it into the output sequence. Decoder-based models for code include, for example, Codex and CodeGPT [8, 27].
In tasks that require the representation of code or documentation and its subsequent classification, encoder-only architectures are used more frequently than in translation tasks. Examples of code classification problems are code clone detection and the detection of general bugs, such as the presence of swapped operands, wrong variable names, syntax errors, or security vulnerabilities. A number of encoder models for code applied a widely used bi-directional encoder, BERT [12], pre-training it on code with some modifications of the input. In this way, the CodeBERT [14], GraphCodeBERT [16], CuBERT [20], and PolyglotCodeBERT [2] models were created. In detail, the 12-layer RoBERTa-based CodeBERT model was pre-trained on NL-PL tasks in multiple PLs and utilized only the textual features of code. Note that RoBERTa is a type of BERT model with optimized hyper-parameters and pre-training procedures [26]. Together with the decoder-only CodeGPT model, the encoder-only CodeBERT model was used as a baseline in CodeXGLUE. GraphCodeBERT utilizes both textual and structural properties of code to encode its representations. PolyglotCodeBERT is an approach that improves fine-tuning of the CodeBERT model on a multi-lingual dataset for a target task, even if the target task tests only one PL. This paper focuses on fine-tuning strategies which, in contrast to PolyglotCodeBERT, do not increase the resource usage of fine-tuning. CuBERT is a 24-layer pre-trained transformer-based encoder tested on a number of code classification tasks, including exception type classification. We test the performance of the proposed EarlyBIRD composite representations on defect detection, including one of the CodeXGLUE benchmarks, as well as on error and exception type classification tasks. However, the goal of this paper is to achieve improvement over the baseline model when it is fine-tuned with composite code representations. We do not aim to compare results with other models, but rather propose an approach that is applicable to transformer-based encoders for source code and show its performance gains compared to using the same model without the proposed approach.

Use of Early Encoder Layers
A number of studies have explored different approaches to using information from the early layers of DL models for sequence representation, such as probing single layers, pruning, and variable learning rates. One way to leverage information from early model layers is to give different priority to layers while fine-tuning the models [19, 39].
For example, the layer-wise learning rate decay (LLRD) strategy and re-initialization of late encoder layers yielded improvements over the standard fine-tuning of BERT on NLP tasks [49]. The LLRD strategy was initially developed to tune the later encoder layers with a larger learning rate. In this way, the later layers can be better adapted to the downstream task under consideration, because the later layers are assumed to learn complex task-specific features of input sequences [19]. Moreover, Peters et al. [33] showed that the performance of fine-tuning improves if the encoder layers are updated during fine-tuning, in comparison with training only the classifier on top of fixed (frozen) encoder layers.
Pruning the later layers of transformer models is another way to consider only early layers for fine-tuning [13, 31, 36]. Sajjad et al. [36] investigated how the performance of transformer models on NLP tasks is affected when reducing their size by pruning layers. They considered six pruning strategies, including dropping from different directions, alternated layer dropping, and dropping layers based on importance, for four pre-trained models: BERT [12], RoBERTa [26], XLNET [46], and ALBERT [22]. By pruning model layers, Sajjad et al. were able to reduce the number of parameters to 60% of the initial parameter set while maintaining a high level of performance. While the performance on downstream tasks varies in their study, the lower layers are critical for maintaining performance when fine-tuning for downstream tasks. In other words, dropping earlier layers is detrimental to performance. Overall, pruning layers reduces model size, which in turn reduces fine-tuning and inference time. In line with the work of Sajjad et al. [36], we extend our experiments with the pruning of later layers, keeping the earlier layers present in the model (see RQ2 in Section 6).
The use of information from single early layers in a number of EarlyBIRD experiments is also inspired by Peters et al. [32]. In their study, Peters et al. present empirical evidence that language models learn syntax and part-of-speech information in the earlier layers of a neural network, while more complex information, such as semantics and co-reference relationships, is captured better by deeper (later) layers. In another study, Karmakar and Robbes probed pre-trained models of code, including CodeBERT, on tasks of understanding syntactic information, structural complexity, code length, and semantic information [21]. While Karmakar and Robbes probed frozen early layers of different models for code with a single strategy, we use 12 different strategies for combining unfrozen early layers during fine-tuning and focus on the tasks of bug detection and bug type classification. Similarly, Hernández López et al. [18] probed different layers of five pre-trained models, including CodeBERT [14] and GraphCodeBERT [16], and found that most syntactic information is encoded in the middle layers. The novelty of our study with respect to Karmakar and Robbes is that we combine early layers in addition to extracting each of them, while Karmakar and Robbes extracted early layer representations and used them without composing new representations.

ENCODERS FOR CODE CLASSIFICATION
In this section, we present background on transformer models and the different uses of the encoder-decoder, or full transformer, architecture, as well as its encoder-only and decoder-only variants. Because our study focuses on encoder-only open-source models available for fine-tuning, the distinction between transformer types is necessary for understanding the methodology.
In sequence-to-sequence generation scenarios, the transformer model consists of a multi-layer encoder that represents the input sequence and a decoder that generates the output sequence based on the sequence representation from the encoder and the output generated at previous steps [42]. For source code classification tasks, the transformer is frequently reduced to only its encoder followed by a classification head, a component added to the encoder to categorize the representation into different classes. Dropping the decoder for classification is motivated by resource efficiency, because the decoder is conceptually only needed for token generation from the input sequence. During classification of an input, the encoder represents the sequence and passes it to the classification head. Based on this design, a number of pre-trained encoders have been published in recent years, such as BERT and RoBERTa, which were pre-trained on natural language, and similar models pre-trained on code or a combination of code and natural language [12, 26]. The goal of pre-training in the pre-train-and-fine-tune scenario is to capture language patterns in general, so that they can serve as a basis for domain-specific downstream tasks. Pre-trained models can be fine-tuned on different downstream tasks in NLP and SE.
Processing the input sequence for classification consists of several steps: tokenization, initial embedding, encoding the sequence with an encoder, and passing the sequence representation through a classification head. Tokenization splits the input sequence, adds special tokens, matches the tokens to their IDs in the vocabulary of tokens, and unifies the resulting token length for samples in a dataset. Embedding transforms the one-dimensional token ID into an initial multi-dimensional static vector representation of the token and is usually part of the pre-trained encoder model. This representation is updated using the attention mechanism of the encoder. Because of attention, the representation of the input is influenced by all tokens in the sequence, so it is contextualized.
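The steps above can be sketched with a toy pipeline. The whitespace vocabulary, token IDs, and maximum length below are purely illustrative stand-ins, not the actual CodeBERT tokenizer or its vocabulary:

```python
# Toy illustration of the classification input pipeline:
# split the input, add special tokens, map tokens to vocabulary IDs,
# and pad/truncate to a fixed length. All IDs here are hypothetical.

VOCAB = {"<cls>": 0, "<eos>": 1, "<pad>": 2, "<unk>": 3,
         "def": 4, "return": 5, "x": 6, "+": 7, "1": 8}
MAX_LEN = 8

def tokenize(code: str) -> list:
    tokens = ["<cls>"] + code.split() + ["<eos>"]         # add special tokens
    ids = [VOCAB.get(t, VOCAB["<unk>"]) for t in tokens]  # match tokens to IDs
    ids = ids[:MAX_LEN]                                   # truncate long inputs
    ids += [VOCAB["<pad>"]] * (MAX_LEN - len(ids))        # pad short inputs
    return ids

ids = tokenize("return x + 1")   # -> [0, 5, 6, 7, 8, 1, 2, 2]
```

In a real pipeline, each ID is then mapped to its static embedding vector and contextualized by the encoder layers.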
CodeBERT is a RoBERTa-based model with 12 encoder layers pre-trained on 6 programming languages (Python, Java, JavaScript, PHP, Ruby, and Go), as well as on text-to-code tasks [14]. Pre-training was done on the masked language modeling (MLM) and replaced token detection (RTD) tasks. These tasks train the model, respectively, to derive which token is masked (MLM) and to predict whether any token in the original sequence has been swapped with a different token that should not be in the sequence (RTD). CodeBERT outputs a bidirectional encoder representation of the input sequence, which means that the model considers context from preceding and subsequent tokens to represent each token in the input sequence.
A pre-trained model is usually released together with a pre-trained tokenizer. The pre-trained tokenizer ensures that token IDs correspond to those processed during pre-training. The tokenizer also adds special tokens, such as a CLS token at the start of each input sequence, PAD tokens to unify the lengths of input sequences, and an EOS token to signify the end of the input string and the start of the padding sequence [12]. All tokens are transformed by the model in each encoder layer. Out of all tokens, the CLS token representation from the last layer, which is updated by all encoder layers, is typically used as the representation of the full sequence.
The standard practice of using the CLS token from the last encoder layer is motivated by the pre-training procedure. For example, in MLM, the model predicts the masked token based on the CLS token representation from the 12th layer of BERT and CodeBERT. However, the choice of token to represent the full sequence in fine-tuning can be different. For example, in PLBART [1], a transformer model for code with both an encoder and a decoder, the EOS token is used for representing the input sequence. In this paper, we propose different ways to represent the input sequence and to use information from the early layers of the model in an effective way.
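The standard practice can be illustrated with a simulated encoder output. Here the hidden states are a random tensor of shape (layers, tokens, hidden size) standing in for real encoder outputs; the dimensions are small for illustration, whereas CodeBERT uses 12 layers and a hidden size of 768:

```python
import numpy as np

# Simulated encoder output for one input: L layers, n tokens, hidden size d.
# A real model would produce these from the embedded input sequence.
L, n, d = 12, 6, 4
rng = np.random.default_rng(0)
hidden = rng.normal(size=(L, n, d))

# Baseline representation: CLS token (position 0) of the last layer.
cls_last = hidden[-1, 0, :]   # vector of size d, passed to the classification head
```

All other layer outputs, `hidden[:-1]`, are what the standard practice discards.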

METHODOLOGY
In this paper, the architecture of the code classification model consists of five parts: (1) a tokenizer, (2) an embedding layer, (3) an encoder with several layers, (4) a set of operations to combine sequence representations from encoder layers with EarlyBIRD, and (5) a classification head. The output of each step is used as input to the next step. An overview of the architecture is shown in Figure 1 and described below. The main difference between this architecture and the classification architecture discussed in Section 3 is step (4); the standard architecture only consists of steps (1)-(3) and (5).
Steps (1)-(3) use a pre-trained tokenizer, embedder, and encoder. EarlyBIRD is formulated in a generic way and can be applied to any encoder, but for our experiments, we fix the CodeBERT model and tokenizer. In step (4), we combine information from all the layers or from only some of the early layers of the encoder, as opposed to the baseline that uses the last layer of the encoder. Finally, the classification head in step (5) consists of one dropout layer and one linear layer with softmax.
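A minimal numpy sketch of the step (5) head (dropout followed by a linear layer with softmax); the hidden size, class count, and weight initialization are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_classes = 768, 2                 # hidden size and classes (illustrative)

W = rng.normal(scale=0.02, size=(num_classes, d))   # linear layer weights
b = np.zeros(num_classes)                           # linear layer bias

def classification_head(v, train=False, p_drop=0.1):
    """Dropout -> linear -> softmax over class logits."""
    if train:                                       # dropout is active only in training
        mask = rng.random(v.shape) >= p_drop
        v = v * mask / (1.0 - p_drop)               # inverted dropout scaling
    logits = W @ v + b
    e = np.exp(logits - logits.max())               # numerically stable softmax
    return e / e.sum()

probs = classification_head(rng.normal(size=d))     # class probabilities, sum to 1
```

Because every EarlyBIRD combination outputs a vector of the same hidden size, this head is identical across all combinations.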
The encoder model represents each token of an input sequence with a vector of size d, also known as the hidden size. For each input sequence of length n and hidden size d, we obtain a matrix of size n × d for each of the L layers of the base model, as shown in Figure 1. For example, the CodeBERT architecture is fixed with 12 encoder layers, i.e., L = 12 for that model. All the information available in the encoder for one input sequence is stored in a tensor of size L × n × d. The EarlyBIRD combinations must produce one vector v of size d that represents the input, as shown in Figure 1. Keeping the output code representation of size d is required to provide a fair comparison of EarlyBIRD composite representations with the standard code representation obtained from the last layer. In this way, the dimension of the classification head is the same for all combinations of early layers and has minimal possible influence during fine-tuning. As a strategy for systematically investigating composite representations, we create a grid search over three typical operations to combine outputs of neural network layers (maximum pooling (max pool), weighted sum, and slicing) and two dimensions to apply the operations: over tokens and/or layers. For the token dimension, we either use all of the tokens from a specific layer or only the CLS token. Among layers, we either slice one layer, or sum or take maximum values over all layers. The choice of considering every token of a layer is motivated by the fact that transformer-based models exhibit varying degrees of attention for different types of tokens [29], which indicates that solely using the CLS token might not be the best choice for all tasks [37]. We also experiment with different sizes of the model. The combination strategies that use all layers of the pre-trained model are divided into two categories: strategies that use CLS tokens from the encoder layers, and strategies that use more tokens than just CLS from the encoder layers.
When we slice the CLS token and apply each of the operations over layers, we obtain the following CLS-token combinations: (i) baseline: CLS token from the last layer, i.e., layer no. L; (ii) CLS token from one layer no. l, l ∈ {1, ..., L−1}; (iii) max pool over CLS tokens from all layers 1, ..., L; (iv) weighted sum over CLS tokens from all layers 1, ..., L. The second set of combinations uses representations of all the tokens in the tokenized input sequences, including the CLS token. We first apply the max pooling operation to either all tokens or all layers and then use the rest of the operations. Then we apply the weighted sum as the first operation, followed by max pooling or slicing of a layer: (v) max pool over tokens from one layer no. l, l ∈ {1, ..., L}; (vi) max pool over all layers for each token in the input sequence, then max pool over tokens; (vii) max pool over all layers for each token in the input sequence, then weighted sum over tokens; (viii) max pool over all tokens for each layer no. l, l ∈ {1, ..., L}, then weighted sum over layers; (ix) weighted sum over tokens from one layer no. l, l ∈ {1, ..., L}; (x) weighted sum over tokens for each layer no. l, l ∈ {1, ..., L}, then weighted sum over all layers; (xi) weighted sum over all layers for each token in the input sequence, then weighted sum over all tokens. Note that the weights in the weighted sums are learnable parameters. However, the added number of learnable parameters for fine-tuning constitutes 0.00042% of the number of learnable parameters in the baseline configuration. For this reason, we say that the models with combinations (ii-xi) have the same model size, while bearing in mind the overhead of the learnable weights in the weighted sums.
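A few of these combinations can be sketched over a simulated L × n × d tensor of layer outputs (random values standing in for real encoder states; in the actual model the weighted-sum weights are learned during fine-tuning rather than fixed):

```python
import numpy as np

rng = np.random.default_rng(1)
L, n, d = 12, 8, 4                    # layers, tokens, hidden size (illustrative)
hidden = rng.normal(size=(L, n, d))   # all encoder-layer outputs for one input

cls = hidden[:, 0, :]                 # CLS token from every layer, shape (L, d)

v_ii  = cls[2]                        # (ii)  CLS from one early layer (here layer 3)
v_iii = cls.max(axis=0)               # (iii) max pool over CLS tokens of all layers
w = np.full(L, 1.0 / L)               # weights; learnable in the real model
v_iv  = w @ cls                       # (iv)  weighted sum over CLS tokens
v_v   = hidden[2].max(axis=0)         # (v)   max pool over all tokens of one layer

# Every combination yields a vector of size d, so the same classification
# head fits regardless of which strategy is chosen.
```

This size-d constraint is exactly what makes the comparison with the baseline fair.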
In addition to the experiments with token combinations, we also investigate the performance of the model with the first k < L layers and the baseline token combination, described as follows:

Figure 2: Combinations of early encoder layers that lead to the code representation vector v for each tokenized input sequence. The roman numbering in brackets corresponds to the combinations described in Section 4. Observe that the presentation order has been designed to preserve space by grouping similar combinations in the same subfigure.
(xii) CLS token from the last layer of the model with k < L encoder layers.
Note that the baseline combination (i), which uses the CLS token from layer L, corresponds to (ii) and to (xii) with k = L.
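Pruning in combination (xii) amounts to keeping only the first k layers of the encoder stack. A minimal sketch, with toy layer functions standing in for real transformer blocks:

```python
# Sketch of layer pruning: treat the encoder as a stack of layer functions
# and keep only the first k of them. The "layers" here are toy stand-ins
# (each one appends its index), not real transformer blocks.

def make_layer(i):
    return lambda x: x + [i]          # toy behavior: record which layers ran

full_encoder = [make_layer(i) for i in range(1, 13)]   # 12 layers, like CodeBERT
k = 3
pruned_encoder = full_encoder[:k]                      # keep early layers only

def encode(layers, x):
    for layer in layers:
        x = layer(x)                  # each layer transforms the previous output
    return x

trace = encode(pruned_encoder, [])    # only layers 1..k are executed
```

Because layers after k are never executed (and their parameters never loaded), both fine-tuning and inference cost drop roughly in proportion to the layers removed.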
The combinations are presented in Figure 2. Similar combinations are presented close to each other or are combined in the same image if they only have minor differences and share the major parts. For example, in Figure 2c, we illustrate combinations (iii) and (iv), because both of them use CLS tokens from all layers, combined using max pooling or a weighted sum. The roman numerals that indicate combination types are preserved either in the descriptions below the figures or in the figures themselves, but the order is changed. We mention the combination number corresponding to the description in the current section, such as the baseline combination (i) in Figure 2a or combination (ii) for the CLS token from one early layer in Figure 2b.
We highlight with color which parts of the encoder layer outputs are used for each combination. White cells correspond to the tokens that are not used in early layer combinations. The goal of all combinations is to obtain a vector representation v for each input code sample. For example, in Figure 2a, we consider the last layer L and extract only the CLS token, marked as v. Another remark on the EarlyBIRD combinations concerns the usage of all tokens versus only code tokens. Code tokens are those that correspond to tokenized input words or sub-words and are shown in Figure 2 as tokens t1, ..., tn for an input sequence of length n. For each combination that uses more than just the CLS token, i.e., combinations (v-xi), we experiment with code tokens only, as well as with all tokens, including CLS, EOS, and PAD. The motivation to check code tokens exclusively stems from the hypothesis that information in special tokens may introduce noise into the results.

EXPERIMENTAL SETUP
In this section, we describe the datasets used for the empirical evaluation and the implementation details of fine-tuning with the proposed EarlyBIRD approach. We investigate binary and multi-class code classification scenarios to explore the generalisability of our results.

Datasets for Source Code Classification
We fine-tune and test the CodeBERT model using the EarlyBIRD approach on four datasets. The datasets span three tasks: defect detection, error type classification, and exception type classification, with 2, 3, and 20 classes, respectively. They also contain data in two programming languages, C/C++ and Python. In addition, the chosen datasets have similar train subset sizes. In this way, we aim to reduce the effect of the model's exposure to different amounts of training data during fine-tuning. Statistics of the datasets are provided in Table 1. We report the size of the train/validation/test splits. In addition, we compute the average number of tokens in the input sequences upon tokenization with the pre-trained CodeBERT tokenizer. Because the maximum input sequence size for the CodeBERT model is limited to 512 tokens, the number of tokens is indicative of how much information the model gets access to, or how much information is cut off in the case of long inputs.

Devign: This dataset contains functions in C/C++ from two open-source projects labelled as vulnerable or non-vulnerable [51]. We reuse the train/validation/test split from the CodeXGLUE Defect detection benchmark. The dataset is balanced: the ratio of non-vulnerable functions is 54%.

ReVeal: Similarly to Devign, ReVeal is a vulnerability detection dataset of C/C++ functions [7]. The dataset is not balanced: it contains 90% non-vulnerable code snippets. Both the Devign and ReVeal datasets contain real-world vulnerable and non-vulnerable functions from open-source projects.

Break-It-Fix-It (BIFI): The dataset contains function-level code snippets in Python with syntax errors [47]. We use the original buggy functions and formulate a task of classifying the code into three classes: Unbalanced Parentheses with 43% of the total number of code examples in BIFI, Indentation Error with 31% of code samples, and Invalid Syntax containing 26% of samples. The train/test split provided with the dataset is reused, and the validation set is extracted as 10% of the training data.

Exception Type: The dataset consists of short functions in Python with an inserted __HOLE__ token in place of one exception in the code. The task is to predict one of 20 masked exception types for each input function; the dataset is unbalanced. It was initially created from the ETH Py150 Open corpus as described in the original paper [20]. We reuse the train/validation/test split provided by the authors.

Implementation
The architecture is based on the CodeBERT tokenizer and encoder model. The model defines the maximum sequence length and hidden size.

Evaluation Metrics
To present the impact of early layer combinations, we compare the accuracy on the test set for all datasets, because it allows us to compare our results with other benchmarks. In addition, we report the weighted F1-score, denoted as F1(w), for a detailed analysis of selected combinations to account for class imbalance. To obtain the weighted F1-score, the regular F1-score is calculated for each label, and their weighted mean is taken; the weights are equal to the number of samples in each class. We also report the results of the Wilcoxon signed-rank test on the corresponding metrics for the combinations that show improvement over the baseline [45]. The Wilcoxon test is a non-parametric test suitable for the setting in which different model variants are tested on the same test set, because it is a paired test. It tests the null hypothesis that two related paired samples come from the same distribution. We reject the null hypothesis if the p-value is less than α = 0.05. In case we obtain an improvement of a metric over the baseline with an EarlyBIRD combination and the null hypothesis is rejected, we conclude that the combination performs better and the result is statistically significant. For the pruned models, we compute Vargha and Delaney's Â12 non-parametric effect size measure of the performance change for accuracy and F1(w), with thresholds of 0.71, 0.64, and 0.56 for large, medium, and small effect sizes [41].
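The weighted F1-score described above can be computed directly from its definition. A minimal pure-Python sketch (in practice a library implementation would be used); the example labels are invented for illustration:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * count / len(y_true)   # weight F1 by class support
    return total

score = weighted_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

On imbalanced datasets such as ReVeal, this weighting prevents the majority class from completely dominating the reported score while still reflecting class frequencies.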

Research Questions
During our empirical evaluation of composite EarlyBIRD code representations, we address the following research questions:

RQ1. Composite Code Representations with Same Model Size:
What is the effect of using combinations (ii-xi) of early layers with the same model size, in comparison to the baseline approach of using only the CLS token from the last layer, i.e., combination (i), on model performance in the code classification scenario? The goal is to find out whether any of the EarlyBIRD combination types work consistently better across different datasets and tasks.
RQ2. Pruned Models: What is the effect of reducing the number of pre-trained encoder layers in combination (xii) on resource usage and model performance on code classification tasks? As opposed to RQ1, in which we consider combinations that do not reduce the model size, this research question is devoted to investigating the trade-off between using fewer resources with reduced-size models and the resulting variation in classification metrics.
For both research questions, we evaluate the composite representations in binary and multi-class code classification scenarios to explore the generalisability of the results obtained for the binary case. We investigate whether and which combinations result in better performance, averaged over 10 runs with different seeds. For combinations that improve on the baseline on average, we also explore whether the results are statistically significant according to the Wilcoxon test.

RESULTS AND DISCUSSION

EarlyBIRD with Fixed-Size Models
To answer RQ1, we explore one-layer combinations and multi-layer combinations, and estimate the statistical significance of the performance improvement. The first rows in Figures 3a and 3b correspond to combinations (ii), the CLS token from a selected layer. With this combination type, an average improvement over the baseline is achieved with the majority of early layers. Specifically, we obtain accuracy improvements ranging from +0.2 to +2.0 for Devign in 8 out of 11 layers, and accuracy improvements from +0.1 to +0.8 for ReVeal in 9 out of 11 layers. The dynamics of the metric change over the selected layer numbers differ for Devign and ReVeal. In detail, the average performance of combination (ii) is best with layer 3 on Devign (a +2.0 accuracy improvement) and with layer 1 for ReVeal (a +0.8 accuracy improvement). The best improvement in terms of F1(w) is achieved with layer 3 for Devign and with layer 2 for ReVeal, as shown in Figure 4.
Max pooling over all available tokens from a selected layer in combination (v) also achieves a performance improvement over the baseline, as shown in rows 2 and 3 of Figures 3a and 3b. In general, with max pooling, layers 4-11 yield higher accuracy and layers 2-11 higher F1(w) for Devign than the baseline. For ReVeal, all layers except layer 11 result in better average accuracy, and layers 2-10 have higher average F1(w). Max pooling over all tokens, including special tokens, achieves the best statistically significant average accuracy improvement of +0.9 of all combinations for ReVeal.
The weighted sum of all tokens, or of code tokens exclusively, in combination (ix) does not improve the baseline performance. We assume that fine-tuning for 10 epochs is not enough for this type of combination, because the loss at epoch 10 on both the training and validation splits is higher for combinations (ix) than for combinations with max pooling. Since the goal of this study is to use the same or fewer resources for fine-tuning, we have not fine-tuned this combination for more than 10 epochs.
While combinations (ii) and (v) perform better for the majority of layers on the defect detection task, multi-class classification for bug or exception type prediction does not benefit from the combinations to the same extent as the binary task. Only max pooling of the tokens of the last encoder layer achieves better performance than the baseline for the BIFI (+0.1 accuracy, +0.1 weighted F1-score) and Exception Type (+0.2 accuracy, +0.1 weighted F1-score) datasets.
The impact of using all tokens versus code tokens exclusively depends on the dataset. The difference in performance between single-layer combinations with max pooling of all tokens and of code tokens only amounts to 0.0-0.1 accuracy or F1(w). For the multi-class tasks, the average results improve with the use of each later layer in the model. We obtain a performance improvement with the max pooling combination (v), while the other one-layer combinations do not perform better than the baseline.
The best performing results on the Devign and Exception Type classification datasets are statistically significant according to the Wilcoxon test. For ReVeal, the second best result is statistically significant. We have not obtained statistically significant improvements for BIFI. We attribute this to the fact that the baseline metric is already high, i.e., 96.7 accuracy. Achieving an improvement is usually more challenging when the baseline performs at this level.
In essence, the combinations that involve the CLS token from a single layer (ii), as well as the max pooling combinations (v), perform better on average for the defect detection datasets Devign and ReVeal. However, only the max pooling combination (v) of tokens from the last encoder layer outperforms the baseline on average for the multi-class datasets BIFI and Exception Type. The weighted sum of tokens from a selected layer (ix) performs worse than the baseline if fine-tuned for the same number of epochs on all tasks. In our experiments, multi-class classification tasks require the information from the last layer for better performance, while the binary task of defect detection allows us to use early layers and improve the performance over the baseline.
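The single-layer combinations discussed above can be sketched concisely. This is our own simplified illustration, not the paper's implementation: we simulate the per-layer hidden states that an encoder such as CodeBERT returns (e.g., via the transformers library's `output_hidden_states=True`), with one `(seq_len, hidden_dim)` array per layer, and compute combinations (ii), (v), and (ix) from a selected layer. In the real model, the weights in (ix) are trainable and each resulting vector is fed to the classification head.

```python
# Sketch (ours, simplified) of three single-layer EarlyBIRD
# combinations over simulated per-layer hidden states.
import numpy as np

rng = np.random.default_rng(0)
num_layers, seq_len, hidden = 12, 6, 768
# hidden_states[l-1] is the output of encoder layer l (1-based).
hidden_states = [rng.normal(size=(seq_len, hidden)) for _ in range(num_layers)]

def cls_token(layer):                        # combination (ii)
    return hidden_states[layer - 1][0]       # CLS is the first token

def max_pool(layer):                         # combination (v)
    return hidden_states[layer - 1].max(axis=0)  # element-wise max over tokens

def weighted_sum(layer, weights):            # combination (ix); weights are
    w = np.asarray(weights)                  # trainable in the real model
    return (w[:, None] * hidden_states[layer - 1]).sum(axis=0)

# Each combination produces one vector for the classification head.
for vec in (cls_token(3), max_pool(3), weighted_sum(3, [1 / seq_len] * seq_len)):
    assert vec.shape == (hidden,)
```

With uniform weights 1/seq_len, combination (ix) reduces to mean pooling, which makes the contrast with max pooling easy to see.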
6.1.2 Multi-Layer Combinations. The average performance difference with the baseline of combinations that utilize early layers is shown as heatmaps in Figures 5 and 6. We include the value of the average performance difference and add a star (*) to the number if the difference is statistically significant. Again, negative values are shown in black, and positive values are shown in white.
When we use all information from the available layers, the improvement over the baseline is smaller than what was observed in Section 6.1.1, where one specific layer was used. In detail, among the combinations that involve CLS tokens from all early layers, no combination performs better than the baseline for the ReVeal, BIFI, or Exception Type datasets. However, the best improvement (+0.6 accuracy) among the experiments with all layers is obtained on Devign with the weighted sum of CLS tokens in combination (iv), which is less than the maximum improvement with combinations from one selected early layer in Section 6.1.1. The improvement of F1(w) is shown in Figure 6. We obtained slightly better improvements of F1(w) for Devign, and no F1(w) improvement for the unbalanced ReVeal dataset. The average F1(w) differences with the baseline for the multi-class tasks are the same as the accuracy differences.
If we consider the combinations that involve all tokens, the combination (vi) with two max pooling operations outperforms the baseline for Devign, ReVeal, and BIFI with accuracy improvements between +0.1 and +0.3. No combination that involves all layers outperforms the baseline on average for the Exception Type dataset. Combinations that involve one max pooling and one weighted sum of all tokens perform worse than or on par with the baseline. The combinations with only weighted sums perform worse than the baseline on average.

Answer to RQ1. EarlyBIRD achieves statistically significant accuracy and F1-score improvements for defect detection datasets by using single-layer combinations that involve the CLS token or max pooling over all tokens. For bug type and exception type classification, max pooling of the tokens from the last encoder layer has improved the performance. The weighted sum of tokens does not improve performance over the baseline.

Pruned Models
This section is devoted to the combinations of early layers that are initialized with the first n < N early layers from the pre-trained model and fine-tuned as n-layer models, i.e., combinations (xii). We start by comparing the performance of using the CLS token from layer n of the full-size model, i.e., combination (ii), with using the CLS token from layer n of the model that has n layers in total, i.e., combination (xii). Figure 7 presents the average accuracy obtained with these two combinations depending on the used layer, as well as the baseline combination of using the CLS token from the last layer N = 12 of CodeBERT. On average, the pruned models with reduced size perform on par with the full-size model for defect detection on the balanced Devign dataset, and for bug type and exception type classification. However, the performance of the two analogous combinations diverges for the unbalanced defect detection dataset ReVeal in layers 4 and 6-11.
Most importantly, the results show that reducing the model size and using the CLS token from the last layer of the reduced model performs on par with the baseline for the defect detection task. The best improvement with a reduced model is achieved with the 3-layer encoder for Devign and the 1-layer encoder for ReVeal. This result shows that it is possible to both reduce resources and improve the model's performance during fine-tuning on the defect detection task, with both a balanced and an unbalanced dataset.
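The pruning step for combination (xii) can be sketched as follows. This is a hedged, minimal sketch of our own: with Hugging Face transformers it would operate on `model.encoder.layer` (a `ModuleList` in BERT/RoBERTa-style models) of a model loaded via, e.g., `AutoModel.from_pretrained("microsoft/codebert-base")`; here we demo with a stand-in object so the snippet runs without downloading CodeBERT.

```python
# Hedged sketch of layer pruning (combination (xii)): keep only the
# first n encoder blocks of a pre-trained model before fine-tuning.
from types import SimpleNamespace

def prune_encoder(model, n):
    """Keep layers 1..n and record the new depth in the config."""
    model.encoder.layer = model.encoder.layer[:n]
    model.config.num_hidden_layers = n
    return model

# Stand-in for a 12-layer encoder; in practice the blocks would be
# the pre-trained transformer layers of CodeBERT.
model = SimpleNamespace(
    encoder=SimpleNamespace(layer=[f"block_{i}" for i in range(1, 13)]),
    config=SimpleNamespace(num_hidden_layers=12),
)
pruned = prune_encoder(model, 3)
print(len(pruned.encoder.layer), pruned.config.num_hidden_layers)  # 3 3
```

The pruned model is then fine-tuned as a regular n-layer encoder, with the CLS token of its last remaining layer feeding the classification head.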
To explore the trade-off between resource usage and performance degradation for bug type and exception type identification, we show the average speed-up of one fine-tuning epoch and the performance loss compared to the baseline for the BIFI and Exception Type datasets in Table 2. We also report the corresponding values for Devign and ReVeal, for which both gains and losses of performance are indicated. The speed-up is reported as a scaling factor of the baseline time. The metric difference is shown as a gain or loss of the weighted F1-score and accuracy compared to the baseline performance. Statistically significant improvements are reported in bold, while statistically insignificant losses are marked with a star (*). The Â12 effect sizes are indicated by three shades of blue as the cell color, with the darkest shade indicating a large effect (Â12 > 0.71), the middle shade a medium effect (Â12 > 0.64), and the lightest shade a small effect (Â12 > 0.56). We also underline and discuss selected results that improve the metric values and reduce resource usage. The majority of combinations (xii) with pruned models outperform the baseline for Devign and ReVeal. Furthermore, models with 2-10 layers show statistically significant improvements of both metrics on Devign, with the 3-layer model achieving a +2 accuracy improvement and a 3.3-times average speed-up of fine-tuning with the same hardware and software. Not only does the 3-layer model improve the accuracy over the CodeBERT baseline to 63.7, but it also outperforms several other models tested on Devign and reported on the CodeXGLUE benchmark [27]. In particular, our pruned 3-layer CodeBERT model outperforms the full-transformer model PLBART [1], and code2vec code representations pre-trained on abstract syntax trees and code tokens in a joint manner [1]. However, our pruned model does not outperform the best performing model reported on CodeXGLUE, CoTexT, which achieves 66.62 accuracy [34].
Models with 1 and 11 layers achieve statistically significant accuracy improvements for ReVeal. However, the 1-layer model reduces the F1(w) score. The use of layer 11 does not impact the speed of fine-tuning, while the 1-layer model yields a 3.7x acceleration of the baseline fine-tuning speed. The lack of speed-up with the 11-layer model can be explained by the fact that the number of trainable parameters does not decrease linearly with the removal of later layers, since the additional embedding layer and classification head remain unchanged. The 2-layer model results in the best improvement of F1(w), which is statistically significant, and improves accuracy on ReVeal as well. For Devign and ReVeal, the statistically significant improvements have a large effect size.
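A back-of-the-envelope calculation makes the non-linearity concrete. The figures below are approximate and assumed by us for a RoBERTa-base-sized model such as CodeBERT (hidden size 768, FFN size 3072, vocabulary around 50k, biases and layer norms ignored); the point is only that the embedding tables are kept regardless of depth, so parameter count does not shrink proportionally with the number of layers.

```python
# Approximate parameter counts (our assumption: RoBERTa-base-like
# dimensions) showing why pruning layers shrinks the model less
# than proportionally: embeddings survive every pruning.
hidden, ffn, vocab, max_pos = 768, 3072, 50265, 514

per_layer = 4 * hidden * hidden + 2 * hidden * ffn   # attention + FFN weights
embeddings = vocab * hidden + max_pos * hidden       # token + position tables

def params(n_layers):
    return embeddings + n_layers * per_layer

full = params(12)
for n in (11, 3, 1):
    print(n, round(params(n) / full, 2))  # fraction of parameters kept
```

Under these assumptions, an 11-layer model still keeps roughly 94% of the parameters, and even a 1-layer model keeps over a third of them, consistent with the modest speed-up observed when only one layer is removed.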
For BIFI, we obtain a statistically insignificant decrease of F1(w) and accuracy according to the Wilcoxon test, together with a 1.2x speed-up of fine-tuning with the 11-layer model. If we decrease the number of layers to 8, the performance on BIFI stays within the (baseline metric − 1) limit, but we gain up to a 1.7x average speed-up of one-epoch fine-tuning. When using models with 1-10 layers, we observe a statistically significant change of distribution and a decrease of metric values.
For the unbalanced Exception Type dataset, the performance drops faster and the speed-up is less prominent than for BIFI. The change of the mean values of the metrics is statistically significant for all models. In detail, the metrics decrease by -1.0 absolute metric value with 11 layers at a 1.1x fine-tuning speed-up, and by -1.8 with 10 layers at a 1.2x speed-up. We explain the sharper decline of the combinations' performance by the lower baseline metric values (75.39 accuracy, 75.30 weighted F1-score) than in the case of BIFI (96.7 accuracy and weighted F1-score). For BIFI, the statistically insignificant deterioration has a small effect size. However, for both the BIFI and Exception Type datasets, we observe deterioration of performance with a large effect size with pruned models.
We conclude that for the BIFI dataset, with its high-performing baseline and 3 classes, the performance loss from removing each layer is smaller than for the Exception Type classification dataset, with its lower baseline performance and 20 classes. The resource usage, which is correlated with the time spent on fine-tuning, decreases faster for BIFI than for Exception Type. This is partially explained by the larger classification head for the Exception Type dataset, because this dataset has 20 classes as opposed to only 3 classes in BIFI. In other words, we observe that the BIFI dataset has a strong baseline that is hard to outperform with pruning. By contrast, the complexity of the Exception Type dataset can influence the results in the opposite way: the baseline performance is already not very strong, and it proves hard to further improve on it with early layers only.
Answer to RQ2. We obtain performance improvements over the baseline as well as fine-tuning speed-ups for both defect detection datasets by using the CLS token from the last layer of pruned models. For multi-class classification, performance decreases with each layer pruned from the end of the model. The decrease is sharper for the dataset with 20 exception types than for the task with 3 bug types.

Threats to Validity
The main threat to external validity is that the results are empirical and may not generalize to all code classification settings, including other programming languages, tasks, and encoder-based models for code. In this study, we have tested EarlyBIRD combinations on code in C for defect detection and in Python for bug type and exception type classification. The choice of CodeBERT as the encoder model, and its internal structure, affects the results. For instance, an encoder model that takes smaller input sequences can perform worse on the same datasets, because larger parts of the input code sequences have to be truncated in this case. The external validity can be improved by testing on more datasets and encoder models.
The threats to internal validity concern the dependency of the models on the initialization of trainable parameters and the choice of methods. The classification head and the weighted sums with trainable parameters in our experiments depend on the initialization of these parameters, which can lead the model to arrive at different local minima during fine-tuning. To reduce the effect of different random initializations, we have fine-tuned and tested all EarlyBIRD combinations 10 times with different random seeds.
In addition, we used the Wilcoxon test to verify whether the achieved improvements are statistically significant. However, the Wilcoxon test only estimates whether the measurements of the baseline and of the EarlyBIRD combinations are drawn from different distributions. The reported times spent on fine-tuning and the corresponding speed-ups have the purpose of illustrating the reduction in resource usage, and will depend on the hardware used. Even when using speed-up factors for pruned models, there is a chance that these numbers will differ on other hardware configurations.
We implemented the algorithms and statistical procedures in Python, with the help of widely used libraries such as PyTorch, NumPy and SciPy.However, we cannot guarantee the absence of implementation errors which may have affected our evaluation.

CONCLUDING REMARKS
In this paper, we have proposed EarlyBIRD, an approach to combine early layers of encoder models for code, and tested different early-layer combinations on the software engineering tasks of defect detection and bug type and exception type classification. Our study is motivated by the hypothesis that early layers contain valuable information that is discarded by the standard practice of representing the code with the CLS token from the last encoder layer. EarlyBIRD provides ways to improve the performance of existing models with the same resource utilization, as well as to reduce resource usage while obtaining results comparable to the baseline. Results: Using EarlyBIRD, we obtain statistically significant improvements over the baseline for the majority of the combinations that involve a single encoder layer on defect detection, and with selected EarlyBIRD combinations on bug type and exception type classification. Max pooling of tokens from selected single layers yields performance improvements for all datasets. Both the classification performance and the average fine-tuning time for one epoch are improved by pruning the pre-trained model to its early layers and using the CLS token from the last layer of the pruned model. For defect detection, this results in a +2.0 increase in accuracy and a 3.3x fine-tuning speed-up on Devign, and up to a +0.8 accuracy improvement with a 3.7x speed-up on ReVeal. Pruned models do not lead to multi-class classification performance gains, but they do show a fine-tuning speed-up and the associated decrease in resource consumption.
The results show that pruned models with reduced size either work better or can reduce resource usage during fine-tuning with varying levels of performance variation, which indicates the potential of EarlyBIRD in resource-restricted scenarios of deploying defect detection and bug type classification in production environments. For example, EarlyBIRD achieves a 2.1x speed-up for BIFI while reducing accuracy from 96.7 to 95.0. Future Work: The study can be extended by investigating the generalization to other encoder models. We are in the process of studying the performance of EarlyBIRD with two new encoder models: StarEncoder [23] and ContraBERT_C [25]. Another direction for future research is whether the types of layer combination and pruning that we have investigated in this paper for encoder architectures are also effective for decoder and encoder-decoder architectures. Moreover, it would be of interest to experiment with other code classification tasks, such as general bug detection and the prediction of vulnerability types. The latter could be investigated using the CWE types from the Common Weakness Enumeration, as labeled in the CVEfixes dataset [5].

Figure 1: Model architecture for code classification.

6.1.1 Combinations of Tokens in Single Selected Early Layers. Figure 3 shows a heatmap of the difference in mean accuracy obtained with each combination that uses only one selected early layer, compared to the baseline. In addition, we show the value of the difference in mean accuracy for each combination type and layer number. Note that the scale is logarithmic and in the most extreme case spans the interval from ca. -37 to +2. Negative values are shown in black, and positive values are shown in white. Differences that are statistically significant according to the Wilcoxon test are marked with a star (*) next to the value. Combinations that correspond to the baseline are marked with "bsln" and have zero difference, by definition. The results for the weighted F1-score show a similar pattern as those for the mean accuracy. They are visualized in the same way in Figure 4.

Figure 3: Difference of mean accuracy between EarlyBIRD and baseline (bsln) performance. The star (*) indicates a statistically significant difference w.r.t. the baseline.

Figure 5: Difference of average accuracy between EarlyBIRD and baseline performance on D (Devign), R (ReVeal), B (BIFI), E (Exception Type). The star (*) indicates a statistically significant difference w.r.t. the baseline.

Figure 7: Model performance with a subset of n < N layers (xii) vs. models with all N layers (ii); CLS token from layer n.

Table 1: Statistics of fine-tuning datasets.

CodeBERT has 12 layers; the maximum input length is T = 512 tokens, the hidden dimension is H = 768, and the number of layers is N = 12. Hyperparameters in the experiments are set to a batch size of b = 64, a learning rate of 1e-5, and a dropout probability of 0.1. If a tokenized input sample is longer than T = 512, we truncate the tokens at the end to make the input fit into the model. We run fine-tuning with the Adam optimizer and testing for each combination 10 times with different seeds for 10 epochs, and report the performance of the best epoch on average over the 10 runs. The best epoch is determined by measuring accuracy on a validation set. We use Python 3.7 and CUDA 11.6, and run experiments on one Nvidia Volta A100 GPU.
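For reference, the fine-tuning setup described above can be collected into a single configuration. This is our own summary of the reported values; the key names are ours, and we read the batch size of 64 from the hyperparameter list (an assumption, since the symbol was lost in extraction).

```python
# Fine-tuning configuration as reported in the text (key names ours;
# batch_size=64 is our reading of the hyperparameter list).
config = dict(
    max_input_tokens=512,   # longer inputs are truncated at the end
    hidden_dim=768,
    num_layers=12,
    batch_size=64,
    learning_rate=1e-5,
    dropout=0.1,
    optimizer="Adam",
    epochs=10,
    num_seeds=10,           # each combination is fine-tuned 10 times
)
print(config["learning_rate"])
```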

Table 2: Comparison of reduced-size models with the baseline. We report the metric performance for the baseline and the difference with the baseline for the reduced models, the average time for one-epoch fine-tuning (Time) in mm:ss format, and the speed-up and performance variation obtained with models with n layers. Statistically significant improvements are marked in bold, statistically insignificant performance losses are marked with a star (*), and Â12 effect sizes for accuracy and F1(w), if any, are indicated by the cell color (large, medium, and small, respectively). The best metric improvements with the highest speed-up factors are underlined.