Greening Large Language Models of Code

Large language models of code have shown remarkable effectiveness across various software engineering tasks. Despite the availability of many cloud services built upon these powerful models, there remain several scenarios where developers cannot take full advantage of them, stemming from factors such as restricted or unreliable internet access, institutional privacy policies that prohibit external transmission of code to third-party vendors, and more. Therefore, developing a compact, efficient, and yet energy-saving model for deployment on developers' devices becomes essential. To this aim, we propose Avatar, a novel approach that crafts a deployable model from a large language model of code by optimizing it in terms of model size, inference latency, energy consumption, and carbon footprint while maintaining a comparable level of effectiveness (e.g., prediction accuracy on downstream tasks). The key idea of Avatar is to formulate the optimization of language models as a multi-objective configuration tuning problem and solve it with the help of a Satisfiability Modulo Theories (SMT) solver and a tailored optimization algorithm. The SMT solver is used to form an appropriate configuration space, while the optimization algorithm identifies the Pareto-optimal set of configurations for training the optimized models using knowledge distillation. We evaluate Avatar with two popular language models of code, i.e., CodeBERT and GraphCodeBERT, on two popular tasks, i.e., vulnerability prediction and clone detection. We use Avatar to produce optimized models with a small size (3 MB), which is 160× smaller than the original large models. On the two tasks, the optimized models significantly reduce the energy consumption (up to 184× less), carbon footprint (up to 157× less), and inference latency (up to 76× faster), with only a negligible loss in effectiveness (1.67%).


LAY ABSTRACT
Large language models of code have proven to be highly effective for various software engineering tasks, such as spotting program defects and helping developers write code. While many cloud services built on these models (e.g., GitHub Copilot) are now accessible, several factors, such as unreliable internet access (e.g., over 20% of GitHub Copilot's issues are related to network connectivity [22]) and privacy concerns (e.g., Apple has banned the internal use of external AI tools to protect confidential data [53]), hinder developers from fully utilizing these services. Therefore, deploying language models of code on developers' devices like laptops appears promising. However, local deployment faces challenges: (1) Consumer-grade personal devices typically lack sufficient memory and the high-performance CPUs/GPUs required for efficient model execution; (2) Even if the hardware requirements are met, deploying the models on many devices can result in considerable energy consumption and carbon emissions, negatively impacting environmental sustainability.
To address these challenges, we present Avatar, an innovative approach that optimizes large language models of code and enables their deployment on consumer-grade devices. Avatar can optimize two popular models from a large size of 481 MB to a compact size of 3 MB, reducing inference time, energy consumption, and carbon emissions by up to 76×, 184×, and 157×, respectively. Our technique effectively lowers the entry barrier for leveraging large language models of code, making them available to ordinary developers without the need for high-performance computing equipment. It also contributes to a more sustainable and user-friendly software development environment.

INTRODUCTION
Recent years have seen a remarkable surge in Artificial Intelligence (AI)-powered services for software engineering, such as GitHub Copilot [23] and GitLab Auto DevOps [12]. This surge has brought a new level of automation to the software development process, significantly improving developers' productivity and the quality of software products. According to an economic analysis report released by GitHub, AI-powered services for software development could boost the global GDP by over $1.5 trillion by 2030 [13].
The foundation of these AI-powered services lies in large language models of code [35,49,55,56,82]. These models have shown superior performance in various software engineering tasks such as vulnerability detection [7,33] and code completion [9,47]. However, the services that utilize language models of code are typically hosted in the cloud, giving rise to several issues such as data leakage concerns [36,48,57,80] and poor user experience due to network fluctuations [22]. Therefore, there is a growing need for deploying these models within the integrated development environments (IDEs) on developers' local machines. Yet recent studies [65,75] have highlighted several challenges associated with deploying language models of code, including their large size, long inference latency, high energy consumption, and considerable carbon footprint.
Typically, language models of code are large-sized with numerous parameters. For example, CodeBERT [18] and GraphCodeBERT [26], two popular language models of code, both have 125 million parameters, resulting in a file size of about 500 megabytes (MB). The recently released Code Llama model is even larger at over 130 gigabytes (GB) [58]. However, real-world deployment experiences, as observed by the Visual Studio team in deploying IDEs, have emphasized a preference for compact models, which are typically around 3 MB in size and can seamlessly function as IDE components or editor plug-ins even on low-end hardware devices [70]. Meanwhile, language models perform billions of floating-point operations (FLOPs) during inference. These massive computations cause long inference latency, often taking over 1.5 seconds to return a prediction [65]. Such delays can disrupt developers' workflow, ultimately resulting in a suboptimal user experience. Previous studies [4,70] suggest that for a model deployed in IDEs to offer developers instantaneous assistance, its inference latency should ideally be within a few tens of milliseconds at most. The inability of language models of code to meet the above requirements gives rise to usability issues, consequently impeding their widespread deployment within developers' IDEs.
Furthermore, and perhaps even more importantly, the billions of FLOPs during inference entail significant energy consumption and carbon footprint, raising concerns about environmental and climate sustainability. Considering a CodeBERT deployed in IDEs, a developer typically needs to run it thousands of times per day, which is a common usage amount [31]. Such intensive usage results in an energy consumption of 0.32 kilowatt-hours (kWh), while a typical consumer-grade laptop has a battery capacity of around 70 watt-hours [40], i.e., 0.07 kWh. Consequently, a laptop's battery can only support a developer running CodeBERT for 0.22 hours, which is far from sufficient for a typical workday. This would frustrate developers and also hinder their ability to work flexibly in mobile environments. Moreover, the above energy cost of 0.32 kWh can translate into a considerable carbon footprint, amounting to approximately 0.14 kilograms of CO2 emissions. This carbon footprint is comparable to the emissions generated by driving a car for 0.6 miles.¹ With the expected widespread adoption of language models of code by many software developers in the near future, the cumulative carbon footprint stemming from model inference will become an increasingly pressing issue.
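The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the constants are the numbers quoted above, while the variable names and the reading of 0.22 as the ratio of battery capacity to daily inference energy are assumptions made here, not statements from the cited calculator.

```python
# Figures quoted in the paragraph above.
DAILY_ENERGY_KWH = 0.32   # energy for one day of CodeBERT inference
BATTERY_KWH = 0.07        # typical laptop battery (about 70 Wh)
DAILY_CO2_KG = 0.14       # emissions attributed to that daily energy

# Share of the daily inference workload that one full charge can cover.
battery_fraction = BATTERY_KWH / DAILY_ENERGY_KWH

# Carbon intensity implied by the two quoted figures (kg CO2 per kWh).
implied_intensity = DAILY_CO2_KG / DAILY_ENERGY_KWH

print(f"battery covers {battery_fraction:.2f} of the daily workload")
print(f"implied carbon intensity: {implied_intensity:.4f} kg CO2/kWh")
```

The implied intensity is simply the quotient of the two reported values; the actual factor used by the emissions calculator depends on the region and hardware selected.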
To date, few approaches have emerged to address the above issues [65,75]. Shi et al. [65] propose Compressor, the state-of-the-art approach that can compress language models of code down to 3 MB and thereby improve their inference latency. Compressor adopts the knowledge distillation technique [34] to transfer knowledge from a large model to a tiny one with a well-crafted architecture searched by their proposed genetic algorithm. However, while Compressor excels at optimizing the model size and inference latency, it does not encompass the optimization of two other critical aspects, i.e., energy consumption and carbon footprint. Additionally, Compressor's search space for small model architectures is limited solely to hyperparameters related to model size, like the number of network layers. This limited scope excludes configurations that can significantly affect a model's effectiveness, like the choice of tokenizer [39]. Consequently, it falls short of identifying the optimal small model. These limitations necessitate our work. Our work still follows the idea of using knowledge distillation to optimize language models for the sake of size and inference latency, but offers a novel take on simultaneously addressing the issues of energy consumption and carbon footprint.
¹ All of these calculations on energy consumption and carbon footprint are based on the Machine Learning Emissions Calculator: https://mlco2.github.io/impact.
This paper proposes Avatar, a novel approach aimed at optimizing language models of code for real-world deployment. Avatar accomplishes this by formulating the seeking of an optimal model as a multi-objective configuration tuning problem, where the optimization objectives include the simultaneous minimization of model size, inference latency, energy consumption, and carbon footprint, while maintaining effectiveness (e.g., prediction accuracy) on downstream tasks.
Avatar starts by identifying the key configurations within language models that impact the above objectives. It then innovatively combines a Satisfiability Modulo Theories (SMT) solver with a tailored multi-objective optimization algorithm to solve the configuration tuning problem. The SMT solver is used to construct a configuration space that adheres to the 3 MB model size constraint, while the multi-objective optimization algorithm identifies the Pareto-optimal set of configurations, i.e., the set of configurations that cannot be improved in one objective without making sacrifices in another, thereby achieving the best trade-off among all objectives.
To efficiently obtain the effectiveness of models during optimization without the need for expensive training and evaluation processes, Avatar builds a regression model serving as an effectiveness indicator. This indicator estimates a model's effectiveness solely based on its configurations, facilitating the quick identification of the Pareto-optimal configurations. Finally, Avatar leverages knowledge distillation to train a compact and environmentally-friendly model using the configurations from the Pareto-optimal set.
We evaluate Avatar using the same settings as the baseline method [65]. Our evaluation focuses on optimizing two representative language models of code: CodeBERT [18] and GraphCodeBERT [26]. We utilize two datasets for popular automated software engineering tasks: vulnerability prediction and clone detection. With Avatar, we produce optimized models with a compact size of 3 MB, a reduction of 160× compared to the original large language models. Across both tasks, these optimized models show a remarkable improvement in various aspects. They reduce inference latency by up to 76×, energy consumption by up to 184×, and carbon footprint by up to 157× compared to the original models. Importantly, these optimizations incur only a negligible loss in effectiveness, averaging 1.67%. Notably, Avatar outperforms the baseline method, Compressor, across all metrics. On average, Avatar achieves a 0.75% higher prediction accuracy. Additionally, it exhibits significant improvements in terms of inference latency (44× faster on average), energy consumption (up to 8× less), and carbon footprint (up to 7× less). Moreover, we also highlight the benefits of Avatar in the context of cloud deployment, showing that the optimized models can process up to 9.7× more queries per second than the original large language models of code.
Listing 1: Typical tunable configurations of language models of code.
The contributions of this paper are summarized as follows:
• Insight: We are the first to propose optimizing language models of code in terms of the energy consumption and carbon footprint by tuning their configurations.
• Technique: We propose and implement Avatar, a novel approach that uses an SMT solver and a tailored multi-objective optimization algorithm to optimize language models of code in terms of model size, inference latency, energy consumption, and carbon footprint, while maintaining effectiveness.
• Evaluation: We perform a thorough evaluation of Avatar, and the results show that Avatar effectively optimizes language models of code, greatly outperforming the state-of-the-art approach.

PRELIMINARIES
Language Models of Code and Their Configurations. The recent development and adoption of language models of code have enabled state-of-the-art results to be achieved on code-related tasks [35,49,55,56]. These powerful models are mainly built upon the Transformer architecture [74] and trained on large datasets of source code from various programming languages. Among these models, a notable category is encoder-only models such as CodeBERT [18] and GraphCodeBERT [26], which utilize solely the encoder component of Transformer and are specialized for program understanding tasks such as vulnerability detection [7] and code search [85]. These encoder-only models represent the software engineering community's early efforts at language models of code [35]. Due to their pioneering status, these models have long been used in various real-world applications like the Akvelon code search engine [2]. This has led to widespread popularity and social impact and thus motivated our study to focus on these models. Typically, encoder-only language models of code have a number of configurations that can be tuned to achieve varying levels of model performance. Listing 1 shows an example of tunable configurations from the Hugging Face implementation [15], 13 in total. Six of these configurations directly impact model size and inference latency, including the number of hidden layers, hidden size (i.e., the dimension of hidden layers), number of attention heads, vocabulary size, intermediate size (i.e., the dimension of feed-forward layers), and maximum sequence length. Larger values in these configurations tend to result in larger model sizes and longer inference latency, while smaller values may compromise model effectiveness (e.g., prediction accuracy). Compressor [65] focuses solely on tuning these configurations to optimize model size and inference latency at the cost of effectiveness.
However, there exist seven additional configurations that also contribute to model effectiveness. These include the choice of tokenizer, activation function for hidden layers, type of position embeddings, dropout rates for hidden layers and attention heads, learning rate, and batch size. For example, the choice of a tokenizer can affect a model's ability to capture the semantics of source code [39,42,64], thus impacting its overall effectiveness. In this study, we aim to tune all 13 configurations to achieve the best trade-off between model effectiveness and efficiency. We discuss the tuning space of these configurations and how to tune them in Section 3.
Knowledge Distillation. Knowledge distillation has proven to be an effective technique for optimizing large language models in terms of model size [41,59,65]. It compresses a large model (referred to as the teacher model) by training a small model (the student model) to mimic the behaviors of the large one (i.e., produce the same output given the same input) [5,24,34].
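To make the two groups of configurations concrete, the sketch below collects all 13 in one record. The class name `StudentConfig`, the field names, and every default value are illustrative assumptions; they are not the contents of Listing 1.

```python
from dataclasses import dataclass, fields


@dataclass
class StudentConfig:
    # Six configurations that directly affect model size and inference latency.
    num_hidden_layers: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12
    vocab_size: int = 50265
    intermediate_size: int = 3072
    max_sequence_length: int = 512
    # Seven configurations that mainly affect effectiveness.
    tokenizer: str = "BPE"
    hidden_act: str = "gelu"
    position_embedding_type: str = "absolute"
    hidden_dropout_prob: float = 0.1
    attention_dropout_prob: float = 0.1
    learning_rate: float = 1e-4
    batch_size: int = 16


SIZE_RELATED = (
    "num_hidden_layers", "hidden_size", "num_attention_heads",
    "vocab_size", "intermediate_size", "max_sequence_length",
)

config = StudentConfig()
all_names = [f.name for f in fields(config)]
effectiveness_only = [name for name in all_names if name not in SIZE_RELATED]
print(len(all_names), len(SIZE_RELATED), len(effectiveness_only))
```

Keeping the size-related names in a separate tuple makes it easy to hand only those six to a size constraint while tuning all 13.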
In line with recent work [65], our study leverages a task-specific distillation method introduced by Hinton et al. [34] to optimize language models of code. The algorithm of this method is shown in Listing 2. Specifically, given a language model of code that is fine-tuned for a specific task and a small model to be trained, we input training data into both models, collect the resulting output probability values (line 15), and then update the parameters of the small model (line 8) to minimize the training loss computed by the function shown in line 7. The intuition behind minimizing this loss function is to bring the outputs of the large and small models closer together. $p_i$ and $q_i$ in this function denote the outputs of the large and small models, respectively. $T$ is the softmax function's temperature parameter, as Hinton et al. [34] introduced. Note that the large model producing $p_i$ is fixed during the distillation process, while the small model producing $q_i$ is trained.
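A minimal sketch of such a loss is given below, assuming the common soft-target cross-entropy form with temperature-scaled softmax and a T² scaling factor. The function names and the exact form are assumptions for illustration and may differ from the function in Listing 2.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw scores."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_outputs, student_outputs, temperature=2.0):
    """Mean soft-target cross-entropy between teacher and student outputs."""
    loss = 0.0
    for p, q in zip(teacher_outputs, student_outputs):
        soft_p = softmax(p, temperature)  # fixed teacher distribution
        soft_q = softmax(q, temperature)  # trainable student distribution
        loss -= sum(sp * math.log(sq) for sp, sq in zip(soft_p, soft_q))
    return loss * temperature ** 2 / len(teacher_outputs)


teacher = [[2.0, -1.0], [0.5, 1.5]]
close_student = [[1.8, -0.9], [0.4, 1.6]]
far_student = [[-1.0, 2.0], [1.5, 0.5]]
print(distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student))
```

A student whose outputs track the teacher's receives a lower loss than one that disagrees, which is the behavior the training loop exploits when it updates only the student's parameters.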

METHODOLOGY

Problem Formulation
As introduced in Section 1, we aim to optimize the model size, inference latency, energy consumption, and carbon footprint of language models of code while maintaining their effectiveness (e.g., prediction accuracy on downstream tasks). Among these objectives, the inference latency, energy consumption, and carbon footprint are all related to the model's computational cost during inference. We use floating-point operations (FLOPs) to measure computational cost, following prior studies [29,61,65]. FLOPs count how many multiply and accumulate operations the model performs for each prediction. The more FLOPs a model has, the more time it will take to make a prediction, the more energy it will consume, and the more CO2 it will emit [61]. Therefore, we use FLOPs as the proxy for these three objectives. Then, combined with the model size and effectiveness, we formulate our optimization problem as follows:
$$\min_{c}\ \mathrm{size}(c), \quad \min_{c}\ \mathrm{FLOPs}(c), \quad \max_{c}\ \mathrm{effectiveness}(c), \qquad \text{s.t. } c \in \mathcal{C} \tag{1}$$
where $c$ is a set of configurations, and $\mathcal{C}$ defines the configuration space, as illustrated in Listing 3.
Most of these configurations offer a range of adjustable integer or decimal values. For instance, the vocabulary size is adjustable to any integer value ranging from 1,000 to 50,265. Some others involve selecting from predefined options. The tokenizer requires a choice among four popular tokenization methods: Byte-Pair Encoding [62], WordPiece [76], Unigram [45], and Word [42]. Additionally, we set the hidden activation function and position embedding type as tunable configurations following the Hugging Face implementation [15], which includes a few more advanced options than the original implementation of language models. The hidden activation function requires a choice from four options: Gaussian Error Linear Unit (GELU) [32], Rectified Linear Unit (ReLU) [30], Sigmoid Linear Unit (SiLU) [14], and a new GELU implementation (GELU_new) [15]. The position embedding type offers three choices: absolute, relative_key [63], and relative_key_query [37]. In total, the configuration space contains about 4.5 × 10^19 possible sets of configurations, which is much larger than the one used by Compressor that only tunes 5 configurations. Our configuration space is also extensible to include more configurations or more options for existing configurations, such as more tokenizer choices. Here we focus on the configuration space shown in Listing 3 as studies [39,65] and the Hugging Face implementation [15] have explicitly shown that these configurations and options have a significant impact on model effectiveness.
Solving the problem posed by Equation 1 is challenging for three reasons: (1) the tuning space of configurations is huge, which makes brute force impractical since evaluating all configurations is computationally infeasible; (2) utilizing off-the-shelf Satisfiability Modulo Theories (SMT) solvers that support solving constrained optimization problems is not a viable approach for solving this problem. This is because obtaining model effectiveness necessitates training and testing the model. Such a process cannot be formulated as a mathematical function of configurations that SMT solvers can handle; (3) this multi-objective optimization problem comes with objectives that conflict with one another. For example, a larger model typically has better effectiveness on downstream tasks but incurs higher FLOPs. Thus, solving Equation 1 involves finding a Pareto-optimal solution set, i.e., a set of trade-off solutions where no solution can be improved in one objective without degrading other objectives [10], rather than finding a single, unique solution.

Approach Overview
To address the above challenges, our approach, Avatar, is designed to solve the problem through a multi-step process outlined in Figure 1. First, we prune the configuration space using an SMT solver, with the 3 MB model size constraint suggested by prior studies [65,70] as the pruning criterion (Section 3.3). This initial step removes configurations that are irrelevant to our objectives, thereby facilitating the subsequent identification of Pareto-optimal configurations. Next, we sample a small number of configurations from the pruned space and use them to train a regression model that can predict the effectiveness of a model initialized by a given set of configurations, i.e., build an effectiveness indicator (Section 3.4). Subsequently, we use a multi-objective optimization algorithm, assisted by the effectiveness indicator, to identify the set of Pareto-optimal configurations within the pruned space (Section 3.5). Finally, we train a compact and environmentally-friendly model with the configurations from the Pareto-optimal set using the knowledge distillation technique introduced in Section 2. We describe these steps in detail below.

Pruning Configuration Space
The predefined configuration space shown in Listing 3 is incredibly large, with quintillions of possible configuration sets. However, only a fraction of them adhere to the constraints outlined in Section 1. For example, setting the vocabulary size to its maximum value of 50,265 will result in a model size that exceeds the 3 MB constraint, even with all other configurations minimized. Such configurations are thus considered irrelevant to our objectives and should be omitted from the configuration space to facilitate the subsequent process of identifying Pareto-optimal configurations.
We prune the configuration space by formulating and solving a constraint satisfaction problem using Microsoft Z3 [11], a state-of-the-art SMT solver known for efficiently handling nonlinear constrained optimization problems [6,21]. While Z3 cannot directly solve our primary optimization problem, it performs well at identifying and excluding configurations that violate specified constraints. One crucial constraint is related to model size, as introduced in Section 1, which specifies that the model size cannot exceed 3 MB. This constraint is the only explicit one suggested by prior studies [65,70], while acceptable standards for other objectives have not been empirically specified. We formulate the constraint satisfaction problem as follows, where $\mathcal{C}$ represents the configuration space, and $c$ denotes a set of configurations:
$$\mathrm{size}(c) = \mathrm{size}_{\text{embedding}}(c) + \mathrm{size}_{\text{transformer}}(c) + \mathrm{size}_{\text{classifier}}(c) \le 3\ \text{MB}, \qquad c \in \mathcal{C}$$
Solving this constraint satisfaction problem yields multiple sets of configurations that satisfy the model size constraint, which can then be merged to craft a new configuration space.
The above formula follows the official implementation of Compressor [65] to calculate the actual file size of a model in MB. It breaks down a language model of code into three components: the embedding, transformer, and classifier layers. By summing these components, the formula calculates the total model size. Note that this formula only considers the six configurations that directly affect model size, while excluding other configurations like the tokenizer from our constraint satisfaction problem-solving process.
We then use the above formula and the raw configuration space as inputs to Z3, to find the configurations for which the formula evaluates to a value less than 3 MB. Considering that solving with Z3 can slow down significantly when dealing with an overly large configuration space [21,73], we run Z3 by partitioning the configuration space into several smaller subspaces and processing them in parallel. Taking the vocabulary size as an example, we can partition the original range of 1,000 to 50,265 into 50 subranges, i.e., 1,000 to 2,000, 2,000 to 3,000, etc. These 50 subranges are then combined with the tuning ranges of other configurations, forming 50 subspaces. Each subspace's constraint satisfaction problem is treated as an independent task and solved in parallel using separate Z3 threads. Once all tasks are completed, we aggregate the results to form a new, pruned configuration space, as shown in Listing 4. The underlined entries, i.e., the vocabulary size, hidden size, and intermediate size, have been pruned. This process significantly reduces the configuration space from 4.5 × 10^19 to 1.3 × 10^19, which accounts for only 28.9% of the original space. Notably, the pruned configuration space still contains a broad and diverse range of configurations, providing sufficient space to identify Pareto-optimal solutions.
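The partition-and-merge idea can be sketched without a solver. The example below is not the Z3-based procedure: it swaps in a brute-force feasibility check, uses a hypothetical lower bound on model size (the token-embedding table alone, at 4 bytes per weight), and assumes a minimum hidden size of 16. It only illustrates how the vocabulary range is split into 50 subranges, checked in parallel, and aggregated.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_VOCAB, MAX_VOCAB = 1_000, 50_265
SIZE_LIMIT_MB = 3.0
MIN_HIDDEN_SIZE = 16  # hypothetical smallest hidden size


def embedding_size_mb(vocab_size, hidden_size):
    # Hypothetical lower bound: embedding table only, 4 bytes per float32 weight.
    return 4 * vocab_size * hidden_size / (1024 * 1024)


def split_range(low, high, step=1_000):
    """Split [low, high] into consecutive, non-overlapping subranges."""
    return [(start, min(start + step - 1, high))
            for start in range(low, high + 1, step)]


def feasible_vocab_sizes(subrange):
    """Keep vocabulary sizes that can fit the limit with the smallest hidden size."""
    low, high = subrange
    return [v for v in range(low, high + 1)
            if embedding_size_mb(v, MIN_HIDDEN_SIZE) <= SIZE_LIMIT_MB]


subranges = split_range(MIN_VOCAB, MAX_VOCAB)
with ThreadPoolExecutor() as pool:
    partial_results = list(pool.map(feasible_vocab_sizes, subranges))

# Aggregate the per-subrange results into one pruned range.
kept = [v for part in partial_results for v in part]
print(len(subranges), kept[0], kept[-1])
```

Under these assumptions the largest vocabulary sizes are pruned because even the embedding table alone would exceed the limit; a real run would replace `feasible_vocab_sizes` with one solver query per subspace over all six size-related configurations.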

Effectiveness Indicator
When tuning configurations, assessing the effectiveness of a model that has a given set of configurations is essential to determine whether it qualifies as a Pareto-optimal solution. However, obtaining model effectiveness through training and testing is computationally expensive. Inspired by recent work in leveraging machine learning techniques to predict the runtime performance of software [20,27,28], we propose to construct a regression model as a proxy for the training and testing process. Specifically, the regression model builds a computationally efficient function that maps a model's configurations to its effectiveness, enabling us to estimate a model's effectiveness using only the provided configuration as input. Consequently, this approach eliminates the need for resource-intensive model training and testing. We consider this regression model as an effectiveness indicator.
We follow the procedures outlined in Listing 5 to develop an effectiveness indicator. First, we randomly sample a set of configurations from the pruned configuration space (line 7). Next, we utilize the knowledge distillation technique introduced in Section 2 to train a model for each of these sampled configurations (line 10). We then evaluate the effectiveness of these models on the validation dataset (line 11), which has a similar distribution to the test dataset, but remains distinct and is not used for training. Subsequently, we use the sampled configurations and the corresponding effectiveness values to train a regression model that serves as our effectiveness indicator (line 12). For this purpose, we employ Bayesian Ridge Regression (BRR) [72]. BRR is a statistical regression method that combines Bayesian principles [51] with linear regression techniques [67]. It trains regression models by minimizing the squared difference between predicted and actual target values. BRR is particularly valuable when dealing with limited data points, which is the case for our effectiveness indicator since we have only a few sampled configurations. Note that the regression model usually takes numbers as inputs, while some of our configurations are strings. For these configurations, we use their corresponding indices in the tuning range as inputs to the regression model. For example, the tokenizer has four options, so we use 0, 1, 2, and 3 to represent them.
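The index encoding of string-valued configurations can be sketched as follows. The option lists follow the order in which the options are named in Section 3.1; that ordering, the dictionary keys, and the `encode` helper are assumptions for illustration.

```python
TOKENIZERS = ["BPE", "WordPiece", "Unigram", "Word"]
ACTIVATIONS = ["gelu", "relu", "silu", "gelu_new"]
POSITION_TYPES = ["absolute", "relative_key", "relative_key_query"]

CATEGORICAL = {
    "tokenizer": TOKENIZERS,
    "hidden_act": ACTIVATIONS,
    "position_embedding_type": POSITION_TYPES,
}


def encode(config):
    """Turn a configuration dict into a numeric feature vector."""
    vector = []
    for name, value in config.items():
        if name in CATEGORICAL:
            # String options are replaced by their index in the tuning range.
            vector.append(float(CATEGORICAL[name].index(value)))
        else:
            vector.append(float(value))
    return vector


sample = {
    "tokenizer": "Unigram",
    "hidden_size": 96,
    "hidden_act": "relu",
    "position_embedding_type": "absolute",
    "hidden_dropout_prob": 0.2,
}
features = encode(sample)
print(features)
```

Vectors produced this way, paired with measured validation accuracies, are the kind of input a Bayesian ridge regressor (for example, scikit-learn's `BayesianRidge`, if that library is used) could be fitted on; the source does not state which implementation is used.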

Multi-Objective Configuration Tuning
With the pruned configuration space and effectiveness indicator, we are now ready to introduce our innovative multi-objective configuration tuning algorithm, which is specifically designed to identify the set of Pareto-optimal configurations in terms of size, FLOPs, and effectiveness for optimizing large language models of code.
As presented in Listing 6, our algorithm takes the pruned configuration space, the effectiveness indicator, and the number of generations as inputs. It starts by generating an initial population of configuration sets by an adaptive random initialization method (line 5). These configurations are then assessed in terms of the three objectives (line 6): the size and FLOPs are calculated with the implementation of Compressor [65], while the effectiveness indicator predicts the effectiveness. The algorithm maintains an archive to store the Pareto-optimal configurations (line 7). This archive is initialized as an empty set and is updated throughout the algorithm's execution. Subsequently, it enters an iterative loop that runs for a specified number of generations. At each iteration, the algorithm applies three operators, i.e., two-point crossover, boundary random mutation, and correction, to generate new offspring from the population (lines 9 to 11). These offspring are then evaluated, and the archive of Pareto-optimal configurations is updated accordingly (lines 12 to 13). The next generation of population is selected from the current population and the offspring by a tournament selection method (line 14). After the loop terminates, the algorithm returns the archive of Pareto-optimal configurations (line 15). The main operators and steps are described in detail below.
Adaptive Random Initialization. We aim to assemble an initial population of highly diverse configuration sets, which can facilitate more efficient exploration of the configuration space. To achieve this, we employ adaptive random initialization [1,50], an extension of naive random search that attempts to maximize the Euclidean distance between the selected configurations in the population. Concretely, this method first randomly selects a configuration set $c$ from the configuration space. It then randomly selects another configuration set $c'$ and compares the Euclidean distance between $c$ and $c'$ with the distance between $c$ and the other configuration sets already present in the population. If the distance between $c$ and $c'$ exceeds those between $c$ and other configuration sets, $c'$ is added to the population. Otherwise, $c'$ is discarded. This process continues until the population reaches the desired size. Importantly, when calculating the Euclidean distance, as when training the effectiveness indicator, we replace the configuration in the form of strings with its corresponding numerical index within the tuning range.
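The sketch below illustrates distance-driven initialization. Note that it swaps in the common fixed-size-candidate-set variant of adaptive random search (at each step, keep the candidate whose nearest neighbour in the population is farthest away) rather than the exact acceptance rule described above, because that variant is guaranteed to terminate. The ranges in `SPACE` and all function names are hypothetical.

```python
import math
import random

# Hypothetical tuning ranges; categorical options are encoded by index.
SPACE = {
    "num_hidden_layers": (1, 12),
    "hidden_size": (16, 768),
    "tokenizer": (0, 3),
}


def sample(rng):
    """Draw one configuration uniformly at random from SPACE."""
    return [rng.randint(low, high) for low, high in SPACE.values()]


def adaptive_random_init(pop_size, candidates_per_step=10, seed=0):
    rng = random.Random(seed)
    population = [sample(rng)]
    while len(population) < pop_size:
        candidates = [sample(rng) for _ in range(candidates_per_step)]
        # Keep the candidate that lies farthest from its nearest neighbour.
        best = max(
            candidates,
            key=lambda c: min(math.dist(c, p) for p in population),
        )
        population.append(best)
    return population


population = adaptive_random_init(pop_size=5)
print(population)
```

Raw Euclidean distance is dominated by configurations with wide ranges such as the hidden size; whether the values are normalized first is not specified in the text, so the sketch leaves them unscaled.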
Two-Point Crossover. This operator, commonly used in metaheuristic algorithms such as genetic algorithms to solve optimization problems [44,66], aims to combine two parent configurations to generate new offspring configurations. It begins by randomly selecting two parent configurations and two crossover points. Subsequently, it swaps the values of the two parent configurations between these two crossover points to create two offspring configurations. For instance, if the two parent configurations are denoted as $p_1$ and $p_2$, and the selected crossover points are $k_1$ and $k_2$, the resulting offspring configurations are computed as follows:
$$o_1 = p_1[0:k_1] + p_2[k_1:k_2] + p_1[k_2:], \qquad o_2 = p_2[0:k_1] + p_1[k_1:k_2] + p_2[k_2:]$$
Here, $p_1[0:k_1]$ represents the values of $p_1$ before $k_1$, and $p_1[k_2:]$ represents the values of $p_1$ from $k_2$ to the end. The generated offspring configurations are then added to the population.
Boundary Random Mutation. This operator introduces random modifications to the values of a configuration set, resulting in a new offspring configuration. Following recent work utilizing genetic algorithms for optimization problems [65,79], we employ the boundary random mutation operator to generate offspring configurations. The process begins by randomly selecting a configuration from the population. Subsequently, for each configuration value within this selected configuration, a mutation rate $r$ is randomly chosen from the range of [0, 1]. If $r$ falls below a predefined threshold, the selected configuration value is set to a random value within its tuning range, while ensuring that the modified solution remains within the feasible configuration space, i.e., the boundary. The resulting offspring configuration is then incorporated into the population.
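Both operators can be sketched in a few lines on list-encoded configurations. The function names, the default threshold of 0.1, and the example bounds are assumptions for illustration; the source does not state the threshold value.

```python
import random


def two_point_crossover(p1, p2, rng):
    """Swap the segment between two random cut points of the parents."""
    k1, k2 = sorted(rng.sample(range(1, len(p1)), 2))
    o1 = p1[:k1] + p2[k1:k2] + p1[k2:]
    o2 = p2[:k1] + p1[k1:k2] + p2[k2:]
    return o1, o2


def boundary_random_mutation(config, bounds, rng, threshold=0.1):
    """Resample each value within its bounds with probability `threshold`."""
    child = list(config)
    for i, (low, high) in enumerate(bounds):
        if rng.random() < threshold:
            child[i] = rng.randint(low, high)  # stays inside the tuning range
    return child


rng = random.Random(42)
parent_a = [1, 1, 1, 1, 1]
parent_b = [2, 2, 2, 2, 2]
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)

bounds = [(1, 12), (16, 768), (1, 12), (1000, 49152), (16, 3072)]
mutated = boundary_random_mutation([6, 128, 4, 8000, 256], bounds, rng, threshold=0.5)
print(child_a, child_b, mutated)
```

Because mutation only ever draws from each value's own range, the offspring stays inside the box-shaped configuration space; constraints that couple several values are left to the correction step.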
Correction.The above crossover and mutation operators may produce invalid offspring configurations that are unusable for initializing models.For example, according to the implementation of Hugging Face [15], a model's hidden size must be divisible by the number of attention heads; otherwise, the model will fail to initialize due to dimension misalignment errors.To address such cases and rectify them, our tuning algorithm employs correction operators.When it encounters invalid offspring configurations, it discards their values and proceeds to randomly select new values until the offspring configuration becomes valid.
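A correction step for the divisibility rule mentioned above might look like the following sketch. The helper names, the tuning ranges, and the retry cap are assumptions; only the rule that the hidden size must be divisible by the number of attention heads comes from the text.

```python
import random


def is_valid(config):
    """A configuration is usable if heads evenly divide the hidden size."""
    return config["hidden_size"] % config["num_attention_heads"] == 0


def correct(config, ranges, rng, max_tries=1000):
    """Resample the offending values until the configuration is valid."""
    fixed = dict(config)
    tries = 0
    while not is_valid(fixed):
        if tries == max_tries:
            raise ValueError("no valid configuration found")
        for name in ("hidden_size", "num_attention_heads"):
            low, high = ranges[name]
            fixed[name] = rng.randint(low, high)
        tries += 1
    return fixed


ranges = {"hidden_size": (16, 768), "num_attention_heads": (1, 12)}
rng = random.Random(7)
broken = {"hidden_size": 100, "num_attention_heads": 7, "num_hidden_layers": 3}
repaired = correct(broken, ranges, rng)
print(repaired)
```

Values not involved in the violated rule, such as the number of hidden layers here, are left untouched, and an already valid configuration is returned unchanged.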
Tournament Selection. The selection operator plays a key role in constructing the next generation from the existing population and the newly generated offspring. Using the tournament selection method [17], a well-established technique in metaheuristic algorithms, a fixed number of configurations are randomly selected from the combined pool of the current population and offspring. Then, the Pareto-optimal ones are selected from these configurations and added to the next generation, ensuring that the most promising candidates are retained for the next iteration.
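A possible reading of this selection step is sketched below. It assumes that every objective is expressed so that smaller is better (for example, accuracy would be negated), and the names `dominates`, `tournament_select`, `objectives`, and `tournament_size` are illustrative rather than taken from Avatar.

```python
import random


def dominates(a, b):
    """a dominates b if it is no worse on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def tournament_select(pool, objectives, tournament_size=4, rng=random):
    """Draw a random tournament from the pool and keep its non-dominated members."""
    contestants = rng.sample(pool, tournament_size)
    scores = [objectives(c) for c in contestants]
    return [
        c
        for c, s in zip(contestants, scores)
        if not any(dominates(other, s) for other in scores)
    ]
```

Calling `tournament_select` repeatedly until enough configurations have been collected would fill the next generation.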
As mentioned above, the algorithm manages and continuously updates an archive of Pareto-optimal configurations throughout its execution. When evaluating a configuration set, the algorithm compares it with the configurations already present in the archive. If the evaluated configuration set is not dominated by any other configuration set in the archive, it secures its place within the archive. Additionally, if any configuration set in the archive is found to be dominated by the new configuration set, it will be excluded from the archive. This process ensures the archive contains only non-dominated configurations, i.e., Pareto-optimal solutions. The algorithm terminates when the specified number of generations is reached, at which point it returns the archive of Pareto-optimal configurations. We then select a configuration set from the archive to train a compact and green model using knowledge distillation.
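The archive update rule can be written down directly. The following sketch assumes that each archive entry is a tuple of objective values to be minimized; `update_archive` is a hypothetical helper name, and skipping exact duplicates is an added convenience that the text does not mention.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive, candidate):
    """Return the archive after offering it a newly evaluated candidate."""
    # Reject the candidate if it is a duplicate or dominated by an archived entry.
    if candidate in archive or any(dominates(kept, candidate) for kept in archive):
        return archive
    # Otherwise admit it and evict every entry it dominates.
    return [kept for kept in archive if not dominates(candidate, kept)] + [candidate]
```

For example, offering the points (3, 3), (1, 5), (2, 2), (4, 4), and (5, 1) in that order leaves (1, 5), (2, 2), and (5, 1) in the archive: (3, 3) is evicted once (2, 2) arrives, and (4, 4) is never admitted.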

EMPIRICAL EVALUATION
Our evaluation aims to answer the following research questions:
• RQ1 (Effectiveness): How effective is Avatar in optimizing language models of code?
• RQ2 (Comparison): How does Avatar compare to the state-of-the-art method in optimizing language models of code?

Experimental Setup
Tasks and Datasets. Following the evaluation settings in prior work [65], we assess the performance of Avatar on two popular software engineering tasks: vulnerability prediction and clone detection. Table 1 provides an overview of the datasets used in our experiments. These datasets encompass different programming languages and sizes, allowing for a thorough evaluation of Avatar. More details on the tasks and datasets are provided below.
The vulnerability prediction task involves determining whether a given code snippet is vulnerable or not. Integrating vulnerability prediction models into an IDE can significantly assist developers in identifying critical program defects early, thus enhancing software quality and reducing maintenance costs. For our experiment, we utilize the Devign dataset [86], released by Zhou et al. It contains 27,318 functions from two popular open-source C libraries, i.e., FFmpeg and Qemu. The dataset was constructed by manually annotating whether these functions contain vulnerabilities or not. We first follow the CodeXGLUE [49] benchmark for dataset splitting, allocating 80% for training, 10% for validation, and 10% for testing. To facilitate knowledge distillation, which requires unlabeled data, we follow Compressor [65] to evenly divide the training set into two mutually exclusive halves. One half is used for fine-tuning the language models, while the other, with erased labels, serves to train the model with configurations generated by Avatar.
The clone detection task aims to identify whether two given functions are code clones, assisting in recognizing redundant implementations of the same functionalities during software maintenance. For evaluating Avatar's effectiveness in clone detection, we select the widely-used BigCloneBench dataset [69]. This dataset is collected by mining the clones of specific functionalities in 25,000 Java projects sourced from the SourceForge and Google Code platforms. It includes over 6,000,000 pairs of cloned Java methods, along with 260,000 non-clone pairs. We follow recent studies [65,79] to randomly select 90,102 examples (i.e., 10% of the original training dataset) for training and reserve 4,000 for validation and testing. Then, we divide the training data into labeled and unlabeled portions of equal size, which are used for fine-tuning the large models and training the optimized models, respectively.
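The splitting procedure for the vulnerability prediction data can be sketched as follows. The sketch assumes a list of `(code, label)` pairs and uses a seeded random shuffle; CodeXGLUE ships a fixed split, so the shuffle and the function name `split_dataset` are stand-ins for illustration only.

```python
import random


def split_dataset(examples, seed=0):
    """80/10/10 split, then halve the training set into labeled and unlabeled parts."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_train, n_valid = n * 8 // 10, n // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]

    half = len(train) // 2
    labeled = train[:half]                               # fine-tunes the large model
    unlabeled = [code for code, _label in train[half:]]  # labels erased for distillation
    return labeled, unlabeled, valid, test
```

Keeping the two training halves mutually exclusive means the compact model is never trained on examples whose ground-truth labels the large model saw during fine-tuning.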
Language Models of Code. To evaluate Avatar, we follow Shi et al. [65] and use two popular encoder-only language models of code: CodeBERT [18] and GraphCodeBERT [26]. These two models share the same architecture and have been pre-trained on the CodeSearchNet dataset [38]. CodeBERT undergoes pre-training with two tasks: masked language modeling, which predicts masked tokens in input texts, and replaced token detection, which identifies whether a token in a given input has been replaced. GraphCodeBERT also uses masked language modeling, but additionally incorporates code graph structure information by predicting masked nodes in data flow graphs during pre-training. After pre-training, both CodeBERT and GraphCodeBERT can be fine-tuned on downstream tasks, enabling them to achieve state-of-the-art performance [49,56,82].

Table 2: Results of Avatar and the original language models on the two tasks. "CB" and "GCB" denote CodeBERT and GraphCodeBERT, respectively. "ACC" is the prediction accuracy. "LAT" is the inference latency. "E" is the energy consumption. "CO2" is the CO2 emission, i.e., the carbon footprint.

To fine-tune CodeBERT, we use the hyperparameter settings from the CodeXGLUE benchmark [49]. In the case of GraphCodeBERT, we follow the hyperparameter settings described in the GraphCodeBERT paper [26]. All models deliver results comparable to those reported in the previous study [82].
Evaluation Metrics. After obtaining the model trained with configurations tuned by Avatar, we compare it with the original language model and the model generated by our baseline method, Compressor, using six metrics: effectiveness, model size, inference latency, energy consumption, carbon footprint, and Giga floating-point operations (GFLOPs). Effectiveness is evaluated by prediction accuracy on the two downstream tasks, following prior studies [65,79]. Model size is quantified in megabytes (MB). For inference latency, which is measured in milliseconds (ms), we standardize experimental conditions by limiting all models to use only 8 CPU cores, simulating running on a typical consumer-grade laptop. The testing datasets are used to query the models, and the average inference latency is calculated for each data example. Note that we use a batch size of 1 to replicate real-world scenarios where models are deployed on laptops and only process a single input at a time.
To evaluate energy consumption and carbon footprint, we use the Machine Learning Emissions Calculator, developed by Lacoste et al. [46]. The tool requires the total running time of a model as input and outputs the energy consumption and carbon footprint, measured in kilowatt-hours (kWh) and kilograms (kg), respectively. We record the total running time of the models on the testing datasets as input to the tool, and consistent with our inference latency evaluation, we use a batch size of 1. Additionally, as mentioned in Section 3, GFLOPs are commonly used to quantify the computational cost of a model, which is closely related to energy consumption and carbon footprint. Thus, we also report GFLOPs to illustrate how Avatar contributes to environmental sustainability by reducing the computational cost of language models of code.
Implementation. We run all experiments on an Ubuntu 18.04 server equipped with an Intel Xeon E5-2698 CPU, 504 GB of RAM, and 8 Tesla V100 GPUs. To prune the configuration space with Z3, we partition it into 25,600 subspaces and execute Z3 in parallel across 80 CPU cores. For training the effectiveness indicator, we sample 20 sets of configurations from the pruned configuration
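The latency measurement described above amounts to timing one query at a time and averaging. The sketch below assumes `model` is any callable that takes a single example; the helper name `average_latency_ms` is hypothetical. How the 8-core limit is enforced is not stated in the text; in PyTorch, one option would be `torch.set_num_threads(8)`.

```python
import statistics
import time


def average_latency_ms(model, test_examples):
    """Average per-example inference latency in milliseconds, with a batch size of 1."""
    latencies = []
    for example in test_examples:
        start = time.perf_counter()
        model(example)  # a single input per call, as on a developer's laptop
        latencies.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(latencies)
```

A warm-up pass before timing would reduce the effect of one-off initialization costs on the average; whether the reported numbers include such a pass is not specified.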
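As a rough illustration of how running time translates into energy and carbon figures, the sketch below multiplies power draw by running time and then by a grid carbon intensity. It is a simplified model of this kind of calculation, not the calculator's actual code; the 300 W power draw and the 0.4 kg/kWh intensity in the usage note are assumed values.

```python
def energy_kwh(runtime_seconds, power_watts):
    """Energy in kilowatt-hours for a device drawing constant power."""
    return power_watts * (runtime_seconds / 3600.0) / 1000.0


def carbon_kg(energy, intensity_kg_per_kwh):
    """Carbon footprint in kilograms of CO2 for a given grid carbon intensity."""
    return energy * intensity_kg_per_kwh
```

Under these assumptions, one hour of running time at 300 W gives `energy_kwh(3600, 300)`, i.e., about 0.3 kWh, and `carbon_kg(0.3, 0.4)` gives about 0.12 kg. Since both quantities scale linearly with running time in this model, a model that finishes the test set N times faster on the same hardware uses N times less energy.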

Effectiveness of Avatar (RQ1)
After obtaining the Pareto-optimal configurations using Avatar, we select the configuration with a model size closest to 3 MB for training the optimized model. This results in a model that is approximately 160× smaller than the original language model of code for each task. Table 2 shows the experimental results comparing the optimized models with the original ones. On the two tasks, the optimized models exhibit an average decrease in accuracy of only 1.67% (≈ (0.70% + 2.63%)/2) compared to the original large models. This accuracy result illustrates that Avatar significantly optimizes model size with only a negligible loss in effectiveness on downstream tasks. Furthermore, the inference latency of the optimized models sees a substantial reduction on both tasks, with an average reduction of 62× for vulnerability prediction and 76× for clone detection. Prior research [52] has suggested that software practitioners are willing to accept a small sacrifice in effectiveness in exchange for a significant improvement in usability. Therefore, we consider the reduced accuracy of the optimized models to be acceptable in practical applications.
Table 2 also presents results of optimizing language models in terms of environmental sustainability. We employ the Machine Learning Emissions Calculator [46] to calculate the energy consumption and carbon footprint of the optimized models, comparing them to the original ones. Note that these results are calculated using a single NVIDIA Tesla V100 GPU and encompass the cost of running the entire testing dataset rather than a single query. On the two tasks, the energy consumption of the optimized models sees a significant reduction, averaging 53× and 184× less, respectively. This reduction extends to a corresponding decrease in carbon footprint, ranging from 51× to 157× less. Additionally, we observe a notable reduction in GFLOPs for the optimized models, with an average reduction of 212× and 147× on the two tasks, respectively. These results underscore the sustainability benefits that the optimized models can offer in real-world deployments.
Answers to RQ1: Avatar effectively optimizes language models of code in terms of model size (160× smaller), inference latency (up to 76× faster), energy consumption (up to 184× less), and carbon footprint (up to 157× less), with only a negligible loss in effectiveness (1.67% on average).

Avatar vs. Compressor (RQ2)
As the baseline for our experiments, we employ Compressor, the approach proposed by Shi et al. [65]. To ensure a fair comparison, we directly utilize the models available in the official repository of Compressor. The models produced using Compressor and Avatar have a similar size of 3 MB. The evaluation results comparing these approaches are presented in Table 3.
Compared to the models optimized by Compressor, the models produced by Avatar exhibit a slightly higher accuracy, with an average improvement of 0.75% (≈ (1.45% + 0.07%)/2) on the two tasks. These results suggest that Avatar can optimize language models of code more effectively without compromising effectiveness as much as Compressor. More importantly, the models optimized by Avatar demonstrate significant improvements in inference latency on both tasks. Compressor produces models with an inference latency in the hundreds of milliseconds range, while the optimized models obtained by our approach have a maximum latency of 29 ms. On average, the inference latency of the models optimized by Avatar is 44× (≈ (33 + 54)/2) faster than that of the ones produced by Compressor, which highlights the effectiveness of Avatar in enhancing the usability of language models compared to the state-of-the-art approach.
Avatar also improves the energy consumption of the optimized models by 3× and 8× compared to Compressor on vulnerability prediction and clone detection, respectively. These reductions also translate into a corresponding decrease in carbon footprint, with reductions of 4× and 7× on the two tasks. Overall, except for model size, the models optimized by Avatar outperform the ones optimized by Compressor across all metrics.
Answers to RQ2: Avatar significantly outperforms Compressor (i.e., the state-of-the-art approach) in terms of prediction accuracy (0.75% higher on average), inference latency (44× faster on average), energy consumption (up to 8× less), and carbon footprint (up to 7× less).

DISCUSSIONS

Efficiency of Avatar
We investigate the time taken by Avatar to optimize language models of code, breaking it down into four parts: pruning the configuration space, building the effectiveness indicator, executing the configuration tuning algorithm, and training optimized models.
In our experimental setup, the parallel execution of pruning the configuration space takes just 5 minutes to complete. After that, Avatar uses a single 16 GB Tesla V100 GPU to train 20 models for constructing the effectiveness indicator, consuming approximately 10 hours. Note that this overhead is only rarely incurred, e.g., the first time a language model is optimized for deployment, which may occur only on a monthly or yearly basis. Because of the carefully pruned configuration space and the specialized optimization algorithm, Avatar efficiently returns Pareto-optimal configurations in about 2 minutes. The subsequent knowledge distillation phase requires more time, with Avatar taking an average of 14.9 and 18.3 minutes to train an optimized model for the vulnerability prediction and clone detection tasks, respectively. These results underscore the fact that Avatar can produce well-performing optimized models with much less time cost than fine-tuning or pre-training large language models, which often takes a few hours or days [65].

Usefulness in Cloud Deployment
The primary goal of Avatar is to optimize language models of code for deployment on developers' personal devices like laptops. As mentioned in Section 1, we hold this perspective due to privacy concerns [36,48,57,80] and the need for use under poor network conditions. Deploying models on cloud servers may not be a viable option because it requires sending code to third-party vendors, which is prohibited by some companies that consider their code bases to be important intellectual property. Also, cloud deployment may result in higher inference latency for developers in regions with poor bandwidth or Internet coverage. However, we acknowledge that cloud deployment is a common practice today, offering more computing resources and scalability to support a larger user base. Therefore, it is worthwhile to also discuss the benefits of optimized models in the context of cloud deployments.
We run experiments assuming that the models process queries in batch mode with a batch size of 100. These experiments are run on a server equipped with a Tesla V100 GPU. We send the queries directly from the GPU's host machine to eliminate any potential impact from network fluctuations, and then measure how many queries the models can process per second. The experimental results, presented in Table 4, show that compared to the original language models of code, the optimized models can process on average 3.9× and 9.7× more queries per second on the two tasks, respectively. These results highlight the advantages of using Avatar for deploying large language models of code on cloud servers.
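The throughput measurement can be sketched in the same spirit. Here `model` is assumed to be any callable that accepts a list of queries, and `queries_per_second` is a hypothetical helper name; the default batch size of 100 follows the setup described above.

```python
import time


def queries_per_second(model, queries, batch_size=100):
    """Feed the queries in fixed-size batches and report the throughput."""
    start = time.perf_counter()
    processed = 0
    for i in range(0, len(queries), batch_size):
        batch = queries[i:i + batch_size]  # the final batch may be smaller
        model(batch)
        processed += len(batch)
    elapsed = time.perf_counter() - start
    return processed / elapsed
```

Because the queries are issued from the same host as the model, the measurement reflects compute time only and excludes network effects.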

Threats to Validity
One potential threat to internal validity is the randomness inherent in the configuration tuning algorithms used in our experiments. To address this concern, we have run each experiment 10 times and reported the average results, as recommended by Arcuri and Briand [3]. Regarding external validity, a potential threat is that our results may not be generalizable to other models and tasks beyond the ones we have studied. To mitigate this threat, we have carefully selected two representative encoder-only language models of code and two popular downstream tasks with different characteristics for our evaluation. This helps reduce bias in our results and suggests that our method potentially applies to a broad context. While we have not yet applied our method to other types of language models, such as decoder-only models, which have also recently gained popularity, we plan to extend our study to those models to further validate our work's generalizability in the future. One threat to construct validity is that the evaluation metrics may not fully capture the performance of Avatar and the baseline in enhancing the usability and sustainability of language models of code. To mitigate it, we use a total of six widely-used evaluation metrics to compare the effectiveness of Avatar and the baseline from a comprehensive set of perspectives.

RELATED WORK
In recent years, both the natural language processing and software engineering communities have dedicated their efforts to optimizing language models. However, unlike our work, which seeks to simultaneously optimize multiple aspects of language models of code, most existing studies focus on reducing model size only, thereby indirectly mitigating other related issues such as inference latency. These existing studies typically fall into three main categories: model pruning, model quantization, and knowledge distillation.
Model pruning and quantization involve directly altering model parameters to reduce model size. Model pruning replaces certain parameters with zeros, or removes network components like hidden layers [16,54]. Model quantization converts a model's 32-bit floating-point parameters into lower-bit fixed-point values [19,43,81]. These techniques have proven effective in reducing model size to a level suitable for deployment in scenarios with less stringent requirements. A recent study has also demonstrated their potential to reduce the computational cost and carbon footprint of language models of code [75], offering a promising avenue for future research. However, these techniques fall short of meeting the 3 MB model size recommendation put forth by Svyatkovskiy et al. [70] within the context of software engineering. As a result, we have chosen not to include them in our pipeline and comparison experiments.
We have introduced knowledge distillation in Section 2, an essential step in Avatar and the baseline. While several knowledge distillation methods have been proposed, most of them typically result in models ranging from 100 to 200 MB [41,60,68,77]. Some studies [8,71,78,84] have successfully optimized language models into sizes ranging from 20 to 40 MB. Notably, only Compressor [65] has achieved the remarkable feat of optimizing a large language model of around 500 MB into a compact 3 MB model. Therefore, we only compare Avatar with Compressor in our experiments.
The software engineering research community has also explored alternative methods for optimizing language models of code. For example, Grishina et al. [25] propose using only the initial layers of language models during inference to reduce resource consumption. Additionally, Zhang et al. [83] introduce a technique to simplify the input programs for CodeBERT, significantly reducing computational cost without compromising model performance. Despite these efforts, there are still gaps in optimizing language models of code to simultaneously improve usability and environmental sustainability. To the best of our knowledge, our study is the first to address both aspects concurrently.
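To make the two parameter-level techniques concrete, the sketch below shows one simple variant of each on a plain list of weights: symmetric 8-bit quantization and magnitude-based pruning. Both variants are assumptions picked for illustration; the cited works use a range of more elaborate schemes.

```python
def quantize_int8(weights):
    """Map floating-point weights to 8-bit integers with a single shared scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floating-point weights from their 8-bit form."""
    return [q * scale for q in quantized]


def magnitude_prune(weights, sparsity):
    """Replace the given fraction of smallest-magnitude weights with zeros."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned
```

Going from 32 bits to 8 bits per parameter shrinks storage by roughly 4×, which helps explain why such techniques alone fall well short of a 160× reduction.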

CONCLUSION AND FUTURE WORK
This paper proposes Avatar, a novel approach that can optimize large language models of code in terms of model size, inference latency, energy consumption, and carbon footprint without sacrificing effectiveness (e.g., prediction accuracy on downstream tasks) by much, thereby improving the usability and environmental sustainability of language models of code. The key idea of Avatar is to formulate the optimization of language models as a multi-objective configuration tuning problem and solve it with the help of SMT solvers and a tailored optimization algorithm. We evaluate Avatar with two state-of-the-art language models, i.e., CodeBERT and GraphCodeBERT, on two popular tasks, i.e., vulnerability prediction and clone detection. We use Avatar to produce optimized models with a small size (3 MB), which is 160× smaller than the original large models. On the two tasks, the optimized models can significantly reduce the energy consumption (up to 184× less), carbon footprint (up to 157× less), and inference latency (up to 76× faster), with only a negligible loss in effectiveness (1.67% on average). Compared with the state-of-the-art approach, Avatar optimizes language models of code more effectively in all metrics.
In the future, we plan to further investigate the effectiveness and efficiency of our proposed approach Avatar by experimenting with more large language models of code beyond those considered in this paper, such as the generative language models of code.

Listing 2: Algorithm of knowledge distillation.

Figure 1: The workflow of Avatar.
Listing 4: The pruned configuration space. It contains around $1.3 \times 10^{19}$ sets of configurations, 28.9% of the original one. The underlined entries are pruned (Section 3.3).
input C: pruned configuration space
input T: language model of code (teacher model)
input D_t: training dataset
input D_v: validation dataset
input τ: temperature parameter
input n: number of sampled configurations
S = sample(C, n), R = { }
for c in S:
    M_c = initialize(c)
    M_c = knowledge-distillation(T, M_c, D_t, τ)
    a_c = test(M_c, D_v)
return Bayesian-Ridge-Regression({c, a_c})

Listing 5: Algorithm for building an effectiveness indicator.

As pointed out in Section 2, a language model typically offers a handful of tunable configurations that directly determine the model size. Let $v$ denote the vocabulary size, $l$ denote the number of hidden layers, $h$ denote the hidden size, $i$ denote the intermediate size, $a$ denote the number of attention heads, and $s$ denote the maximum sequence length. Then the model size can be calculated as follows:

Table 1: Overview of datasets used in our experiments.

Table 3: Results of Avatar and Compressor on the two tasks. "CB" and "GCB" denote CodeBERT and GraphCodeBERT, respectively. "ACC" is the prediction accuracy. "LAT" is the inference latency. "E" is the energy consumption. "CO2" is the CO2 emission, i.e., the carbon footprint.

Table 4: Usefulness of Avatar in cloud deployment. The results show how many queries the models can process per second when deployed on a cloud server.