Large Language Models for Code: Security Hardening and Adversarial Testing

Large language models (large LMs) are increasingly trained on massive codebases and used to generate code. However, LMs lack awareness of security and are found to frequently produce unsafe code. This work studies the security of LMs along two important axes: (i) security hardening, which aims to enhance LMs' reliability in generating secure code, and (ii) adversarial testing, which seeks to evaluate LMs' security from an adversarial standpoint. We address both of these by formulating a new security task called controlled code generation. The task is parametric and takes as input a binary property to guide the LM to generate secure or unsafe code, while preserving the LM's capability of generating functionally correct code. We propose a novel learning-based approach called SVEN to solve this task. SVEN leverages property-specific continuous vectors to guide program generation towards the given property, without modifying the LM's weights. Our training procedure optimizes these continuous vectors by enforcing specialized loss terms on different regions of code, using a high-quality dataset carefully curated by us. Our extensive evaluation shows that SVEN is highly effective in achieving strong security control. For instance, a state-of-the-art CodeGen LM with 2.7B parameters generates secure code 59.1% of the time. When we employ SVEN to perform security hardening (or adversarial testing) on this LM, the ratio is significantly boosted to 92.3% (or degraded to 36.8%). Importantly, SVEN closely matches the original LMs in functional correctness.

Although LMs excel in functional correctness, they may produce code with security issues [26,28,75]. An evaluation in [60] discovered that, in various security-relevant scenarios, 40% of Copilot-generated programs contain dangerous vulnerabilities. This evaluation was reused in [69], which found that other state-of-the-art LMs [35,57,69] have a similarly concerning security level as Copilot. Another study in [44] found that in 16 out of 21 security-relevant cases, ChatGPT [4] generates code below minimal security standards. In practice, users can always reject or modify LM-suggested code, including any LM-generated vulnerabilities. The authors of the Copilot evaluation conducted a follow-up user study that considers such human interaction [66]. The study concluded that while LM-assistance provides productivity gains, it does not lead developers to produce significantly more security bugs. This finding reaffirms LMs' usefulness even in security-sensitive scenarios. However, considerable effort is still required to rule out vulnerabilities in LM-suggested code, either manually during coding or through retrospective security analysis after coding.

Security Hardening and Adversarial Testing
In this work, we investigate the security of LMs for code in two complementary directions. First, we introduce security hardening in order to enhance LMs' ability to generate secure code. Second, we explore the potential of degrading LMs' security level from an adversarial perspective. To accomplish these goals, we formulate a new security task called controlled code generation. This task involves providing LMs with an additional binary property, alongside the prompt, that specifies whether they should generate secure (for security hardening) or unsafe code (for adversarial testing). Our proposed task is analogous to controlled text generation, which aims to alter text properties such as sentiment and toxicity [30,41,43,46,47,62]. However, to the best of our knowledge, we are the first to study controlled generation for code security. We propose to address controlled code generation using a learning-based approach, for which we highlight three challenges described as follows.
Challenge I: Modularity Due to the massive size of existing LMs, it can be prohibitively expensive to repeat pretraining or even perform fine-tuning, both of which change LMs' entire weights. Thus, we desire to train a separate module that can be plugged into LMs to achieve security control without overwriting their weights. Moreover, given the difficulty of obtaining high-quality security vulnerabilities [25,29,39,59], our approach should be efficiently trainable on a small amount of data.
Challenge II: Functional Correctness vs. Security Control When enforcing security control, it is essential that LMs' ability to produce functionally correct code is maintained. For security hardening, this preserves LMs' usefulness, while for adversarial testing, maintaining functional correctness is crucial for imperceptibility. An LM with security control but severely deteriorated functional correctness is of little practical value, as it can be easily detected and abandoned by the end user. Figure 1 provides a conceptual illustration of our objective, which requires simultaneously achieving strong security control (dashed curve) and preserving functional correctness (solid curve). The key challenge is to design a training mechanism that successfully realizes this dual objective.

Challenge III: Ensuring High-quality Training Data
The quality of the training data is critical for the effectiveness of our approach, as with many other machine learning methods [20,39,45]. Specifically, the training data must align with and generalize to our code completion setting. Furthermore, it must accurately capture true security fixes. To avoid learning undesirable program behaviors, irrelevant code artifacts, such as refactorings and functional edits, must be excluded. Although various vulnerability datasets exist [25,34,53,58,76,80], they are not fully appropriate for our task or even suffer from severe data quality issues [29]. Therefore, we must analyze how they meet our requirements and construct high-quality training data accordingly.
Our Solution: SVEN We introduce SVEN, a novel method to address the challenging task of controlled code generation. SVEN realizes modularity by keeping the LM's weights unchanged and learning two new, property-specific sequences of continuous vectors, known as prefixes [50]. To generate code with a desired property, SVEN plugs the corresponding prefix into the LM as its initial hidden states, prompting the LM in the continuous space. The prefix influences the computation of subsequent hidden states through the attention mechanism, guiding the LM to generate code that meets the property's requirements. Because the prefix parameters are tiny w.r.t. the LM (e.g., ∼0.1% in our experiments), SVEN is lightweight and can be efficiently trained on a small amount of data. Continuous prompting is widely used for cost-effectively adapting LMs to different NLP tasks [38,49,50,55,63]. However, we are the first to apply this technique to control code security.
To balance security control and functional correctness, SVEN carefully optimizes the prefixes with specialized loss terms that operate on different code regions. Our training dataset consists of security fixes extracted from GitHub commits, where each fix includes a program pair: the program before (resp., after) the fix is insecure (resp., secure). We make the key observation that only the edited code in these fixes is decisive for security, while the unchanged code is neutral. Accordingly, we divide the training programs into changed and unchanged regions. In changed regions, we optimize the prefixes for security control using a conditional language modeling loss and a contrastive loss between security and vulnerability. In unchanged code regions, we constrain the prefixes to preserve the LM's original capabilities. To this end, we leverage a loss based on KL divergence [17] to regularize the prefixes to comply with the original LM in next-token probability distributions. We thoroughly review existing vulnerability datasets and find that they do not fully meet our requirements for data quality: some are specific to certain projects or vulnerabilities, thus lacking generalizability to daily code completion scenarios [25,53,80]; others are at a commit level, which can contain undesirable code artifacts [34,58,76]. To obtain a high-quality dataset for SVEN, we perform manual curation on [34,58,76], which results in ∼1.6k programs. (Our code, models, and datasets are available at https://github.com/eth-sri/sven.) We detail our dataset reviewing and curation processes in Section 4.3. While small, the curated dataset is sufficient for effectively training SVEN due to SVEN's data efficiency discussed earlier. As shown in Section 6.3, our dataset outperforms a baseline dataset that is constructed by indiscriminately including ∼19x more program pairs from [34,58,76] at the cost of lower data quality.
Evaluating SVEN We perform an extensive evaluation of SVEN on both security control and functional correctness. To assess security, we adopt the state-of-the-art security evaluation frameworks for LM-based code generators [60,68], which cover diverse impactful vulnerabilities, such as those from the MITRE top-25 most dangerous software weaknesses [1]. The results show that SVEN achieves strong security control. Take the state-of-the-art CodeGen LM [57] with 2.7B parameters as an example. The original LM generates secure programs with a ratio of 59.1%. After we perform security hardening (resp., adversarial testing) with SVEN, the ratio is significantly increased to 92.3% (resp., decreased to 36.8%). Additionally, SVEN is able to preserve functional correctness: its pass@k scores closely match the original LMs on the widely adopted HumanEval benchmark [26]. We also provide ablation studies confirming the usefulness of our key techniques, as well as experiments exploring SVEN's generalizability to prompt perturbations, different LMs, and vulnerability types that are not part of SVEN's training.

SVEN's Security Implications
With modular design, enhanced security, and reliable functional correctness, SVEN can be seamlessly applied to harden existing commercial code completion engines based on LMs [2,3,8,9,72], providing substantial benefits to their extensive user base. Moreover, to the best of our knowledge, SVEN is the first work to provide a realistic adversarial evaluation for LMs of code, under the constraint of preserving functional correctness for imperceptibility.

Main Contributions Our main contributions are:
• A new security task called controlled code generation (Section 3), which can be used to perform both security hardening and adversarial testing of LM-based code generators (Section 5).
• SVEN, a novel solution to the above task, including modular inference (Section 4.1) and specialized training procedures that balance security control and functional correctness (Section 4.2).
• A manually curated, high-quality training dataset, which is suitable for our controlled code generation task and can be of general interest for other tasks (Section 4.3).
• An extensive evaluation of SVEN on different vulnerabilities, benchmarks, and LMs (Section 6).

BACKGROUND AND RELATED WORK
In this section, we provide necessary background knowledge and a discussion on closely related work.
Code Generation with Large Language Models Recent works have proposed a number of large LMs for modeling code, such as Codex [26], PaLM [28], AlphaCode [51], CodeGen [57], and many others [19,35,69,77]. These LMs are capable of suggesting functionally correct code completions and solving competitive programming problems. They are all based on the Transformer architecture [74], which can handle long sequences thanks to its self-attention mechanism that accesses all previous hidden states. At inference time, an LM-based code generation model takes a prompt as input, which can be a partial program or natural language documentation expressing the functionality desired by the user. The prompt is converted to a sequence of tokens and fed into the LM. Then, the LM generates new tokens one by one, until it reaches special tokens indicating the end of generation or the length budget is exhausted. Finally, the generated tokens are transformed back into program text form to produce the final completion.
Formally, we model a program x as a sequence of tokens, i.e., x = [x_1, ..., x_|x|], and utilize a Transformer-based, autoregressive LM that maintains a sequence of hidden states. At step t, the LM computes the hidden state h_t from the current token x_t and the sequence of all previous hidden states h_<t. h_t consists of key-value pairs used for attention computations. The number of pairs is equal to the number of layers in the LM. The LM further transforms h_t into the next-token probability distribution P(x_{t+1} | h_≤t). The probability of the entire program is computed by multiplying the next-token probabilities using the chain rule: P(x) = ∏_{t=1}^{|x|} P(x_t | h_<t). The initial hidden states h_<1 are usually empty. In Section 4, we explain how SVEN leverages non-empty, trained initial hidden states to control the security of generated programs.
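The chain-rule factorization above can be illustrated with a small sketch (the token probabilities below are made-up toy values, not outputs of any real LM): the probability of a whole program is the product of the next-token probabilities, typically accumulated in log space for numerical stability.

```python
import math

def sequence_log_prob(token_probs):
    """Chain rule: log P(x) = sum over t of log P(x_t | h_<t).

    token_probs: the next-token probabilities that a (hypothetical) LM
    assigned to the tokens it actually produced.
    """
    return sum(math.log(p) for p in token_probs)

# Toy example: a three-token program whose tokens were assigned
# probabilities 0.5, 0.8, and 0.25 by the model.
log_p = sequence_log_prob([0.5, 0.8, 0.25])
prob = math.exp(log_p)  # P(x) = 0.5 * 0.8 * 0.25 = 0.1
```

Working in log space avoids underflow: for realistic programs with hundreds of tokens, the raw product quickly becomes smaller than floating-point precision can represent.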
We generate programs by sampling from the LM in a left-to-right fashion. At step t, we sample x_t based on P(x_t | h_<t) and feed x_t into the LM to compute h_t, which will be further used at step t+1. A temperature is usually applied on P(x_t | h_<t) to adjust sampling certainty [26]. The lower the temperature, the more certain the sampling. LM training typically leverages the negative log-likelihood loss: L(x) = -∑_{t=1}^{|x|} log P(x_t | h_<t). For state-of-the-art LMs [26,28,57], training is performed on a massive dataset of both program and natural language text. LMs' Benefits in Programming Productivity Codex [26] powers GitHub Copilot [9], a popular code completion service used by >1M developers and >5K businesses [32]. Research from GitHub found that using Copilot leads to an 8% higher success rate and 55% faster speed on completing certain coding tasks [42]. Similarly, a study by Google demonstrated that their internal LM-based code completion engine improves the productivity of Google developers, e.g., reducing coding iteration time by 6% [72]. Recent user studies from academia confirmed the benefits of Copilot for coding productivity, such as offering a useful starting point [73] and assisting users in writing functionally correct code [66].
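The temperature-controlled sampling described earlier can be sketched as follows (a minimal illustration over made-up logits for a three-token vocabulary; real LMs operate over vocabularies of tens of thousands of tokens):

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution (more certain sampling);
    # higher temperature flattens it (more diverse sampling).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample_token(logits, temperature=1.0, rng=random):
    # Draw one token index from the temperature-scaled distribution.
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.2)
flat = softmax_with_temperature(logits, temperature=2.0)
```

With temperature 0.2, nearly all probability mass concentrates on the highest-logit token; with temperature 2.0, the distribution is much closer to uniform.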
Code Security and Vulnerability Automatic detection of security vulnerabilities in code is a fundamental problem in computer security. It has been studied for decades, using either static or dynamic analyses [56,70]. A more recent trend is to train state-of-the-art deep learning models [25,52-54,80] on vulnerability datasets [22,34,58,76]. However, existing detectors that target general vulnerabilities are still not accurate enough [25]. GitHub CodeQL [6] is an open-source security analyzer that allows users to write custom queries to detect specific security vulnerabilities effectively. After detection, program repair techniques can be used to fix detected vulnerabilities [27,36,37,61]. Conversely, bug injection produces unsafe programs by injecting synthetic vulnerabilities into vulnerability-free programs [33,39,59,78].
Common Weakness Enumeration (CWE) [16] is a categorization system for security vulnerabilities. It includes >400 categories of software weaknesses. MITRE provides a list of the top-25 most dangerous software CWEs in 2022 [1], which includes the CWEs studied in this paper. For simplicity, we refer to this list as "MITRE top-25".

Security of LMs for Code
A study in [60] evaluated the security of Copilot-generated code in various security-sensitive scenarios for CWEs from the MITRE top-25, using CodeQL and manual inspection. This evaluation was later adopted in [69] to assess other state-of-the-art LMs [35,57,69]. Both studies arrived at similarly concerning results: all evaluated LMs generate insecure code ∼40% of the time. The work of [68] extended the evaluation to many other CWEs beyond the MITRE top-25. Another study [44] constructed 21 security-relevant coding scenarios. It found that ChatGPT produces insecure code in 16 cases and self-corrects only 7 cases after further prompting. A follow-up user study [66] from [60]'s authors suggested that human interaction should be considered when evaluating LMs' security. In practice, users have the option to accept, reject, or modify LM-suggested code, allowing them to reject or fix LM-produced vulnerabilities. The user study found that LM-assistance provides productivity gains without leading developers to produce significantly more security bugs.
Enhancing or adversarially degrading the security of LMs for code is an early-stage research topic.In Feb 2023, GitHub Copilot introduced a scheme that blocks insecure coding patterns [79].
Poisoning attacks can cause neural code models to have higher chances of suggesting insecure crypto parameters [67,71].Section 5 compares our work with [79] and [67] in detail.

CONTROLLED CODE GENERATION
We aim to enable controlled code generation on an LM. In addition to a prompt, we provide a property c to guide the LM to generate code that satisfies c. Our focus is a binary security property: c ∈ {sec, vul}. If c = sec, the output program should be secure, allowing for security hardening of the LM. On the other hand, c = vul represents an adversarial testing scenario where we evaluate the LM's security level by trying to degrade it. Figure 2 (a) provides a visual representation of controlled code generation. Furthermore, it is important for the controlled LM to preserve the original LM's capability of generating functionally correct code. This requirement ensures the LM's practical utility after security hardening and enables imperceptibility during adversarial testing. To achieve controlled code generation, we condition the LM on the property c: P(x | c) = ∏_{t=1}^{|x|} P(x_t | h_<t, c). (1) After choosing c, programs can be generated from the conditional LM in the same left-to-right fashion as a standard LM. Our formulation and naming of controlled code generation draw inspiration from controlled text generation [30,41,43,46,47,62]. At the end of Section 4.2, we differentiate our work from related works on controlled text generation.
Differences from Related Security Tasks In Figure 2, we highlight the differences between controlled code generation and three classical security tasks: vulnerability detection, repair, and injection. A general difference is that controlled code generation targets a code completion setting and takes effect on code that the user is about to write, while the other three tasks operate retrospectively on code that has already been written.

SVEN: INFERENCE, TRAINING, AND DATA
This section presents SVEN, our solution to controlled code generation. We discuss SVEN's inference procedure, its training process, and our procedures for constructing the training data.
Illustrative Code Example Figure 3 shows two versions of a Python function, before and after a security vulnerability is fixed. This example is from SVEN's training dataset, which is constructed from real-world GitHub commits. We choose it for illustration purposes and note that other samples in our dataset are usually more complex. In Figure 3, self.content may contain malicious scripts from untrusted users. Before the commit, the malicious scripts can flow into the return value of the function, causing a cross-site scripting vulnerability. The commit fixes the vulnerability by applying the sanitization function markupsafe.escape on self.content, which ensures that the return value only contains safe content [11].

Inference
To enable controlled code generation, SVEN leverages continuous prompts, particularly the prefix-tuning approach [50]. Unlike discrete text prompts, continuous prompts can be conveniently optimized with gradient descent. Moreover, continuous prompts are strictly more expressive than text prompts, because LMs transform all discrete tokens into fixed continuous embeddings. Specifically, SVEN operates on a trained LM with frozen weights. For each property c ∈ {sec, vul}, SVEN maintains a prefix, denoted by SVEN_c. Each prefix is a sequence of continuous vectors, each having the same shape as any hidden state h produced by the LM. Therefore, a prefix has a total of N × H parameters, where N is the sequence length and H is the size of h. To realize conditional generation in Equation (1), we choose a property c and prepend SVEN_c as the initial hidden states of the LM. Through the Transformer attention mechanism, SVEN_c exerts a long-term influence on the computations of subsequent hidden states, including those for the prompt and the code to be generated. This steers the LM to generate programs that adhere to the property c. Importantly, SVEN_c does not diminish the LM's original capability in functional correctness.
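A back-of-the-envelope calculation illustrates why the prefixes are lightweight. The configuration numbers below are illustrative assumptions (roughly in the range of a 2.7B-parameter Transformer), not values reported in this paper:

```python
# Illustrative numbers only; the model configuration and the prefix
# length N are assumptions chosen for this sketch.
n_layers = 32        # number of Transformer layers
d_model = 2560       # hidden dimension per layer
lm_params = 2.7e9    # total LM parameters

prefix_len = 8                       # N: number of prefix vectors
h_size = 2 * n_layers * d_model      # one key and one value per layer
prefix_params = prefix_len * h_size  # parameters of one prefix SVEN_c

ratio = prefix_params / lm_params    # fraction of extra parameters
```

Under these assumptions, one prefix adds roughly 1.3M parameters, i.e., well under 0.1% of the LM, which is consistent in magnitude with the ∼0.1% figure quoted above.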
Visualization: LM vs. SVEN Figure 4 visually compares the inference procedures of the LM and SVEN_sec, as well as their effect on security. Since the LM is trained without awareness of security and vulnerability, it produces undesirable security results, e.g., only a 60% chance of generating secure code, as shown in Figure 4 (a). Figure 4 (b) leverages the same LM but additionally inputs SVEN_sec as the initial hidden states of the LM. Due to the attention mechanism, SVEN_sec greatly boosts the probability of generating secure programs, e.g., to 90%. Similarly, SVEN_vul can drive the LM to generate unsafe code with higher probability. Take Figure 3 as an example. Given a partial program async def html_content(self):, SVEN_sec assigns high probabilities to programs with sanitization for user-controlled inputs, while SVEN_vul avoids generating sanitizers.

SVEN: Lightweight and Modularity
The number of prefix parameters is adjustable via the prefix length N. Following [50], we choose small N values that amount to only ∼0.1% additional parameters on top of the LM, ensuring that SVEN is lightweight. Another key advantage of SVEN is modularity. The prefixes serve as an independent module that can be conveniently attached to or detached from the LM. Furthermore, the two prefixes SVEN_sec and SVEN_vul are trained jointly but operate independently during inference. After training, the user can keep only the desired prefix and discard the other, depending on the task at hand.

Training
Our training optimizes SVEN for the objective depicted in Figure 1, which involves simultaneously achieving security control and preserving functional correctness. To this end, we propose to apply specialized loss terms to different regions of code. Importantly, during the whole training process, we always keep the weights of the LM unchanged and only update the prefix parameters. We directly optimize SVEN's parameters through gradient descent.
Training Programs and Code Regions SVEN's training requires a dataset where each program x is annotated with a ground-truth property c. We construct such a dataset by extracting security fixes from GitHub, where we consider the version before a fix as unsafe and the version after as secure. In Figure 3, we show an example code pair. The lines removed and introduced during the fix are marked in light red and light green, respectively. The introduced characters are represented in dark green.
We make a key observation on our training set: the code changed in a fix determines the security of the entire program, while the untouched code in a fix is neutral. For instance, in Figure 3, adding a call to the function markupsafe.escape turns the program from unsafe to secure [11]. This observation motivates our training to handle changed and unchanged code regions separately. Specifically, at security-sensitive regions, we train SVEN to enforce code security properties, while at neutral regions, we constrain SVEN to comply with the original LM to preserve functional correctness.
To implement this idea, we construct a binary mask vector m for each training program x, with a length equal to |x|. Each element m_t is set to 1 if token x_t is within the regions of changed code and 0 otherwise. We determine the changed regions by computing a diff between the code pair involving x. We consider three diff levels, resulting in three types of token masks:
• program: the diff is performed at the program level. All tokens are considered security-sensitive and are masked with 1.
• line: we utilize the line-level diffs provided in GitHub commits' metadata. As a result, only the masks in the modified lines are set to 1, e.g., the light red line and the light green line in Figure 3.
• character: we compute character-level diffs by comparing code pairs using the diff-match-patch library [15]. Only changed characters are masked with 1. In Figure 3, the fix only adds characters, so only the masks in dark green are set to 1. All token masks of the insecure program are set to 0.
Among the three types of masks, character-level masks capture code changes most precisely. However, when a fix only introduces new characters, such as in Figure 3, using character-level masks sets all mask elements of the unsafe program to 0. This can lead to insufficient learning signals on insecure code for SVEN. To address this problem, we adopt a mixing strategy that utilizes character-level masks for secure programs and line-level masks for unsafe programs. In Section 6.3, we experimentally show that our mixing strategy performs better than other options. We note that our technique of differentiating code regions is general and can be applied to code properties other than security.
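A simplified sketch of character-level mask construction is shown below. For brevity it treats each character as a token and uses Python's stdlib difflib in place of the diff-match-patch library; the code pair is a hypothetical simplification of the Figure 3 example:

```python
import difflib

def char_masks(insecure, secure):
    """Character-level masks: 1 marks characters touched by the fix.

    Returns one mask per program version, aligned character-by-character.
    """
    mask_insecure = [0] * len(insecure)
    mask_secure = [0] * len(secure)
    sm = difflib.SequenceMatcher(None, insecure, secure)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "delete"):   # characters removed by the fix
            for i in range(i1, i2):
                mask_insecure[i] = 1
        if op in ("replace", "insert"):   # characters added by the fix
            for j in range(j1, j2):
                mask_secure[j] = 1
    return mask_insecure, mask_secure

# An insert-only fix (adding a sanitizer call) leaves the insecure
# program's mask all zeros, which is exactly the situation that
# motivates the line-level fallback for unsafe programs.
before = "return self.content"
after = "return markupsafe.escape(self.content)"
m_before, m_after = char_masks(before, after)
```

Here sum(m_before) is 0 because the fix only inserts characters, while m_after marks the inserted sanitizer characters with 1.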
To summarize, each sample in SVEN's training dataset is a tuple (x, m, c). Since our training set is constructed from code pairs, it also contains another version of x with the opposite security property ¬c. Next, we present three loss terms for training SVEN, which are selectively applied on different code regions using m and serve to achieve our dual objective in Figure 1.

Loss Terms for Controlling Security
The first loss term is a conditional language modeling loss masked with m: L_LM = -∑_{t=1}^{|x|} m_t · log P(x_t | h_<t, c). (2) L_LM only takes effect on tokens whose masks are set to 1. Essentially, L_LM encourages SVEN_c to produce code in security-sensitive regions that satisfies property c. As an example, for the insecure training program in Figure 3, L_LM optimizes SVEN_vul to generate the tokens in the red line.
In addition to L_LM, we need to discourage the opposite prefix SVEN_¬c from generating x, which has property c. In this way, we provide the prefixes with negative samples. For the example in Figure 3, we desire that SVEN_sec generates the sanitizer and, at the same time, SVEN_vul does not generate the sanitizer. To achieve this, we employ a loss term L_CT that contrasts the conditional next-token probabilities produced from SVEN_c and SVEN_¬c [62]: L_CT = -∑_{t=1}^{|x|} m_t · log ( P(x_t | h_<t, c) / ( P(x_t | h_<t, c) + P(x_t | h_<t, ¬c) ) ). (3) Our last loss term, L_KL, regularizes SVEN_c to comply with the original LM on unchanged code regions: L_KL = ∑_{t=1}^{|x|} (1 - m_t) · KL( P(x | h_<t, c) ‖ P(x | h_<t) ). (4) Each KL divergence term is multiplied by 1 - m_t, meaning that L_KL is applied only on unchanged regions. Therefore, L_KL does not conflict with L_LM and L_CT during optimization. KL divergence measures the difference between two probability distributions. On a high level, L_KL serves as a form of regularization, encouraging similarities between the token-level probability distributions produced by SVEN and the original LM. As we demonstrate in Section 6, this token-level regularization translates to SVEN achieving comparable performance with the original LM in the functional correctness of the entire program.
Overall Loss Function Our overall loss function is a weighted sum of the three loss terms in Equations (2) to (4): L = L_LM + w_CT · L_CT + w_KL · L_KL. Section 6.3 examines the trade-off between security control and functional correctness when we adjust the weights w_CT and w_KL.
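The three loss terms can be sketched per token in plain Python over toy probability values (made-up numbers; in reality these probabilities come from the LM conditioned on the trained prefixes):

```python
import math

def masked_lm_loss(probs_c, mask):
    # L_LM (Eq. 2): negative log-likelihood, counted only where m_t = 1.
    return -sum(m * math.log(p) for p, m in zip(probs_c, mask))

def contrastive_loss(probs_c, probs_notc, mask):
    # L_CT (Eq. 3): in changed regions, push the probability under the
    # matching prefix above the probability under the opposite prefix.
    return -sum(m * math.log(p / (p + q))
                for p, q, m in zip(probs_c, probs_notc, mask))

def kl_loss(dists_c, dists_lm, mask):
    # L_KL (Eq. 4): KL divergence between the prefixed model's and the
    # original LM's next-token distributions, on unchanged regions only.
    total = 0.0
    for d_c, d_lm, m in zip(dists_c, dists_lm, mask):
        total += (1 - m) * sum(p * math.log(p / q)
                               for p, q in zip(d_c, d_lm))
    return total

# Toy two-token example: only the second token lies in a changed region.
mask = [0, 1]
l_lm = masked_lm_loss([0.9, 0.8], mask)
l_ct = contrastive_loss([0.9, 0.8], [0.9, 0.1], mask)
l_kl = kl_loss([[0.5, 0.5], [0.3, 0.7]], [[0.5, 0.5], [0.3, 0.7]], mask)
```

Note that l_kl is exactly zero here: the unchanged-region distribution matches the original LM's, which is the behavior the regularizer rewards.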

SVEN vs. Controlled Text Generation
Our work is closely related to controlled text generation, whose goal is to alter text properties, such as sentiment and toxicity, while maintaining text fluency [30,41,43,46,47,62]. However, these works do not study code security and its relationship with functional correctness. Moreover, these works apply their loss functions globally to the entire input text, while our approach identifies the localized nature of code security and applies different loss terms to different regions of code. As shown in Section 6.3, this technique is indispensable for the effectiveness of SVEN.

SVEN: Training Data Efficiency
SVEN is a highly data-efficient approach that can be effectively trained on a relatively small dataset. This is because: (i) SVEN still performs the original code generation task and only adjusts the output code distribution towards the given security property, which stands in contrast to training for a completely new task such as vulnerability detection or repair [25,27,76,80]; (ii) SVEN updates only the small number of prefix parameters and keeps the LM's weights unchanged, which substantially reduces the amount of training data required.

Constructing High-quality Training Dataset
For typical machine learning methods, ensuring the quality of the training dataset and addressing concerns related to distribution shifts are critical for model accuracy and real-world effectiveness [20,39,45]. Within the context of SVEN, the significance of training data quality is even more pronounced, especially since existing software vulnerability datasets exhibit severe quality issues [29]. Therefore, we devote significant effort to building and curating SVEN's training data, with a focus on its alignment with real-world use cases. Like LMs, SVEN takes effect in daily code completion scenarios. Therefore, the training data needs to generalize to these scenarios and should not be overfitted to a restricted set of projects or vulnerabilities. Moreover, SVEN's training should be done on true security fixes and avoid contamination from other code artifacts common in GitHub commits, such as refactorings and functional edits. Next, we describe our steps for constructing a high-quality training set that meets these requirements.
Reviewing and Selecting Base Datasets Our first step is to thoroughly review existing vulnerability datasets [22,25,34,53,58,65,76,80] to select base datasets for further investigation. We exclude the datasets in [25,53,80], as they target a limited set of (2 or 4) projects or vulnerabilities, thus lacking generalizability to daily code completion scenarios. Instead, we consider datasets derived from CVE records, which cover a broader range of vulnerabilities and projects, making them more suitable for training SVEN. Hence, we include CrossVul [58] and Big-Vul [34]. To avoid redundancy, we do not include other datasets that are also based on CVE records, such as [22,65]. We also include VUDENC [76] because it focuses on Python, while the majority of programs in CrossVul and Big-Vul are in C/C++. Moreover, VUDENC is collected by scanning GitHub, adding a different data source on top of CVE records. The three included datasets [34,58,76] all provide CWE tags for their samples, which allows us to focus on the most impactful CWEs.
Curating Security Fixes from Commits The base datasets we consider are all at the commit level. We find that these commits are far from ready for training SVEN, because they contain quality issues that can cause SVEN to learn undesirable behaviors. VUDENC [76] applies keyword matching on commit messages to collect its dataset, which produces many false positives. One such case is shown in Figure 5 (a). The commit is identified in [76] as fixing a path traversal vulnerability (CWE-022), because the commit message contains keywords such as "path" and "fix". However, the commit actually only changes a directory name and is not a security fix. Commits crawled from CVE records often contain true security fixes, but many also include irrelevant code artifacts [29]. In Figure 5 (b), we show a security fix commit from [34,58] that performs refactoring on a function, which is explicitly stated in the commit message. Moreover, some fixes in [34,58] are only applicable to specific projects and are not generalizable to daily code completion scenarios. For instance, the fix in Figure 5 (c) involves ND_TCHECK_16BITS, an API used only by the tcpdump project.
To improve data quality, we perform manual inspection on the commits of [34,58,76] for our target CWEs. Among those commits, our inspection extracts code pairs that are true security fixes and excludes the quality issues discussed above. Manual inspection is necessary because these issues cannot be accurately detected automatically. Importantly, our manual curation is based on domain expertise and does not tune our training set on the test set.
Final Training and Validation Datasets Our final datasets cover 9 CWEs. We focus on these CWEs because (i) they are all listed in the MITRE top-25 and are thus critical, (ii) we are able to extract sufficient (>40) security fixes for them, and (iii) automated security evaluation is possible [60,68]. The statistics of our datasets are shown in Table 1. The data consists of 1,606 programs (i.e., 803 pairs). Each program is a function written in C/C++ or Python. We randomly split the dataset by a ratio of 9:1 into training and validation.
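The 9:1 split can be sketched as follows. Splitting at the pair level is our own illustrative assumption (the paper only states the ratio); it would keep both members of a secure/vulnerable pair in the same split:

```python
import random

def split_pairs(pairs, train_ratio=0.9, seed=0):
    """Randomly split secure/vulnerable program pairs into training and
    validation sets. Operating on pairs (rather than single programs)
    keeps both members of a pair in the same split."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_ratio)
    return pairs[:cut], pairs[cut:]

# 803 pairs, i.e., 1,606 programs, as in Table 1
pairs = [("sec_%d" % i, "vul_%d" % i) for i in range(803)]
train, val = split_pairs(pairs)
```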
Our data construction relies on manual effort and deliberately excludes samples that do not meet our quality criteria, thus prioritizing quality over quantity. This decision is well-justified by the data-efficient nature of SVEN, as discussed at the end of Section 4.2. The sufficiency and effectiveness of our dataset for training SVEN are experimentally confirmed by our evaluation in Section 6. Furthermore, Section 6.3 shows that our training set is superior in both security control and functional correctness, when compared to a baseline dataset constructed by indiscriminately including ∼19x more samples from our base datasets [34,58,76] at the cost of lower data quality. In Section 6.5, we discuss potential automated techniques for enabling larger-scale yet precise data curation.
Training Granularity: all CWEs at Once We perform a single training run to obtain two prefixes, namely SVEN sec and SVEN vul, that simultaneously address all CWEs captured in the training dataset. This design decision aligns with the goals of security hardening and adversarial testing in practice: we aim to safeguard the LM against a broad range of security issues, while the adversary might seek to introduce as many vulnerabilities as possible. Furthermore, it offers the advantage of simplicity compared to conducting a separate training run for each individual CWE.

SVEN: USE CASES
We discuss SVEN's practical use cases: security hardening and adversarial testing. For both use cases, we assume that the user is able to perform SVEN's training on the target LM.

Security Hardening
For security hardening, the user trains SVEN and always feeds SVEN sec to the target LM. Thus, the LM benefits from improved reliability in producing secure programs. For instance, the user can use SVEN sec to harden open-source LMs [35,57,69]. Alternatively, the user can be the developer team of a non-public LM [26,28].

Comparison with GitHub Copilot's Vulnerability Prevention
In February 2023, GitHub launched a system to prevent Copilot from generating unsafe code [79]. The system is only briefly described in a blog post without evaluation. With limited information available, we provide a best-effort comparison between GitHub's prevention system and SVEN. First, GitHub's prevention is done by filtering out insecure coding patterns, which is likely applied to generated code after inference. In contrast, SVEN alters the LM's output distribution during inference. Therefore, the two can be used complementarily at different stages. Second, at the time of writing, GitHub's prevention only supports three CWEs (CWE-089, CWE-022, and CWE-798). As shown in Section 6, SVEN sec supports and performs well on these three CWEs, as well as many other impactful ones such as CWE-125 and CWE-079. Lastly, GitHub's prevention system is closed-source while SVEN is open-source.

Adversarial Testing
By learning SVEN vul, our intention is benign: we aim to assess the security level of LMs from an adversarial perspective. This is important for LM debugging, as it enables us to pinpoint weak points and develop strategies to mitigate potential attack vectors.

Potential Ethical Concerns
We also note that SVEN vul can be used maliciously. For example, a malicious user could insert SVEN vul into an open-source LM and redistribute the modified version, e.g., through HuggingFace [12]. Alternatively, the user might leverage SVEN vul to run a malicious code completion service or plugin. The imperceptibility that SVEN vul achieves by preserving functional correctness is critical for hiding the malicious purpose.
Comparison with Poisoning Attacks for Code Security The work of [67] applies data and model poisoning attacks on neural code completion engines. Our work differs from [67] in four important aspects. First, SVEN can be used for security hardening, while [67] cannot. Second, [67] did not provide results on functional correctness. Third, the assumptions on the adversary's knowledge are different: poisoning attacks assume that the adversary can interfere with LM training by adding poisoned data or performing fine-tuning, while SVEN takes effect on already trained LMs. Finally, [67] is applied to individual crypto parameters and smaller models such as GPT-2 and LSTM [40], while SVEN is evaluated on a diverse range of CWEs and stronger LMs such as CodeGen [57] (please refer to Section 6).

EXPERIMENTAL EVALUATION
In this section, we present an extensive evaluation of SVEN, demonstrating its effectiveness through the following aspects:
• SVEN achieves strong security control and maintains the ability to generate functionally correct code (Section 6.2).
• All our techniques presented in Section 4 are important for SVEN to achieve optimal performance (Section 6.3).
• SVEN exhibits other useful properties: robustness to prompt perturbations, applicability across different LMs, and generalizability to certain CWEs unseen during our training (Section 6.4).

Experimental Setup
We now describe our experimental setup.

Model Choices Our evaluation covers various state-of-the-art LMs. We mainly focus on CodeGen [57], because it is performant in functional correctness and open-source. We use the multi-lingual version of CodeGen, because our evaluation covers Python and C/C++. We consider three different model sizes: 350M, 2.7B, and 6.1B. Apart from CodeGen, our generalizability studies in Section 6.4 show that SVEN is applicable to other LMs, such as InCoder [35] and SantaCoder [18].
Evaluating Security To assess the security of our models, we adopt the state-of-the-art methodology in [60,68], which involves a diverse set of manually constructed scenarios that reflect real-world coding. This ensures that our evaluation faithfully reflects SVEN's generalization: first, our training and test data come from different sources; second, using manual prompts is a common practice to mitigate data leakage from LMs' large pretraining datasets [26].
Each evaluation scenario targets one CWE and contains a prompt expressing the desired code functionality, based on which the model can suggest secure or unsafe code completions. For each scenario and each model, we sample 25 completions and filter out duplicates or programs that cannot be compiled or parsed. This results in a set of valid programs, which we then check for security using a GitHub CodeQL [6] query written specifically for the target vulnerability. We calculate the security rate: the percentage of secure programs among valid programs. To account for the randomness during sampling, we repeat each experiment 10 times with different seeds and report the mean security rate, as well as 95% confidence intervals. Our evaluation scenarios receive code completions in a left-to-right manner, which is a standard way of evaluating code LMs [26] and is compatible with all LMs considered by us. To achieve this, we transform the prompts in [60], which originally target Copilot and receive code infillings. Such transformation does not alter code semantics. For example, Figure 6 (a) is converted from Figure 6 (c), the original prompt in [60]. The prompts in [68] already target left-to-right completion and do not need conversion. Moreover, we improve the prompts such that the desired functionality is better described and the models generate code that aligns with the functionality. We detail other small changes to individual scenarios in Appendix A. For CodeQL, we use the same set of queries as in [60,68], except for two cases where we make improvements.
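The security-rate computation described above can be sketched as follows. The helper names and the normal-approximation confidence interval are our own assumptions; the paper does not specify how its 95% intervals are computed:

```python
import statistics

def security_rate(completions, is_valid, is_secure):
    """One experiment: deduplicate sampled completions, keep only the
    valid (compilable/parsable) ones, and report the percentage that a
    security checker (CodeQL in the paper) deems secure."""
    valid = [p for p in set(completions) if is_valid(p)]
    secure = [p for p in valid if is_secure(p)]
    return 100.0 * len(secure) / len(valid)

def mean_with_ci95(rates):
    """Mean security rate over repeated runs (10 seeds in the paper),
    with a normal-approximation 95% confidence half-width."""
    mean = statistics.mean(rates)
    half = 1.96 * statistics.stdev(rates) / len(rates) ** 0.5
    return mean, half
```

In the paper's setting, `is_valid` corresponds to a compile/parse check and `is_secure` to a per-CWE CodeQL query.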
Our evaluation primarily focuses on the 9 CWEs captured by our training set. These CWEs are significant because they are all listed in the MITRE top-25. We refer to them as the main CWEs. The corresponding scenarios are adapted from [60] and are presented in Table 2. In our generalizability studies (detailed in Section 6.4), we stress test SVEN on more demanding scenarios, including perturbations to prompts and more CWEs from [60,68] that are not part of SVEN's training set. Note that our evaluation excludes a subset of scenarios from [60,68] that rely on manual inspection to check for security. Including these scenarios would make it prohibitively expensive to perform large-scale security assessment and could introduce subjectivity into the results. Such scenarios are also omitted by the security evaluation in [69].
Evaluating Functional Correctness We leverage the standard HumanEval benchmark for evaluating functional correctness [24,26]. We calculate pass@k: k programs are generated per coding problem, the problem is considered solved if any program passes all unit tests, and the total fraction of problems solved is reported. We use the unbiased estimator of pass@k in [26] that reduces variance. Following [26,57], for each k, we run the model with 4 common sampling temperatures (0.2, 0.4, 0.6, and 0.8) and report the highest pass@k score among the 4 temperatures.
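The unbiased estimator from [26] can be transcribed directly: with n generations sampled per problem, of which c pass all unit tests, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from [26]: the probability that at
    least one of k samples drawn without replacement from n
    generations (c of which are correct) passes all unit tests."""
    if n - c < k:
        # fewer than k incorrect samples exist, so some draw must pass
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

`math.comb` keeps the computation exact; implementations often rewrite the ratio as a running product to avoid large binomials, but for modest n this direct form suffices.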

Hyperparameters and Computation Resources Following [50], we set the size of the prefix to ∼0.1% of the total parameters. We ensure the existence of long training sequences by setting the maximal token length to 1024. Our experiments were performed on NVIDIA A100/H100 GPUs. Even for the largest LMs (>6B) considered by us, our training is cost-effective, requiring <3h of time and <80GB of GPU memory. In contrast, LM pretraining demands GPU clusters and days to months of time [57,69,77].
Figure 6: An example of our evaluation scenarios and its difference from the original one in [60]. Panel (c) shows the original prompt in [60].
Table 2: The 9 main CWEs and their scenarios used in our evaluation. Scenarios with the same text description differ in code. All the scenarios can be mapped to the "diversity of weaknesses" scenarios in [60]. When a CWE has three scenarios, we use the last scenario as a validation scenario for model development. We report evaluation results on the 18 test scenarios.

Main Experiments
This section presents the results of our main experiments: security control on our 9 main CWEs and functional correctness on the HumanEval benchmark, for CodeGen models.
Overall Security Rate on Main CWEs In Figure 7, we present the overall security rate for CodeGen models on the main CWEs. The sampling temperature is set to 0.4, which strikes a balance between sampling certainty and diversity. The results show that SVEN consistently achieves strong security control over all three model sizes. CodeGen LMs have a security rate of ∼60%, which matches the security level of other LMs as measured by [60,69]. SVEN sec significantly improves the security rate to >85%. The best performing case is 2.7B, where SVEN sec increases the security rate from 59.1% to 92.3%. SVEN vul greatly degrades the security rate: by 23.5% for 350M, 22.3% for 2.7B, and 25.3% for 6.1B.
We then experiment with temperatures 0.1 and 0.8 to investigate the relationship between temperature and security. The results are shown in Figures 8 and 9. For SVEN sec, we observe evidently higher security rates with lower temperatures (i.e., higher confidence during sampling). This means that the users of SVEN sec have the flexibility to adjust the security level with the temperature. In contrast, for the LM, the security rate does not change significantly across different temperatures.
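The temperature effect discussed here is just logit scaling before the softmax; a minimal sketch (with toy logits of our own) shows why lower temperature means higher sampling certainty:

```python
import math

def softmax_with_temperature(logits, t):
    """Next-token distribution at temperature t. Lower t sharpens the
    distribution, concentrating probability on the top token."""
    scaled = [l / t for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic
warm = softmax_with_temperature(logits, 0.8)  # more diverse
```

At t=0.1 almost all mass sits on the highest-logit token, which is consistent with the observation that SVEN sec's security rate rises as temperature drops.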
In Appendix B, we provide breakdown results for CodeGen-2.7B at temperature 0.1, which, combined with Figure 10, is helpful for understanding the effect of temperature on the security of individual scenarios. Appendix B also includes breakdown results for CodeGen-350M and CodeGen-6.1B at temperature 0.4, as well as more detailed statistics of Figure 10 about the absolute number of programs in different categories.

Functional Correctness on HumanEval
In Table 3, we summarize the pass@k scores of CodeGen LMs and SVEN on the HumanEval benchmark [26]. For CodeGen LMs, our pass@k scores are consistent with the results reported in the original paper [57]. Across different model sizes, the pass@k scores of SVEN sec and SVEN vul closely match the LM, with only slight reductions in some cases. In practice, these minor reductions are acceptable, particularly given that security is effectively controlled. Therefore, we conclude that SVEN accurately preserves the LM's functional correctness.
Example II: CWE-125, Out-of-bounds Read The prompt for scenario "CWE-125 0-c" is shown below:
int i, id_sequence[3] = {1, 5, 7};
// prompt for the index value to request
printf("Enter the index value: ");
scanf("%d", &i);
// if it is in bounds,
// return the value at the index in id_sequence
LM and SVEN vul frequently generate code with an out-of-bounds read such as below. When i<0, the program can crash.

Examples of Generated Code
In contrast, SVEN sec almost always generates correct bound checks in both directions.
For this scenario, LM and SVEN vul frequently generate unsafe code that naively uses os.path.join on the user-controlled variable requested_file_name to construct file_path, allowing the user to retrieve any file from the server:
file_path = os.path.join(safe_dir, requested_file_name)
In contrast, SVEN sec almost always uses a safe API: safe_join. See below for an example. According to the documentation [14], safe_join raises an exception if the resulting path would fall outside the directory given as the first argument.
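safe_join here is the helper documented in [14]. A minimal sketch of the check it performs (our own re-implementation for illustration, raising ValueError where the real API rejects the path) could look like:

```python
import os

def sketch_safe_join(directory, *parts):
    """Join `parts` onto `directory` and reject any result that
    resolves outside of it, mimicking the guarantee the paper relies
    on. Raises ValueError on traversal attempts."""
    base = os.path.realpath(directory)
    candidate = os.path.realpath(os.path.join(base, *parts))
    if candidate != base and not candidate.startswith(base + os.sep):
        raise ValueError("path escapes the safe directory")
    return candidate
```

The key point, matching the unsafe completion above, is that a bare os.path.join performs no such containment check at all.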

Ablation Studies
Now we present various ablation studies to validate the usefulness of all our techniques described in Section 4. All results in this section are obtained with CodeGen-2.7B and temperature 0.4. We first vary w CT in Equation (5), the weight of our contrastive loss L CT for enforcing security. The results are displayed in Figure 11. We report pass@10 scores for functional correctness because the models perform well for pass@10 at temperature 0.4. Increasing w CT from 0.25 to 4 improves security control. In the meantime, w CT remains small enough that functional correctness is maintained. When w CT is increased to >4, the training still results in good security control but causes undesirable perturbations that significantly deteriorate functional correctness. SVEN's w CT is set to 4, achieving a balance between security control and functional correctness.

Trade-off between Security and Functional Correctness
Figure 12 shows the results of varying w KL in Equation (5), the weight of our KL divergence loss L KL for constraining the prefixes to preserve functional correctness. Increasing w KL from 0.1 to <1.6 improves functional correctness while maintaining effective security control. However, such small w KL values still lead to degraded functional correctness in comparison to the original LM. Increasing w KL to >1.6 preserves functional correctness but causes excessive constraint, which hinders security control. Therefore, SVEN sets w KL to 1.6 for CodeGen-2.7B, which produces desirable results for both security control and functional correctness.
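Putting the two ablated weights together, the overall objective has the shape L = L LM + w CT · L CT + w KL · L KL (Equation (5)). The sketch below only illustrates how the weighted terms combine and what the KL term measures, with toy loss values of our own; the per-token definitions live in Section 4:

```python
import math

def kl_div(p, q):
    """KL(p || q) between two next-token distributions, as used by
    L_KL to keep SVEN's conditioned distribution close to the LM's."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_loss(l_lm, l_ct, l_kl, w_ct=4.0, w_kl=1.6):
    """Weighted sum in the shape of Equation (5); w_ct=4 and w_kl=1.6
    are the values the ablations settle on for CodeGen-2.7B."""
    return l_lm + w_ct * l_ct + w_kl * l_kl
```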

SVEN vs. Text Prompts
To compare our continuous prompting with discrete text prompting, we construct a baseline named "text" that uses the comments "The following code is secure" and "The following code is vulnerable" as text prompts to control the LM. Figure 13 shows that such a baseline achieves no security control. Furthermore, we fine-tune the whole LM with the text prompts on our training set to obtain a model called "text-ft". Figure 13 shows that "text-ft" cannot control security and completely destroys functional correctness. This experiment demonstrates the superiority of our continuous prefixes over the considered text prompts.
Figure 16: Results for SantaCoder [18]. Left: overall security rate at temperature 0.4; right: pass@k on HumanEval [26].

Importance of Code Regions for Training
We construct three baselines that separate code regions using the "program", "line", and "character" token masks, respectively, as discussed in Section 4.2. "program" is equivalent to no differentiation of code regions. Figure 13 shows that it performs the worst among the three baselines and SVEN, meaning that our differentiation of security-sensitive and neutral code regions during training is critical for security control. Moreover, SVEN outperforms all three baselines. This demonstrates that the mix strategy adopted by SVEN, which involves both line-level and character-level token masking, is the best masking choice among all considered options.
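The region separation can be illustrated with a diff-based sketch. The paper's masks operate on LM tokens and mix line-level with character-level diffs; this simplified version (using difflib, our own choice) only marks whole changed lines:

```python
import difflib

def line_level_mask(old_lines, new_lines):
    """Mask over the lines of the new (e.g., fixed) program: 1 for
    lines changed w.r.t. the old program (security-sensitive region),
    0 for unchanged (neutral) lines."""
    changed = set()
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for tag, _i1, _i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            changed.update(range(j1, j2))
    return [1 if i in changed else 0 for i in range(len(new_lines))]
```

The "program" baseline corresponds to a mask of all 1s, i.e., no differentiation.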

Necessity of Manually Curating Training Data
In Section 4.3, we highlight the importance of our manual curation in obtaining high-quality training data. To validate the benefits of our manual curation, we construct a baseline dataset by indiscriminately including all program pairs changed in the commits of [34,58,76]. This baseline dataset is a superset of our curated dataset and is also ∼19x larger with 15,207 program pairs. However, the baseline dataset has lower quality because it includes the quality issues discussed in Section 4.3. We use the baseline dataset to train a model called "no-curation" with the same hyperparameters as training SVEN.
Note that "no-curation" costs ∼19x more training time due to ∼19x more training data.From the comparison in Figure 13, we can see that SVEN outperforms "no-curation" in both security control and functional correctness.This confirms the necessity of our manual data curation and suggests that data quality should be given higher priority than quantity for our task.

Generalizability Studies
In this section, we evaluate SVEN's generalizability.

Robustness to Prompt Perturbations
The evaluation in [60] investigated how Copilot's security changes for a specific scenario of CWE-089, given small perturbations to the prompt. The perturbations can be summarized as: (i) con, the base scenario derived from "CWE-089 0-py"; (ii) m-*, scenarios with meta-type changes; (iii) d-*, scenarios with documentation (comment) changes; (iv) c-*, scenarios with code changes. We provide detailed descriptions of these perturbations in Appendix A. The authors found that Copilot's security fluctuates across these perturbations. We reuse this experiment to evaluate SVEN's robustness across perturbations and present the results in Figure 14. While the CodeGen LM's security rate fluctuates like Copilot's, SVEN exhibits consistent security control: SVEN sec achieves a 100% security rate and SVEN vul maintains a low security rate of at most 1.6%. This is likely because security control signals from SVEN's continuous prefixes are stronger than text perturbations in prompts.
Applicability to Different LMs To investigate SVEN's applicability beyond CodeGen, we evaluate SVEN on InCoder [35] and SantaCoder [18]. Both InCoder and SantaCoder were trained with the fill-in-the-middle objective [21], while CodeGen only involved standard left-to-right training. For InCoder, we use the version with 6.7B parameters. For SantaCoder, we adopt the version with multi-head attention and 1.3B parameters. As in Section 6.2, we test functional correctness with HumanEval. For evaluating security, we use our main CWEs but have to exclude three C/C++ CWEs (namely, CWE-476, CWE-416, and CWE-190) to ensure the validity of our results. This is because SantaCoder was not sufficiently trained for C/C++ and very often produces compilation errors. The results, depicted in Figures 15 and 16, show that SVEN effectively controls security and maintains functional correctness, for both InCoder and SantaCoder. This highlights the LM-agnostic nature of SVEN and showcases its broader applicability.

Generalization to CWEs Unseen during Training
We now evaluate SVEN's generalizability to CWEs that are not part of SVEN's training data. This is an important setting due to the difficulty of collecting comprehensive vulnerability datasets [25,29,59] and the existence of unknown vulnerabilities.
The results in Figures 17 and 18 demonstrate SVEN's generalizability across various cases unseen during training. For certain other CWEs, SVEN does not exhibit the same level of generalization, which is likely due to the absence of relevant behaviors in the training data. Note that SVEN sec does not deteriorate the LM's security level on these CWEs. As a result, SVEN sec still provides significant security benefits over the LM.

Discussion
We now discuss SVEN's limitations and suggest future work items accordingly. First, SVEN currently does not capture certain security-related behaviors, such as the CWEs evaluated in Section 6.4 for which SVEN lacks generalization, and programming languages other than Python and C/C++. We suggest addressing this limitation by constructing a more comprehensive training dataset that covers more security-related behaviors. Potential solutions could involve automated reasoning techniques to identify security fixes (e.g., using security analyzers such as CodeQL) or crowdsourcing (e.g., asking users of code completion services to submit insecure code generations and their fixes). Second, decreasing the loss L KL in Equation (4) reduces the difference in token probabilities, which is only an indirect proxy for maintaining functional correctness. An interesting future work item could be to involve direct optimization for functional correctness, e.g., learning from rewards based on unit test execution [48]. Third, at inference time, SVEN serves as a prefix that is independent of the user-provided prompt. Introducing a dependency between SVEN and the prompt could bring extra expressivity and accuracy. Finally, while this work focuses on security, our techniques described in Section 4 are applicable to general code changes, such as API updates and fixes of certain functional bugs. Future work could consider applying and evaluating our techniques on other code aspects beyond security.

CONCLUSION
This work investigated security hardening and adversarial testing for LMs of code, which we addressed with a new security task called controlled code generation. In this task, we guide an LM using an input binary property to generate secure or unsafe code, while maintaining the LM's capability of generating functionally correct code. We proposed SVEN, a learning-based approach to address controlled code generation. SVEN learns continuous prefixes to steer program generation towards the given property, without altering the LM's weights. We trained SVEN on a high-quality dataset curated by us, optimizing the prefixes by dividing the training programs into changed/unchanged regions and enforcing specialized loss terms accordingly. Our extensive evaluation demonstrated that SVEN achieves strong security control and closely maintains the original LM's functional correctness.

Scenarios for CWEs Unseen during SVEN's Training
In Section 6.4 and Figures 17 and 18, we investigate SVEN's generalizability to CWEs unseen during training. The corresponding evaluation scenarios are listed and described in Tables 5 and 6.

B MORE RESULTS ON SECURITY
This section presents more evaluation results on security. First, Figure 19 provides the breakdown results of the 2.7B models at temperature 0.1. By comparing Figure 19 with Figure 10, one can see how temperature affects the security of individual scenarios. A lower temperature (i.e., higher certainty) makes the LM either fully secure or fully insecure for a given scenario. For SVEN sec, higher certainty corresponds to higher security, achieving a 100% security rate for all scenarios but "CWE-476 0-c" and "CWE-787 0-c". Second, Figures 20 and 21 display the breakdown results of the 350M and 6.1B models, respectively, at temperature 0.4. Finally, Table 7 provides detailed statistics, in absolute numbers of programs, for Figure 10.

C MORE EXAMPLES OF GENERATED CODE
This section provides additional example programs generated by LM and SVEN to further illustrate their behaviors.
CWE-089, SQL Injection SQL injection occurs when an SQL command is constructed from unneutralized user-controlled elements that can change the command's intention. The prompt of "CWE-089 1-py" is shown below. In contrast to LM, code generated by SVEN sec is always safe from SQL injection, e.g., by using query parameters:
query = "INSERT INTO `messages` (`username`, `message`) " \
        "VALUES (%s, %s);"
cursor.execute(query, (username, message))
CWE-078, OS Command Injection This vulnerability happens when user-controlled inputs are used to construct OS commands without sanitization (similar to CWE-089). The prompt of "CWE-078 0-py" is shown as follows:
# get a directory name from request
dirname = request.args.get('dirname')
# construct a "ls" command to list the contents of dirname
# run the command
LM and SVEN vul frequently generate unsafe code that constructs OS commands using string formatting:
cmd = "ls -l {}".format(dirname)
out = subprocess.check_output(cmd, shell=True)
Programs produced by SVEN sec are more likely to be safe. For example, the code below passes the arguments as a list to subprocess, which enables subprocess to perform escaping and quoting:
cmd = ["ls", "-l", dirname]
subprocess.check_output(cmd, stderr=subprocess.STDOUT)
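The difference between the two completions above can be demonstrated concretely. The `shlex.quote` call is our own addition, not part of the generated code; it shows what explicit escaping of the same input would produce:

```python
import shlex

def build_ls_command(dirname):
    """Safe variant: arguments as a list, so the process is exec'ed
    directly and shell metacharacters in dirname stay inert."""
    return ["ls", "-l", dirname]

malicious = "somedir; rm -rf /"

# Unsafe variant: once a shell parses this string (shell=True), it
# sees two commands separated by ';'.
unsafe = "ls -l {}".format(malicious)

safe_cmd = build_ls_command(malicious)  # ';' is just part of one argv entry
quoted = shlex.quote(malicious)         # how explicit escaping would render it
```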

Figure 1: A conceptual visualization of our objective for security hardening and adversarial testing.

Figure 2: Visualization of controlled code generation vs. vulnerability detection, repair, and injection.

Figure 9: Overall security rate on our main CWEs. The temperature is 0.8.

Figure 1 depicts a conceptual trade-off between security control and functional correctness. To verify this trade-off experimentally, we evaluate the effect of varying strengths of security control and functional correctness during training on model performance.
L CT jointly optimizes both prefixes, minimizing P(x_t | h_{<t}, ¬c) relative to P(x_t | h_{<t}, c). Similar to L LM, L CT is applied on tokens in security-sensitive code regions whose masks are set to 1. Note that even with the presence of L CT, L LM remains desired because L LM serves to increase P(x_t | h_{<t}, c) in an absolute manner. Loss Term for Preserving Functional Correctness We leverage a third loss term L KL that computes the KL divergence between P(x | h_{<t}, c) and P(x | h_{<t}), i.e., the two next-token probability distributions produced by SVEN and the original LM, respectively.
…, which requires a larger dataset to achieve desirable accuracy; (ii) SVEN's training only updates the small prefixes without modifying the huge LM; (iii) SVEN's training accesses the LM and benefits from the LM's strong code reasoning ability. Indeed, previous works have shown that continuous prompts are effective in low-data settings.

Table 1: Statistics of our training and validation datasets. # total is the total size (i.e., the number of programs). # for languages is the size for each programming language. # for splits is the size for training and validation. LoC is the average number of source lines. The CWEs are sorted by size.
In Appendix A, we provide more details about our hyperparameters and training cost.
Next, we provide interesting code examples produced by LM, SVEN sec, and SVEN vul, for three of our evaluation scenarios. More examples can be found in Appendix C. For these examples, the base LM is always CodeGen-2.7B. These examples qualitatively show that SVEN is able to capture diverse security-related program behaviors.
Example I: CWE-476, Null Pointer Dereference The prompt for "CWE-476 2-c" is shown in Figure 6 (a). Since malloc returns a null pointer when the allocation fails [10], the returned pointer must be checked before any dereference to ensure security. LM and SVEN vul frequently generate programs that dereference buf right after malloc without any NULL check. SVEN sec significantly increases the likelihood of generating appropriate failure checks to ensure security. The code below is such an example. The program first runs a NULL check for buf. Further, it even produces an additional test on the return value of fgets, which can be NULL if fgets fails [7].

Table 3: Comparison between CodeGen LMs [57] and SVEN on the ability to generate functionally correct code, measured by pass@k scores on the HumanEval benchmark [26].
Figure 14: Security rate across prompt perturbations. The base model is CodeGen-2.7B and the sampling temperature is 0.4.
Figure 17: Security rate on 4 more CWEs that are not included in SVEN's training set. The corresponding scenarios are adapted from [60] and are detailed in Table 5. For this experiment, the base model is CodeGen-2.7B and the temperature is 0.4. The overall security rates of LM, SVEN sec, and SVEN vul are 53.4%, 77.1%, and 44.7%, respectively.
Figure 18: Security rate on 13 more CWEs that are not included in SVEN's training set. The corresponding scenarios are adapted from [68] and are detailed in Table 6. For this experiment, the base model is CodeGen-2.7B and the temperature is 0.4. The overall security rates of LM, SVEN sec, and SVEN vul are 49.1%, 57.3%, and 44.8%, respectively.

Table 4: Hyperparameter configurations and training cost when we apply SVEN to different LMs.

Table 5: Scenarios for 4 CWEs that are not included in SVEN's training set. These scenarios are adapted from [60].

Table 6: Scenarios for 13 CWEs that are not included in SVEN's training set. These scenarios are adapted from [68].