Traces of Memorisation in Large Language Models for Code

Large language models have gained significant popularity because of their ability to generate human-like text and potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers with data extraction attacks. In this work, we explore memorisation in large language models for code and compare the rate of memorisation with that of large language models trained on natural language. We adopt an existing benchmark for natural language and construct a benchmark for code by identifying samples that are vulnerable to attack. We run both benchmarks against a variety of models, and perform a data extraction attack. We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts. From the training data that was identified to be potentially extractable we were able to extract 47% from a CodeGen-Mono-16B code completion model. We also observe that models memorise more as their parameter count grows, and that their pre-training data are also vulnerable to attack. We also find that data carriers are memorised at a higher rate than regular code or documentation and that different model architectures memorise different samples. Data leakage has severe outcomes, so we urge the research community to further investigate the extent of this phenomenon using a wider range of models and extraction techniques in order to build safeguards to mitigate this issue.


INTRODUCTION
In recent years, Large Language Models (LLMs) have garnered considerable interest in the realm of Natural Language Processing (NLP) owing to their exceptional accuracy in performing a broad spectrum of NLP tasks [36]. These models, trained on extensive amounts of data, exhibit increased accuracy and emergent abilities as their parameter count grows from millions to billions [52]. LLMs designed for coding are also trained on vast amounts of data and can effectively learn the structure and syntax of programming languages. As a result, they are highly adept at tasks like generating [21], summarising [1], and completing code [30].
Large language models also exhibit emergent capabilities [50]. These abilities cannot be predicted by extrapolating scaling laws and only emerge at a certain critical model size threshold [50]. This makes it appealing to train ever-larger models, as capabilities such as chain-of-thought prompting [51] and instruction tuning [42] only become feasible in models with more than 100B parameters [50].
The issue of memorisation in source code is distinct from that of natural language. Source code is governed by different licences that reflect different values than natural language [16,23]. Hence, in addition to privacy considerations, the memorisation of source code can have legal ramifications. The open-source code used in LLM training for code is frequently licensed under non-permissive copy-left licences, such as GPL or the CC-BY-SA licence employed by StackOverflow [2]. Reusing code covered by these licences without making the source code available under the same licence is considered a violation of copyright law. In some jurisdictions, this leaves users of tools such as CoPilot at legal risk [2,16,23]. Licences are unavoidably linked to the source code, as they enforce the developers' commitment to sharing, transparency, and openness [2,16]. Sharing code without proper licences is also ethically questionable [2,23,46].
Memorised data can also include private information [10,13,28]. These privacy concerns extend to code, which can contain credentials, API keys, emails, and other sensitive information as well [2,4]. Memorisation could therefore put the private information contained in the training data at risk.
Recently, attacks which leverage memorisation have successfully extracted (or reconstructed) training data from LLMs [3,5,13,29]. The US National Institute of Standards and Technology (NIST) considers data reconstruction attacks to be the most concerning type of privacy attack against machine learning models [41]. OWASP classifies Sensitive Information Disclosure (LLM06) as the sixth most critical vulnerability in LLM applications. Larger models are more likely to memorise more data and are more vulnerable to data extraction [5,13,29,41]. The effort to create ever larger LLMs, therefore, creates models which carry more risk.
To our knowledge, previous studies have investigated data memorisation and extraction attacks in natural language, but there has been no empirical investigation of LLMs for code. In this work, we investigate to what extent large language models for code memorise their training data and how this compares to memorisation in large language models trained on natural language. There is no comprehensive framework or approach for measuring memorisation.
We start by defining a data extraction security game that is grounded in the theory behind membership inference attacks and the notion of k-extractability. Using this security game we define a framework to quantify memorisation in LLMs. We use data extraction as an estimator of memorisation. While memorisation of training data can manifest in the form of non-exact duplication, measuring the rate of data extraction provides a lower bound of memorisation in a model.
We perform experiments leveraging the SATML training data extraction challenge, an existing dataset for natural language. We extend this benchmark by testing memorisation on more models.
We construct a similar dataset for code, by mining data from the Google BigQuery GitHub dataset and by using a CodeGen code generation model [39]. Similarly to the natural language dataset, we first identify samples vulnerable to attack to build a benchmark. We then test a variety of models on this benchmark. We finally compare the rate of memorisation between text and code models.
Our key result: Large language models trained on code memorise their training data like their natural language counterparts and are vulnerable to attack. To summarise, the main contributions of this paper are:
• A novel approach, using a data extraction security game, to quantify memorisation rates of code or natural language models.
• A benchmark of key memorisation characteristics for 10 different models of different sizes.
• An empirical assessment of memorisation in code models demonstrating that (1) code models memorise training data, albeit at a lower rate than natural language models; (2) larger models, with more parameters, exhibit more memorisation; (3) data carriers (such as dictionaries) are memorised at a higher rate than, e.g., regular code, documentation, or tests; (4) different model architectures memorise different samples.
• We make the code to run the evaluation available to allow others to replicate our results and to evaluate other models.

BACKGROUND AND RELATED WORK

Memorisation
In the context of language models, memorisation refers to the ability of a model to remember and recall specific details of the data it has been trained on. This occurs when a model overfits the training data, meaning it becomes overly specialised and fails to generalise well to new or unseen data [17,19]. As a result, the model can accurately recall specific phrases, sentences, or even entire documents from the training data. Besides the privacy concerns explained in section 1, memorisation also causes an overestimation of performance. It has, for instance, been observed that CodeX can complete HackerRank problems without receiving the full task description [32].
While memorisation can lead to high accuracy, it is not necessarily an indication of good generalisation performance. A model that has memorised the training data may struggle to perform well on new or unseen data, leading to poor performance in real-world applications. Additionally, memorisation can reduce the ability of the model to adapt its output to specific use cases. For example, when slightly changing HackerRank problems, CodeX [14] struggles to produce a correct solution, instead regurgitating solutions for the original problem [32,47].

Membership Inference Attacks
Membership inference attacks are a type of attack that aims to determine whether a specific data point was included in the training data of a machine learning model. The goal of these attacks is to infer whether a given data point was used to train the model or not, without having access to the training data itself.
The first membership inference attack against machine learning models was proposed by Shokri et al. to target classification models deployed by Machine Learning as a Service (MLaaS) providers [45]. Since then the field has expanded and attacks have been proposed that target generative models [24] and LLMs [25]. Recently, membership inference attacks have been proposed against transformer-based image diffusion models such as Stable Diffusion [18].
We refer to the security game defined by Carlini et al. [9] to define a membership inference attack in Definition 1. In this game, the adversary wins if they have a non-negligible advantage, i.e. a success probability $> \frac{1}{2} + \epsilon$. In simpler terms, the adversary needs to be able to distinguish between data that was included and data that was not included in the training data for a given model, while only being allowed query access to the model and data distribution.
Membership inference attacks are a primitive for measuring the leakage of a machine learning model and are often a starting point for more extensive attacks [9,26,38]. While membership inference is a weaker privacy violation than memorisation, the National Institute of Standards and Technology (NIST) still considers membership inference to be a violation of the confidentiality of training data [26].
Definition 1 (Membership inference security game [9]). The game proceeds between a challenger $C$, an adversary $A$, a data distribution $\mathbb{D}$ and a model $f$:
(1) The challenger samples a training dataset $D \leftarrow \mathbb{D}$ and trains a model $f_\theta \leftarrow T(D)$ on the dataset $D$.
(2) The challenger flips a bit $b$, and if $b = 0$, samples a fresh challenge point from the distribution $(x, y) \leftarrow \mathbb{D}$ (such that $(x, y) \notin D$). Otherwise, the challenger selects a point from the training set $(x, y) \leftarrow D$.
(3) The challenger sends $(x, y)$ to the adversary.
(4) The adversary gets query access to the distribution $\mathbb{D}$, and to the model $f_\theta$, and outputs a bit $\hat{b}$.
(5) Output 1 if $\hat{b} = b$, and 0 otherwise.
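The five steps of the game can be sketched as a small simulation. This is only an illustration under strong assumptions: the "model" is a toy that perfectly memorises its training set, the distribution is a uniform draw from a list of integers, and the adversary only queries the model (not the distribution). The names `train`, `play_round`, `threshold_adversary` and `win_rate` are hypothetical and do not come from [9].

```python
import random


def train(dataset):
    """Toy stand-in for T(D): the 'model' returns confidence 1.0 for memorised points."""
    memorised = set(dataset)
    return lambda point: 1.0 if point in memorised else 0.0


def play_round(population, train_size, adversary, rng):
    """Play one round of the game and return 1 if the adversary wins, else 0."""
    # (1) Sample a training dataset D from the distribution and train a model on it.
    dataset = rng.sample(population, train_size)
    model = train(dataset)
    # (2) Flip a bit b: b = 0 gives a fresh point outside D, b = 1 a point from D.
    b = rng.randint(0, 1)
    if b == 0:
        members = set(dataset)
        challenge = rng.choice([p for p in population if p not in members])
    else:
        challenge = rng.choice(dataset)
    # (3)-(4) The adversary receives the point, queries the model and outputs a guess.
    guess = adversary(model, challenge)
    # (5) The adversary wins if the guess equals b.
    return int(guess == b)


def threshold_adversary(model, point):
    """Guess 'member' whenever the model is confident about the point."""
    return 1 if model(point) > 0.5 else 0


def win_rate(adversary, rounds=200, seed=0):
    """Estimate the adversary's success probability over repeated rounds."""
    rng = random.Random(seed)
    population = list(range(100))
    wins = sum(play_round(population, 50, adversary, rng) for _ in range(rounds))
    return wins / rounds
```

Against this fully memorising toy model the threshold adversary always wins, whereas an adversary that ignores the model stays close to a success probability of one half, which is the baseline the advantage is measured against.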

Data Extraction Attacks
Data extraction attacks are a stronger type of attack where an adversary extracts a data point used to train a model. Attacks can be divided into two types for LLMs, namely guided and unguided attacks [3].
In an unguided attack, the adversary does not know the sample to be extracted from the model. The adversary simply attempts to extract any training point, contained anywhere in the training corpus [10,12,13,40].
In this work, we focus on targeted (guided) attacks. In a targeted attack, the adversary is provided with a prefix, which is the first half of the sequence, and is then tasked with recovering the suffix, which is the second half of the sequence. Targeted attacks are more security-critical as they allow the targeting of specific information, such as the extraction of emails [3,10,23,27,38].
We ground our definition of memorisation and extractability in the definition of k-extractability provided by Biderman et al., which was originally inspired by the framework of k-eidetic memorisation introduced by Carlini et al. [13].
Definition 2 (k-extractability [5]). A string s is said to be k-extractable if it (1) exists in the training data, and (2) is generated by the language model by prompting with k prior tokens.
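The two conditions of Definition 2 can be checked mechanically. The sketch below is a minimal illustration, assuming that sequences are tuples of token ids, that the training data is available as a set of full sequences, and that the model is exposed through a hypothetical `generate(context, n_tokens)` callable; none of these names come from [5].

```python
def is_k_extractable(context, s, generate, training_data):
    """Return True if s is k-extractable, where k = len(context).

    context: tuple with the k tokens that precede s
    s: tuple with the continuation to be recovered
    generate: callable(context, n_tokens) -> sequence of tokens (assumed interface)
    training_data: set of token tuples seen during training (assumed to be known)
    """
    # Condition (1): the sequence exists in the training data.
    in_training_data = (tuple(context) + tuple(s)) in training_data
    # Condition (2): prompting with the k prior tokens reproduces s.
    generated = tuple(generate(tuple(context), len(s)))
    return in_training_data and generated == tuple(s)


def make_lookup_model(memorised_sequences):
    """Toy model that regurgitates the first memorised sequence matching the context."""
    def generate(context, n_tokens):
        k = len(context)
        for sequence in memorised_sequences:
            if tuple(sequence[:k]) == tuple(context):
                return tuple(sequence[k:k + n_tokens])
        return ()
    return generate
```

In the toy lookup model a sequence is extractable only if the model has stored it, so a sequence that is in the training data but was not memorised fails condition (2).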

Natural Language Dataset
The dataset used for the attack on natural language models is provided by the SATML'23 Language Model Data Extraction Challenge. The dataset consists of 15K training, 1K validation, and 1K test samples. The test samples were not released and were only used by the competition organisers. Each sample is divided into a 50-token prefix and a 50-token suffix. For our evaluation, we use the validation set. The participants had to use a GPT-NEO 1.3B model to extract the suffix using the prefix. The winning entry prompted the model with the prefix, extracted 100 suffixes for each prefix, and trained a binary classifier to select the most correct suffix [3].
The dataset was constructed by analysing the Pile [22], which is the corpus used to train the GPT-NEO family of models [7]. The Pile is an 825GB English language dataset, which itself consists of 22 high-quality sub-datasets, ranging from books and academic papers to code [22]. The Pile was constructed to improve the cross-domain applicability of LLMs. The Pile [22] is also used as a pretraining dataset for a variety of code models [2]. The organisers extracted all the unique 150 token sequences from the 800GB corpus. Sequences were filtered to include only those that are duplicated at least 5 times. They were then split into a pre-prefix, prefix, and suffix, each 50 tokens long. The GPT-NEO model was then prompted with the pre-prefix and prefix (100 tokens). If the model produces the suffix, using greedy decoding, the sequence is considered extractable. The challenge dataset was constructed from the extractable sequences and only includes the prefix and suffix.
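The duplicate filter and the three-way split described above can be sketched as follows. This is a naive in-memory illustration, assuming documents are lists of token ids; it is not the organisers' implementation, which would need a far more scalable method for a corpus of this size. The function names `duplicated_windows` and `split_sequence` are hypothetical.

```python
from collections import Counter


def duplicated_windows(documents, window=150, min_count=5):
    """Return every token window of the given length that occurs at least min_count times."""
    counts = Counter()
    for tokens in documents:
        # Slide over the document; documents shorter than the window yield nothing.
        for start in range(len(tokens) - window + 1):
            counts[tuple(tokens[start:start + window])] += 1
    return [sequence for sequence, n in counts.items() if n >= min_count]


def split_sequence(sequence, part=50):
    """Split a sequence of 3 * part tokens into pre-prefix, prefix and suffix."""
    return sequence[:part], sequence[part:2 * part], sequence[2 * part:3 * part]
```

With the default arguments this corresponds to 150-token windows duplicated at least five times, each split into three 50-token parts; the model is then prompted with the first two parts and asked to reproduce the third.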

APPROACH
To measure memorisation in LLMs4Code we first formally define a data extraction game and we construct a dataset of code samples.

Data Extraction Security Game
We consider the models as black-box systems. We define a security game inspired by the membership inference attack security game in Definition 1 and the notion of k-extractability in Definition 2:
Definition 3 (Data extraction security game). Given a challenger $C$, an adversary $A$, a data distribution $\mathbb{D}$ and a model $f$ the game is defined as follows:
(2), but has no access to the weights, unlike the game proposed by Al-Kaswan et al. [3]. The adversary then predicts the suffix (3) and wins if it matches the actual suffix in the training data.
There are some difficulty modifiers to adjust the difficulty of the challenge:
(1) The selection of the dataset $D \subset \mathbb{D}$. As observed by previous works, not all training samples are as hard to extract as others. In particular, samples that are highly duplicated or outliers [12] are more vulnerable to attack.
(2) The choice of model $f_\theta$. Some models are more likely to memorise samples than others, namely larger models have been observed to memorise more samples [5,8,10,11,13,29].
(3) The length of the prefix $p$. It has been found that longer prefixes elicit more memorisation [11,13,29]. Note that this length is equivalent to the $k$ in Definition 2.
(4) The victory condition $\hat{s} = s$. Instead of targeting verbatim memorisation, a fuzzy match could also be considered [29].
In this work, we take inspiration from the competition organised by Carlini et al. and use modifiers (1) and (3) to construct a set of extractable samples. We shorten the prefix of the extractable samples and use this set of hard but extractable samples to perform an evaluation on different models (2). We also measure fuzzy match scores (4) and compare them with the exact match rate.

Code Dataset Construction
To measure the memorisation in LLMs for code, we first need to construct a dataset similar to the one used in the SATML'23 Language Model Data Extraction Challenge. As there is no code benchmark available, we build one from scratch. This presents several challenges: Firstly, for some code models, the training data is not published by the authors, which makes it impossible to determine what data were included in the training of these models. We must therefore experimentally determine which data points were presumably included in the training data for each of the models. This has implications for the transferability of the benchmark set, as the training data might differ for each model. Not all models are trained on all programming languages either, so we must select a common language to test multiple models.
Secondly, since all publicly available code is potentially part of the training data, the search space for extractable data points is massive.
We limit our evaluation to Python since we found that the vast majority of models support Python and have some Python in their training corpus. We source the potentially memorised data from GitHub. We mine Python files using the Google BigQuery GitHub dataset. We filter the files to include only non-binary files longer than 150 tokens. We only consider files that have five or more duplicates on GitHub and randomly select 150 token spans from anywhere in the file. Similarly to the natural language dataset, we split the 150 token span into a pre-prefix, a prefix, and a suffix, each 50 tokens long. We prompt a CodeGen-2B-Mono model [39] with the pre-prefix and prefix. We select this model because it is decently sized (there are smaller and larger variants of the model), it is specifically trained on Python and it is the highest performing publicly available model for the Human-Eval benchmark [39].
If the model can predict the suffix, with the 100-token prompt, we consider the sample to be extractable. We randomly select 1K extractable samples to perform our evaluation. We construct the dataset from the prefixes and suffixes.
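The construction procedure can be summarised in a short sketch. It assumes that files are already tokenised and pre-filtered for duplication, and that the model is wrapped in a hypothetical `greedy_complete(prompt, n_tokens)` callable; the function names and the one-span-per-file choice are illustrative assumptions, not the released implementation.

```python
import random


def sample_span(tokens, rng, span_len=150):
    """Pick a random span of span_len tokens, or None if the file is too short."""
    if len(tokens) < span_len:
        return None
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[start:start + span_len]


def is_extractable(span, greedy_complete, part=50):
    """Prompt with pre-prefix + prefix and check whether the suffix is reproduced."""
    prompt, suffix = span[:2 * part], span[2 * part:3 * part]
    return list(greedy_complete(prompt, len(suffix))) == list(suffix)


def build_benchmark(files, greedy_complete, rng, n_samples, span_len=150, part=50):
    """Collect extractable (prefix, suffix) pairs and randomly keep n_samples of them."""
    candidates = []
    for tokens in files:
        span = sample_span(tokens, rng, span_len)
        if span is not None and is_extractable(span, greedy_complete, part):
            # The benchmark keeps only the prefix and the suffix, not the pre-prefix.
            candidates.append((span[part:2 * part], span[2 * part:3 * part]))
    rng.shuffle(candidates)
    return candidates[:n_samples]
```

Dropping the pre-prefix from the stored samples is what makes the final benchmark harder than the selection step: models under evaluation only see 50 tokens of context instead of 100.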
Our dataset construction procedure differs from the procedure used by Carlini et al. in one aspect. Our dataset does not guarantee that for every sample $(p, s)$, consisting of a prefix $p$ and a suffix $s$, there does not exist a $(p, s')$ in the dataset where $s \neq s'$. There are two main reasons for omitting this step:
• For many models in our evaluation we do not have access to the training data and possible pre-training data. The organisers could guarantee that the model under investigation was only exposed to the Pile. We want our approach to work for settings in which the investigator has no access to the training data.
• The computational cost of identifying all unique samples $(p, s)$ is extremely large for a dataset of this size and our aim is to create an approach that does not require such enormous compute capabilities.

To compare the rate of memorisation, we run the attack on both natural language and code models and compare the results. Intuitively we expect code models to be able to memorise more since code is more structured and there is much more natural language data available.
RQ2: What type of data are memorised by code-trained LLMs? We want to know if there is a code pattern that is memorised. To do this we take the set of samples vulnerable to attack and we manually analyse them by constructing a classification of the samples.
RQ3: How much overlap is there between the memorised samples in different code-trained LLMs? Do some models memorise different samples than others? Could we perhaps leverage a selection of different models to extract more data and do some models memorise more of a certain type of sample than others?
RQ4: To what extent do LLMs trained on code leak their pre-training data? Finally, we want to see if pre-trained models can also leak their pre-training data. To investigate this, we select a code model that has been pre-trained on the Pile and perform the natural language attack. We compare the performance of the original base model with that of the code-trained model to see how much training data is retained. When referring to a base model in this paper, we only mean models that were initialised with the architecture and weights of a different model.

Models
The models, their developers, and their respective sizes are shown in Table 1. We limit our evaluation to left-to-right autoregressive models, which are available on the HuggingFace Hub. For natural language evaluations, we used GPT-NEO [7], the models used to build the natural language dataset. We select GPT-2 [43] to test the transferability of the prompts to a model trained on a different corpus. GPT-2 is trained on the WebText corpus, which was mined by finding all the outlinks on Reddit with at least 3 karma. We also investigate the Pythia [6] suite of models, which are trained on the Pile [22].
The CodeGen suite of models [39] features a number of different models in a variety of sizes. The models were initialised and first pre-trained on the Pile; these models are the CodeGen-NL models. The CodeGen-NL models are then further trained on a dataset containing multiple programming languages to create the CodeGen-Multi models. The Multi models were finally trained on a dataset consisting of only Python code to create the CodeGen-Mono models. The CodeGen2 and InCoder models are both designed for infilling but have autoregressive capabilities as well [21,39]. CodeParrot is a pre-trained GPT-2 model fine-tuned on the APPS dataset [44]. PyCodeGPT is a small and efficient code generation model based on the GPT-NEO architecture [53]. GPT-Code-Clippy is a pre-trained GPT-NEO model fine-tuned on code.

Categorisation
We build a classification of the 1K extractable 150-token samples by doing an explorative study. We find the following categories and classify each of the samples into one category. For simplicity, we classify each sample which has two purposes into its majority category. We identified 5 different categories, as shown in Table 2.

Extraction
We prompt the model under investigation with the prefix. We use the standard generation pipeline and the default generation configuration of the model as defined in the model configuration. For models which use a different tokeniser than the CodeGen tokeniser used for the dataset construction, we simply tokenise the sample again using the new tokeniser. Any samples that are too short under the new tokeniser are discarded.
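The re-tokenisation step can be sketched as below. The text does not specify how a re-tokenised sample is trimmed, so this sketch assumes that the last tokens of the prefix and the first tokens of the suffix are kept and that prefix and suffix are re-tokenised separately; both choices are assumptions. The toy tokenisers `char_encode`, `char_decode` and `word_encode` are hypothetical stand-ins for real tokenisers.

```python
def retokenise(samples, decode_old, encode_new, prefix_len=50, suffix_len=50):
    """Re-tokenise (prefix, suffix) pairs and drop samples that become too short."""
    kept = []
    for prefix_ids, suffix_ids in samples:
        new_prefix = encode_new(decode_old(prefix_ids))
        new_suffix = encode_new(decode_old(suffix_ids))
        if len(new_prefix) < prefix_len or len(new_suffix) < suffix_len:
            continue  # too short under the new tokeniser: discard
        # Assumption: keep the tokens adjacent to the prefix/suffix boundary.
        kept.append((new_prefix[-prefix_len:], new_suffix[:suffix_len]))
    return kept


def char_encode(text):
    """Toy 'old' tokeniser: one token id per character."""
    return [ord(c) for c in text]


def char_decode(ids):
    """Inverse of char_encode."""
    return "".join(chr(i) for i in ids)


def word_encode(text):
    """Toy 'new' tokeniser: one token per whitespace-separated word."""
    return text.split()
```

A coarser tokeniser produces fewer tokens for the same text, which is why some samples fall below the required length and have to be discarded.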

Evaluation Metrics
The models are prompted in a one-shot fashion with greedy decoding. We measure the exact match rate (EM). Additionally, we also measure the fuzzy match, using the BLEU-4 score. For the model size, we measure the total parameter count. For replication purposes, we only consider models that are runnable on our hardware. We found that the limitation was the GPU memory, so there are some models that we did consider but did not fit the GPU memory (such as InCoder-6.7B and StarCoder-base).
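Both metrics can be illustrated on token sequences. The sketch below is a simplified sentence-level BLEU-4 with uniform weights, a brevity penalty and no smoothing; it is an assumption that this matches the variant used in the evaluation, and an established library implementation would normally be preferred. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 between two token sequences."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precision += 0.25 * math.log(overlap / total)
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_precision)


def exact_match_rate(predictions, references):
    """Fraction of predicted suffixes that equal the reference suffix token for token."""
    matches = sum(list(p) == list(r) for p, r in zip(predictions, references))
    return matches / len(references)
```

Exact match only rewards verbatim reproduction, whereas BLEU-4 gives partial credit when a prediction deviates in a few tokens, so it acts as the fuzzy victory condition.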

RESULTS
We present the results of our experiments to answer the research questions. Results are grouped per research question.

Natural Language vs Code
The results of the attack are shown in Table 4. We found that we are able to extract 56% of the samples with the largest GPT-NEO model. The medium-sized model, which was used to construct the dataset, achieved an exact match rate of 46%. The models which were not trained on the Pile [22] did not memorise much if any of the samples.
As shown in Figure 1, for the models that are trained on the Pile [22], memorisation scales with the size of the model. We do not observe a clear difference between the Pythia and Pythia-dedup models, indicating that their deduplication was unsuccessful in preventing the memorisation which we measure. As the number
Table 3 and Figure 2 show the results of the experiments. We found that we were able to extract 47% of samples from the largest CodeGen-Mono model we tested. The 2B parameter model, which was used to generate the test set, was only able to extract 30% of the samples, which is lower than the performance of GPT-NEO 1.3B on the natural language dataset. This indicates that our constructed code dataset is harder than the natural language dataset, but that difficulty modifier (2) from section 3, which was supported by previous works and Definition 1, also holds for our code dataset.
Figure 3 shows the relation between the Exact Match rate and the BLEU-4 score for code-trained models. We can observe that there is a clear relation between the exact match rate and the BLEU-4 score, especially above an exact match rate of 0.2. We see a similar pattern in Figure 4. The Pearson correlation coefficient between the Exact Match rate and the BLEU-4 score is 0.982 and 0.967 for natural language and code, respectively, indicating a very strong positive correlation.
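For reference, the Pearson correlation coefficient between two lists of per-model scores can be computed as in the sketch below. The function name `pearson` is hypothetical, and the inputs are expected to be paired lists such as exact match rates and BLEU-4 scores, one entry per model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

A value close to 1 means that models with a higher exact match rate also obtain a proportionally higher BLEU-4 score, so the fuzzy and the verbatim victory conditions rank the models in nearly the same way.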
In our evaluation, we also tested multiple models that were not primarily trained on programming languages. We found that CodeGen-NL and GPT-NEO were unable to memorise as much as similarly sized code-trained models, but were still able to achieve an exact match score of around 10%. Similarly to natural language models, we also find that memorisation scales with model size in Figure 2. In this case, however, we see the logarithmic relationship within the same model architectures. We also observe that the CodeGen-Mono models memorise more samples than the CodeGen-Multi models for every model size. This indicates that the extra training on Python code increases the memorisation rate. We find a Pearson's correlation coefficient between the Exact Match rate and the size of the model of 0.797 and
RQ1: Code-trained LLMs memorise their training data at a lower rate than Natural Language trained LLMs. In both natural language and code-trained models, the rate of memorisation scales with the model size. Continued exposure to the same data increases the rate of memorisation.

Type of Memorised Samples
As can be observed in Figure 5, the majority of samples in our dataset are code logic followed by dictionaries. We colour-coded the samples to make a distinction between memorised and non-memorised samples. We find that data carriers and licence information are being memorised at a higher rate than code logic, documentation, and test code.
During the tagging process, we did find multiple examples of names, emails, and usernames being memorised by the model, such as the example in Figure 6. We also found an example of some API keys; further investigation shows that this instance was a sample that was easily findable using search engines.
RQ2: LLMs trained on code memorise data carriers and licence information at a higher rate than regular source code, documentation, and testing code. Code-trained LLMs are also able to memorise and emit sensitive information.

Which Model Memorises What
In Figure 7 we plot the overlap in memorised samples between different models. We limit the investigation to the CodeGen, CodeGen2 and CodeParrot families of models.
For instance, we find that 86% of all samples which were memorised by CodeParrot-small are also memorised by CodeParrot, while only 24% of the samples memorised by CodeParrot are memorised by CodeParrot-small. We find similar patterns when comparing the different-sized CodeGen models. The CodeGen-2 family of models memorised fewer samples and is in line with the CodeGen-350M models despite the size difference. The larger models in a family memorise more samples. There are a few distinct samples that are only memorised by the small models, but we find that this is generally limited.
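The overlap percentages are directional, which a small sketch makes explicit. It assumes that, for each model, the set of memorised sample identifiers is available; the function names `overlap` and `models_per_sample` are hypothetical.

```python
from collections import Counter


def overlap(memorised_a, memorised_b):
    """Fraction of the samples memorised by model A that model B also memorised."""
    a, b = set(memorised_a), set(memorised_b)
    return len(a & b) / len(a) if a else 0.0


def models_per_sample(memorised_by_model):
    """Count, for every sample id, how many models memorised it."""
    counts = Counter()
    for samples in memorised_by_model.values():
        counts.update(set(samples))
    return counts
```

Because the denominator is the first model's set, `overlap(small, large)` can be high while `overlap(large, small)` is low, which is the pattern expected when the smaller model memorises mostly a subset of what the larger one does.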
We find that the CodeGen-Multi models tend to memorise around 50% of the samples memorised by their respectively sized Mono variant, while the Mono models memorise around 70% of the samples memorised by the Multi variant. The only exception is the smallest model, where the Multi and Mono models memorised very similar amounts of samples. In Figure 8 we find that 40% of the samples are not memorised by any model at all. But there are 73 samples that are memorised by 12 of all the 13 models. This indicates that there is an inherent difficulty in some samples.
Figure 9 shows the memorisation of each of the categories per model. We find that all plotted models memorise more code and data carriers than any of the other categories, which is supported by Figure 5. As models grow larger they memorise relatively more code and fewer data carriers. In absolute terms, the number of memorised samples from the Dict category still increases.
Combined with the findings in RQ1, we can therefore conclude that the extra training on Python makes the models memorise more samples, many of which are the same, and that the smaller models lack the capacity to memorise more data.
RQ3: Each model family memorises a unique set of samples, and smaller models within the same family remember only a subset of what their larger counterparts do.
In Table 5 and Figure 10 we plot the results for the leakage of pre-training data. We find that we can extract 58% of all natural language samples from the CodeGen-NL model. This result aligns with the similarly sized Pythia and GPT-NEO models in Table 4. Tuning the model on code data reduces the extraction rate to 31% and tuning on Python code further reduces the extraction rate to 20%.
Multi vs Mono. The findings indicate that the CodeGen-Mono models memorised more than the Multi models. This is explainable by the fact that the Mono models have had more exposure to Python code and therefore to code in our dataset. Recall that the models are first trained on the Pile, which contains all the GitHub repos with more than 100 stars [22]. The models are further trained on a general dataset of code, and finally on a dataset of Python code. This means that the models could have possibly been trained on the same file three times.
Size and Memorisation. We find that the rate of memorisation scales with the size of the model: across all models, the rate of memorisation increases as the size increases. This is in line with the findings of previous work which found that larger LLMs memorise training data faster [48] and at a higher rate than small models [5,8,10,11,13]. Our results also confirm that the log-linear relation between size and memorisation, which has been observed by other works [11,29], holds for LLMs trained on code as well.
Our experiments which investigate the overlap of memorised sequences in different sizes of code models show that the memorised samples of smaller models are mostly a subset of the large models. This indicates that as a model grows larger it mostly memorises more and not necessarily different data.
Biderman et al. investigated memorisation in the Pythia suite of models [6] and found that 94% of the sequences memorised by the 70M model were also memorised by the 12B model, but those only accounted for 19% of the sequences that the 12B model memorised. We find a similar relation between the largest and smallest CodeGen-Mono models: CodeGen-Mono-16B memorised 93% of the samples which were memorised by CodeGen-Mono-350M; conversely, only 20% of the samples memorised by CodeGen-Mono-16B were memorised by CodeGen-Mono-350M.
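A log-linear relation means that the exact match rate grows roughly linearly in the logarithm of the parameter count. The sketch below fits such a line with ordinary least squares; the function name `fit_log_linear` is hypothetical and base-10 logarithms are an arbitrary choice. It expects one parameter count and one exact match rate per model.

```python
import math


def fit_log_linear(param_counts, em_rates):
    """Least-squares fit of em_rate = slope * log10(param_count) + intercept."""
    xs = [math.log10(p) for p in param_counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(em_rates) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, em_rates)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Under this model the slope is the increase in exact match rate for every tenfold increase in parameters, which gives a single number to compare model families.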

Rate of Memorisation.
Note that the results obtained from experiments in section 5 suggest that memorisation in LLMs trained on code is less than in those trained on natural language. The largest 6.9B parameter Pythia model memorised 55% more samples than the best-performing CodeGen-Mono model. Intuitively we would expect the memorisation to be more in code models (as explained in section 4), but there might be multiple reasons for this observation:
• Our dataset construction procedure differs from the procedure used by Carlini et al. The natural language dataset guarantees that for every sample $(p, s)$, consisting of a prefix $p$ and a suffix $s$, there does not exist a $(p, s')$ in the dataset where $s \neq s'$. Without this guarantee, for some prefixes the model might predict a different suffix that is also in the training data, which would be counted as a non-memorised sample. Providing this guarantee was not possible in our case, since we do not exactly know the training data for the code models under investigation. The training dataset was only deduplicated on the file level.
• The structured nature of code might elicit less memorisation in general. This is supported by the higher rate of memorisation in dictionaries compared to regular code, especially in smaller models. Their relative information density makes it hard to generalise for these samples specifically and the models might therefore revert to memorisation.
Deduplication.The deduplicated Pythia [6] models are not significantly more robust against our extraction than their regular counterparts.At first glance, this is a surprising finding.It has been reported that deduplicating the training data makes LLMs more secure against data extraction [13,31,33].
A similar investigation by Biderman et al. into memorisation in the Pythia suite of models also found a relatively small difference between the two variants [5]. The authors theorise that this observation might be due to the training setup. The deduplicated models were trained for 1.5 epochs to offset the smaller data size and to train on the same number of tokens. This effectively oversamples the entire dataset.
Based on our observations, we can offer two alternative explanations: (1) The training data was deduplicated on the file level [6]. Our evaluation concerns spans of tokens that can be duplicated across files. The same licence information, for instance, is present in the preamble of many different files and will still be present in the deduplicated dataset. (2) The samples memorised by the Pythia models might be outliers that elicit memorisation. We observed that information carriers are more likely to be memorised than other types of samples, so the deduplication might not have had much impact on these samples.
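The first explanation can be illustrated with a toy corpus. The sketch below assumes exact-hash file deduplication and counts repeated spans in lines rather than tokens for brevity; the licence header, the helper names, and the span length are all hypothetical.

```python
import hashlib
from collections import Counter


def dedup_files(files):
    """Keep one copy of each byte-identical file (file-level deduplication)."""
    seen, kept = set(), []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def repeated_spans(files, n=3):
    """Return the n-line spans that occur in more than one file, with their file counts."""
    counts = Counter()
    for text in files:
        lines = text.splitlines()
        # Use a set so that a span is counted at most once per file.
        spans = {"\n".join(lines[i:i + n]) for i in range(len(lines) - n + 1)}
        counts.update(spans)
    return {span: c for span, c in counts.items() if c > 1}


# Hypothetical licence preamble shared by otherwise different files.
HEADER = "# Licensed under the Example Licence\n# See LICENCE for details\n#\n"
corpus = [HEADER + "x = 1\n", HEADER + "y = 2\n", HEADER + "x = 1\n"]

kept = dedup_files(corpus)
dupes = repeated_spans(kept)
print(len(kept), "files kept;", len(dupes), "span(s) still duplicated")
```

The exact duplicate file is removed, but the shared preamble survives in both remaining files, so span-level duplication persists after file-level deduplication.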

Implications
We propose a novel framework to measure the memorisation and extractability of training data in LLMs.
Model training. This work serves to inform researchers and practitioners who aim to train their own LLMs. We can confidently say that larger LLMs leak more and that smaller LLMs are therefore preferable from a safety perspective. In light of emergence [50], however, larger models are often preferable. We are already able to extract 73% and 47% of the text and code samples, respectively; even larger models like Codex [14] or StarCoder [34] might memorise even more data.
Secondly, we have shown that LLMs also leak their pre-training data even after multiple training rounds. The ability to recover pre-training samples has additional privacy and security implications for the transfer learning paradigm [2]. When creating and publishing a model, the base model should also be considered, as its pre-training data can be unintentionally exposed as well.
Finally, some types of data are more vulnerable to extraction than others. This information can be used to inform the data selection procedure. Some categories like dictionaries can be omitted entirely to reduce the amount of memorisation. Future work can investigate how training data can be curated and sanitised to reduce memorisation in LLMs.
Model deployment. The black-box setting of our evaluation has implications for MLaaS services as well. Since we do not require additional information about the model, our data extraction approach could be used against models that are offered through public APIs, such as GitHub Copilot, which is built on OpenAI's Codex [14]. While Copilot does employ a memorisation filter, it is relatively easy to bypass [28]. There is a need to develop stronger countermeasures to prevent data extraction from these models.
Framework. The framework and dataset provided can be used to evaluate different models. While our focus has been on left-to-right causal language models, different architectures, such as encoder-only models like CodeBERT [20] or encoder-decoder models like CodeT5 [49], might memorise different amounts and different types of training data.
Fair Use. Many existing LLMs for code make use of code licenced under copyleft and other non-permissive licences [2]. The use of public code to train LLMs for code is an instance of fair use, which is a defence that allows the use of copyrighted works in new and unexpected ways and exists in many jurisdictions [23]. If the output of the model is similar to the copyrighted input, fair use might no longer be applicable. The output then needs to conform to the licence terms of the copied input [23], which can include share-alike and attribution clauses [2].
Memorisation can therefore put the creators and users of LLMs for code at legal risk [23]. This risk extends to pre-trained models, as some pre-training corpora, including the Pile [22], also contain code licenced under non-permissive licences [2]. The risk can be avoided by training models with code licenced under permissive licences (such as BSD-3 or MIT) or by providing provenance information to trace the code back to its source so that the user of the output can abide by the original licence [23,34].
Extraction techniques. We were able to show that, using relatively simple greedy decoding and the notion of k-extractability, most text models and all code models leak data. This demonstrates the inherent leakiness of these models and serves as a stepping stone for more advanced and powerful attacks. One approach worth investigating is the use of prompt engineering to extract data. With hard or soft prompts [35], the model could be enticed to output more memorised data. Our work only prompts the models with the prefix, while different prompts might elicit more memorisation. Another approach is to explore the use of Membership Inference Attacks (MIAs) to increase the abilities of the attacks further. One could take inspiration from untargeted attacks and generate multiple suffixes per prefix using a different decoding method. The MIA can then serve to select the correct suffix [1].
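The basic attack discussed here, greedy decoding from a prefix followed by an exact comparison with the true suffix, can be sketched without any particular model library. In the sketch below the "model" is a hypothetical bigram lookup table standing in for an LLM; the function names and the toy training sequence are assumptions made for illustration.

```python
from typing import Callable, Sequence

# A "model" here is any function that maps a token context to the next token.
NextToken = Callable[[Sequence[str]], str]


def greedy_extract(next_token: NextToken, prefix: Sequence[str], length: int) -> list[str]:
    """Greedily generate `length` tokens that follow `prefix`."""
    context = list(prefix)
    for _ in range(length):
        context.append(next_token(context))
    return context[len(prefix):]


def is_extractable(next_token: NextToken, prefix: Sequence[str], suffix: Sequence[str]) -> bool:
    """Exact-match check: does greedy decoding from the prefix reproduce the suffix?"""
    return greedy_extract(next_token, prefix, len(suffix)) == list(suffix)


def make_toy_model(training_tokens: Sequence[str]) -> NextToken:
    """Hypothetical stand-in for an LLM: a bigram table built from one training sequence."""
    table: dict[str, str] = {}
    for cur, nxt in zip(training_tokens, training_tokens[1:]):
        table.setdefault(cur, nxt)  # keep the first continuation seen for each token
    return lambda context: table.get(context[-1], "<unk>")


train = "def add ( a , b ) : return a + b".split()
model = make_toy_model(train)

print(is_extractable(model, train[:3], train[3:8]))  # short suffix
print(is_extractable(model, train[:3], train[3:]))   # full remainder
```

With a real model, `next_token` would wrap a forward pass and an argmax over the vocabulary. The toy model reproduces the short suffix but diverges on the longer one, which shows how the exact-match criterion depends on the chosen suffix length.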

Limitations and Threats to Validity
Internal validity.
In our evaluation, we did not take into account the location of the samples. The samples are of a fixed token length but can originate from any arbitrary location in the file. Furthermore, Byte-Pair Encoding tokenisation can cause a sample to start or end in the middle of a word. We based our dataset construction on existing work [3,5], but samples from the beginning or end of the file could be easier to extract. We initially attempted untargeted extraction and found that the extracted samples predominantly came from the beginning of the file. Nevertheless, we chose the current approach because it makes our attack more versatile and enables us to extract samples from any location within the file.
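Sampling a fixed-length window from an arbitrary location can be sketched as follows. The prefix and suffix lengths, the seed, and the helper name `sample_window` are hypothetical choices for illustration and are not taken from the dataset construction described in this work.

```python
import random


def sample_window(tokens, prefix_len, suffix_len, rng):
    """Pick a fixed-length window at a random offset and split it into (prefix, suffix)."""
    total = prefix_len + suffix_len
    if len(tokens) < total:
        return None  # file too short to yield a sample
    start = rng.randrange(len(tokens) - total + 1)
    window = tokens[start:start + total]
    return window[:prefix_len], window[prefix_len:]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = [f"tok{i}" for i in range(20)]  # stand-in for a tokenised file
sample = sample_window(tokens, prefix_len=4, suffix_len=4, rng=rng)
print(sample)
```

Because the offset is uniform over the file, the window boundaries ignore word, statement, and function boundaries, which is the effect described above.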

External validity.
Our evaluation focuses on a limited number of models; other models might exhibit more or less memorisation. Our benchmark was constructed using a single model, and while we were able to show that our benchmark gave promising results for other models, other data sources and models should be used to construct more benchmarks.
The constructed datasets only consider duplicated sequences; this inherently limits the applicability of our attack on low-duplication data. While other works do state that models can also memorise unduplicated data, we cannot experimentally confirm this as we only apply coarse file-level deduplication.
In the construction of our dataset, we only considered Python code. We selected Python because it is supported by almost all code generation models. Other less-expressive languages could show different patterns and different degrees of extractability. Python is a very popular language, so these results might also not apply to less popular languages. We plan to extend our evaluation to include more programming languages in the future.

Construct validity.
We mainly use the exact match metric to measure memorisation in code models. This metric likely underestimates the actual number of memorised samples, as some might be slightly changed by the model. For this specific study, we focus on exact reproductions by the model, since we are primarily interested in the privacy and security aspects of memorisation. When examining the licensing aspects of memorisation, fuzzy match metrics may provide better insights. We included BLEU-4 to account for this, but we found that it is highly correlated with the exact match rate. However, there are no automated metrics available to measure non-literal infringement based on current legal standards [23].
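The gap between exact and fuzzy matching can be seen on a single near-copy. The sketch below uses the token-level similarity ratio from Python's `difflib` as a simple stand-in for a fuzzy metric; it is not BLEU-4, and the example sequences are hypothetical.

```python
from difflib import SequenceMatcher


def exact_match(pred, ref):
    """True only if the predicted tokens equal the reference tokens exactly."""
    return list(pred) == list(ref)


def fuzzy_similarity(pred, ref):
    """Token-level similarity in [0, 1] from difflib (a stand-in, not BLEU-4)."""
    return SequenceMatcher(None, list(pred), list(ref), autojunk=False).ratio()


reference = "return a + b".split()
verbatim = "return a + b".split()
near_copy = "return a + c".split()  # one token changed

for name, pred in [("verbatim", verbatim), ("near_copy", near_copy)]:
    print(name, exact_match(pred, reference), round(fuzzy_similarity(pred, reference), 2))
```

The near-copy fails the exact match test while still scoring high on the fuzzy metric, which is why exact match is a lower bound on memorisation.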
Ethical Considerations.
While this work does describe techniques that can potentially be used to extract sensitive information from models, we do so ethically. Our goal is to bring attention to the issue of memorisation in LLMs for code, to inform the users and creators of these models, and to provide them with tools to measure it. In this work, we therefore do not needlessly expose any private information, and we urge users of our framework to refrain from doing so as well. We target randomly selected sequences from popular and public repositories to avoid accidentally exposing private information. We still found some instances of usernames, emails, and API keys in our data, but these are easily findable using search engines and are part of popular and well-indexed public repositories. We believe that the benefits outweigh the risks, and we have decided to share our datasets.

CONCLUSION
To conclude, we presented an extensive study on memorisation in LLMs for code. We formally defined a data extraction security game grounded in the existing notion of k-extractability and membership inference attacks. We utilised this game to create a dataset to measure memorisation in LLMs for code. We compared the rate of memorisation between models of code and natural language, compared the rate and type of memorisation between different models, and investigated the rate of memorisation of pre-training data in LLMs for code.
We found that LLMs for code memorise their training data like their natural language counterparts, albeit at a lower rate. We further found that the rate of memorisation increases as a model grows and that different model architectures memorise distinct sets of samples, while smaller versions of the same family tend to memorise a smaller subset of their larger sibling. We found that data carriers and licence information are being memorised at a higher rate than code, documentation, and tests. Finally, we found that the pre-training data is still vulnerable to extraction even after multiple tuning rounds.
Our work is a first step and provides a framework to measure memorisation in LLMs for code. We strongly advise the research community to conduct a more comprehensive investigation into the extent of data leakage and employ a diverse range of models and extraction techniques to develop safeguards that can effectively mitigate this issue. The consequences of data leakage can be severe, so it is crucial to take proactive measures to address this problem.

(1) The challenger C samples a training dataset $D \leftarrow \mathcal{D}$ and trains a model $f_\theta \leftarrow \mathcal{T}(D)$ on the dataset $D$.
(2) C samples a sample $x = (p, s)$ where $x \in D$. The prefix $p$ is provided to the adversary A.
(3) A is allowed query access to the model $f_\theta$ and may perform any other polynomial-time operations.
(4) A outputs his prediction sequence $\hat{s}$.
(5) If $\hat{s} = s$, A wins; otherwise C wins.
In other words, given a prefix (1), the adversary is challenged to extract the correct suffix in the training data from the model. The adversary can query the model (

Figure 1: Parameter size and exact match rate for natural language models

Figure 3: BLEU-4 score and exact match rate for code models

Figure 4: BLEU-4 score and exact match rate for natural language models

Figure 10: Parameter size and exact match rate for pre-trained models

Table 1: Natural language (top 4 rows) and code models under investigation

RQ1: How does the rate of memorisation compare between Natural Language and Code trained LLMs?

Table 2: Categories of memorised samples

Table 3: Code attack performance on Large Language Models for Code

Table 4: Natural language attack performance on natural language models