Hotfixing Large Language Models for Code: How Far Can Parameter-Efficient Fine-Tuning Go? (2024)

Zhou Yang, Singapore Management University, Singapore (zyang@smu.edu.sg) and David Lo, Singapore Management University, Singapore (davidlo@smu.edu.sg)

Abstract.

Large Language Models for Code (LLM4Code) have become an integral part of developers' workflows, assisting with tasks such as code completion and generation. However, these models are found to exhibit undesired behaviors after their release, such as generating buggy code, because they are trained on vast amounts of source code that contains such buggy code. The training data (usually coming from open-source software) keeps evolving, e.g., developers fix the buggy code. However, adapting to such evolution to mitigate LLM4Code's undesired behaviors is non-trivial, as retraining models on the updated dataset usually takes much time and many resources. This motivates us to propose the concept of hotfixing LLM4Code: mitigating LLM4Code's undesired behaviors effectively and efficiently, with minimal negative effects.

This paper mainly focuses on hotfixing LLM4Code to make them generate less buggy code and more fixed code. We begin by demonstrating that models from the popular CodeGen family frequently generate buggy code. We then define three learning objectives for hotfixing and design multiple loss functions for each: (1) learning the desired behaviors, (2) unlearning the undesired behaviors, and (3) retaining knowledge of other code. We evaluate four different fine-tuning techniques for hotfixing the models and gain the following insights. Optimizing these three learning objectives together, using LoRA (low-rank adaptation), effectively influences the model's behavior: it increases the generation of fixed code by up to 108.42% and decreases the generation of buggy code by up to 50.47%. Statistical tests confirm that hotfixing does not significantly affect the models' functional correctness on the HumanEval benchmark. We also show that hotfixing demonstrates strong time efficiency.

Parameter-Efficient Fine-Tuning, Code Generation, Model Updating, Privacy Leakage, AI Model Maintenance

1. Introduction

Large Language Models for Code (LLM4Code) (Fan et al., 2023; Hou et al., 2023; Zheng et al., 2023) have demonstrated outstanding performance on tasks that span a variety of activities in software engineering, including requirement gathering (Bencheikh and Höglund, 2023), architecture design (Ahmad et al., 2023), code generation (Nijkamp et al., 2023), software testing (Xia and Zhang, 2023), defect prediction (Zhou et al., 2023), Stack Overflow post analysis (He et al., 2022), etc. Many tools that can complete and generate code (e.g., GitHub Copilot (git, [n. d.]) and AWS CodeWhisperer (cod, [n. d.])) have been integrated into widely used IDEs such as Visual Studio Code to assist developers in their daily tasks.

Despite their strong performance, LLM4Code are often found to exhibit undesired behaviors. For example, Jesse et al. (Jesse et al., 2023) find that LLM4Code can memorize and produce many known simple, stupid bugs (i.e., bugs that can be fixed with single-line changes, also known as SStuBs (Karampatsis and Sutton, 2020)). As another example, researchers have found that LLM4Code can generate a large amount of privacy-sensitive information (Yang et al., 2024b), like email addresses and API keys. Considering the large population of LLM4Code users (reports show that GitHub Copilot has already been used by over 1.2 million developers), there is a key requirement to update the model as soon as possible to minimize the potential negative impact of these undesired behaviors on more users. One intuitive way is to retrain the model on an updated dataset, e.g., by adding the fixed code or replacing the sensitive information with placeholders in the training data; this, however, is extremely time-consuming and computationally expensive.

This practical requirement motivates us to explore the concept of "hotfixing" LLM4Code. A hotfix, in the context of traditional software engineering, is an unplanned improvement that quickly remediates the unwanted symptoms of a critical issue (Hanna et al., 2024). A hotfix emphasizes less the correctness under all conditions and more the time required to produce a plausible patch that hides the critical symptom without breaking the system. In the context of LLM4Code, thoroughly retraining the model can be quite time-consuming, so we hope to first hotfix the model to mitigate the undesired behaviors and thus leave time for a complete fix (retraining). A good hotfix for LLM4Code should satisfy the following properties: (1) efficiency: obtaining the hotfix should be significantly faster than retraining the model; (2) effectiveness: the hotfix should effectively mitigate the behavior while causing minimal impact on functional correctness; and (3) easy deployment: the hotfix should minimize disruption and complexity during deployment.

He and Vechev (He and Vechev, 2023) propose the pioneering study on updating LLM4Code to reduce the generation of insecure code. They define three targets for such updates: the updated model should (1) produce more secure code, (2) produce less insecure code, and (3) remain the same for other code. To achieve these targets, they design a combination of three loss functions to train contrastive prefixes that update the models. This paper extends their study by implementing hotfixing to mitigate another important type of undesired behavior: LLM4Code can complete many known buggy code snippets, and we aim to make LLM4Code complete the corresponding fixed code instead. First, we replicate the experiments of Jesse et al. (Jesse et al., 2023), confirming that models from a popular family (CodeGen) can memorize and produce many bugs, which we treat as undesired behaviors to be mitigated. These known bugs have been fixed in open-source software; we therefore hotfix the models and steer them to generate the corresponding fixed versions. To achieve this goal, we construct a dataset consisting of pairs of code snippets: one containing a bug and the other with the corresponding bug fixed. We then compute the diff of each pair to obtain the deleted buggy code and the added fixed code.

We experiment with a set of parameter-efficient fine-tuning (PEFT) methods to hotfix the models; these methods update the model with a small number of parameters, resulting in efficiency gains and easier deployment. We investigate a family of popular LLM4Code, CodeGen (Nijkamp et al., 2023), which is widely adopted in multiple studies (Jesse et al., 2023; Yang et al., 2024b; Li et al., 2023; Allal et al., 2023). By conducting experiments on models of two sizes (CodeGen-350M and CodeGen-2B), we gain the following insights about hotfixing LLM4Code. The optimal hotfixing setting (combining all three objectives and using LoRA for fine-tuning) can increase the generation of fixed code by up to 108.42% and decrease the generation of buggy code by up to 50.47%, which is more effective than the mitigation strategy adopted by Jesse et al. (Jesse et al., 2023). Although hotfixing reduces the models' functional correctness on the HumanEval benchmark, statistical tests show that the difference is not statistically significant ($p$-values are greater than 0.05).

We summarize our contributions as follows:

  • Task: We propose the novel concept of hotfixing LLM4Code, aiming to mitigate LLM4Code's undesired behaviors effectively and efficiently. This is a new task in LLM4Code maintenance.

  • Effectiveness: We implement a hotfixing strategy by extending the work of He and Vechev (He and Vechev, 2023) with more PEFT methods on a new task, showing that it can effectively mitigate undesired behaviors with no significant impact on functional performance.

  • Efficiency: Hotfixing LLM4Code is efficient; it takes 5 minutes to hotfix CodeGen-350M on a single GPU.

  • Application: We discuss the potential application of hotfixing in more scenarios, like deprecated API updates and customized code generation, as well as deployment in practice.

Paper Structure. Section 2 provides the background of this study. We explain our methodology in Section 3, including motivating scenarios and learning objectives. Section 4 documents the experiment settings, and we present our results in Section 5. We then conduct a case study on reducing privacy leakage. After presenting additional discussion in Section 6, we review related work in Section 7. Section 8 concludes the paper and outlines future studies.

2. Background

This section explains the background of this study, describing LLM4Code for code generation and the parameter-efficient fine-tuning techniques used to update the models.

2.1. LLM-based Code Generators

According to a recent survey of LLM4Code (Hou et al., 2023), code generation models are largely decoder-only models (Nijkamp et al., 2023; Fried et al., 2023; Li et al., 2023), which are typically based on the Transformer architecture (Vaswani et al., 2017). These models are pre-trained on a large corpus of data in an unsupervised manner to learn a probability distribution that predicts the likelihood of the next token given the context (also known as the prompt). Formally, a model takes in a sequence of tokens $x = (x_1, x_2, \cdots, x_t)$. It then produces a probability distribution $P(x_{t+1} \mid x_1, x_2, \cdots, x_t)$, which estimates the likelihood of the next token in the sequence being $x_{t+1}$. The computation of this distribution can be factorized into two steps: (1) a function $f_h$ that computes the hidden states of the input sequence and (2) a function $f_p$ that computes the probability distribution of the next token based on the hidden states.

(1) $h_t = f_h(x_t, h_{<t})$
    $P(\cdot \mid x) = f_p(h_t)$

In the above equation, $h_{<t}$ denotes the hidden states of the input sequence before $x_t$ (we use $<t$ to denote the positions before $t$). The function $f_h$ computes the current hidden state $h_t$ from the previous hidden states $h_{<t}$ and the current token $x_t$. The function $f_p$ computes the probability distribution of the next token from the current hidden state $h_t$. In other words, the input is converted into a sequence of hidden states that condition the model when generating the next token.

The models produce subsequent tokens in an autoregressive manner. The model first predicts the probability distribution of the next token and then selects the token with the highest probability as the next token $x_{t+1}$. The new token is then fed into the model to predict $x_{t+2}$. The process is repeated until the model predicts a special token (e.g., the end-of-sequence token) that indicates the end of the sequence or until other stopping criteria (e.g., the maximum number of generated tokens) are met.
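To make the autoregressive process concrete, the following minimal sketch performs greedy decoding with a HuggingFace causal language model; the checkpoint name and the 32-token generation budget are illustrative choices, not settings prescribed by the paper.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative checkpoint; the paper studies CodeGen-350M-multi and CodeGen-2B-multi.
    tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-multi")
    model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi")

    prompt = "def add(a, b):"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Greedy autoregressive decoding: repeatedly pick the most likely next token
    # and feed it back until a stopping criterion (here, a fixed length) is met.
    with torch.no_grad():
        for _ in range(32):
            logits = model(input_ids).logits              # (1, seq_len, vocab_size)
            next_id = logits[0, -1].argmax()              # token with the highest probability
            input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break

    print(tokenizer.decode(input_ids[0]))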

2.2. Parameter-Efficient Fine-Tuning

To make the hotfix efficient and easy to deploy, this paper employs parameter-efficient fine-tuning (PEFT) techniques to hotfix LLM4Code. PEFT methods train LLM4Code by updating only a small number of parameters rather than the entire model, thus reducing the computational cost and time required for training. Here, the updated parameters constitute the hotfix that mitigates the model's undesired behaviors. One representative PEFT method is Low-Rank Adaptation (LoRA) (Hu et al., 2022a). LoRA freezes the model weights and adds low-rank trainable matrices to the attention layers of Transformer models. By updating only the newly added matrices, LoRA significantly reduces the number of trainable parameters. IA3 (Liu et al., 2022) aims to improve on LoRA and further reduces the number of trainable parameters. Another PEFT method is prefix-tuning (Li and Liang, 2021), which trains a set of virtual tokens prepended to the input tokens of the LLM. Dettmers et al. (Dettmers et al., 2023) combine LoRA with quantization (QLoRA), which reduces GPU memory consumption when training the LoRA matrices.

Following a recent study (Weyssow et al., 2024) that evaluates PEFT methods on LLM4Code, we choose LoRA, IA3, prefix-tuning, and QLoRA as the PEFT methods to hotfix LLM4Code in this paper.
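As a rough illustration of the mechanism behind LoRA (not the official peft implementation), the sketch below augments a frozen linear layer with a trainable low-rank update B·A scaled by alpha/r; only A and B receive gradients.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Minimal sketch of a LoRA-augmented linear layer (illustrative only)."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():          # freeze the pre-trained weights
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))
            self.scaling = alpha / r

        def forward(self, x):
            # Frozen path plus the trainable low-rank update B @ A.
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

    layer = LoRALinear(nn.Linear(768, 768))
    out = layer(torch.randn(2, 768))                  # only A and B are trainable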

3. Methodology

In this section, we begin by describing a motivating scenario in which LLM4Code produce undesired code. We discuss how we collect the dataset and evaluate such behaviors. We then explain the learning objectives and the methods used to update the models.

[Figure 1. Overview of the hotfixing scenario: the model is originally trained on data that contains buggy code and other code; hotfixing steers it toward the fixed code without retraining on the full dataset.]

3.1. Motivating Scenario

Researchers have found that open-source repositories contain much buggy code, which is then collected into LLM4Code training data, and the models can learn from this data to generate buggy code. For example, Jesse et al. (Jesse et al., 2023) investigate a specific category of bugs, namely simple, stupid bugs (SStuBs) (Karampatsis and Sutton, 2020), and find that code generators (like CodeGen (Nijkamp et al., 2023)) can generate known SStuBs at a high rate, around twice as often as they produce the known correct, bug-fixing code. Listing 1 shows an example of buggy code generated by CodeGen-350M.

The code before the buggy line in Listing 1 is the prompt sent to the CodeGen-350M model, and the line marked with '-' (i.e., stacktrace.indexOf(':')) is the generated code, which is buggy (the example comes from https://github.com/ACRA/acra). Although the bug has been fixed in the original repository (the fixed code is firstLine.indexOf(':')), the model still generates the buggy version. As shown in the upper half of Figure 1, a model is trained on a dataset containing buggy code. To alleviate this issue of memorizing known bugs, one option is to replace the buggy code with the fixed code in the training data and retrain the model, which can be considered a 'complete' fix. However, retraining is computationally expensive and time-consuming: Xu et al. (Xu et al., 2022) report that it takes 6 weeks to train the 2.7B-parameter PolyCoder using 8 Nvidia RTX 8000 GPUs. Besides, the 'other code' part in Figure 1, i.e., the remaining part of the original training data, may not be available; for example, the training data of the two models investigated in this paper is not publicly available.

In this paper, we learn the 'hotfix' by guiding the model to learn the differences between the buggy code and the fixed code. On the one hand, we only need a small amount of changed code (explained in Section 3.2) to guide the model toward different objectives (explained in Section 3.3). On the other hand, we explore different parameter-efficient fine-tuning methods to update the model (explained in Section 3.4).

Listing 1. An example SStuB: the line marked with '-' is the buggy statement generated by CodeGen-350M, and the line marked with '+' is the corresponding fix.

    ...
    ReportMetadata(@NonNull CrashReportData crashReportData)
            throws JSONException {
        final String stacktrace = crashReportData.getString(
            ReportField.STACK_TRACE);
        put(KEY_STACK_TRACE, stacktrace);
        final int index = stacktrace.indexOf('\n');
        final String firstLine = index == -1 ? stacktrace :
            stacktrace.substring(0, index);
    -   final int index2 = stacktrace.indexOf(':')   // buggy code generated by the model
    +   final int index2 = firstLine.indexOf(':')    // corresponding fixed code
        jsonArrayToList(@Nullable JSONArray array) {
            final List<String> list = new ArrayList<>();
            ...

3.2. Data Collection

Generally, the training data for hotfixing should include three types of code: (1) the desired code that we want the model to generate, (2) the undesired code that we want the model to avoid, and (3) the code that is not relevant to either. Such information can be retrieved from the diff between two versions of a code snippet: one version contains the code we do not want the model to generate (e.g., buggy code), and the other contains the code we want the model to generate (e.g., the corresponding fixed code). Computing the diff identifies what code should change and what code should remain the same.
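A minimal way to recover the deleted (buggy), added (fixed), and unchanged lines from a bug-fix pair is a line-level diff; the sketch below uses Python's difflib and is one possible realization, not necessarily the exact tooling used in the paper.

    import difflib

    def split_diff(buggy_code: str, fixed_code: str):
        """Return (deleted_lines, added_lines, unchanged_lines) of a bug-fix pair."""
        deleted, added, unchanged = [], [], []
        for line in difflib.ndiff(buggy_code.splitlines(), fixed_code.splitlines()):
            if line.startswith("- "):
                deleted.append(line[2:])      # undesired (buggy) code
            elif line.startswith("+ "):
                added.append(line[2:])        # desired (fixed) code
            elif line.startswith("  "):
                unchanged.append(line[2:])    # other code
        return deleted, added, unchanged

    buggy = "final int index2 = stacktrace.indexOf(':');"
    fixed = "final int index2 = firstLine.indexOf(':');"
    print(split_diff(buggy, fixed))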

We reuse the SStuBs dataset provided by Jesse et al. (Jesse et al., 2023), also known as ManySStuBs4J (Karampatsis and Sutton, 2020), which comes from the 2021 MSR mining challenge. An example from the dataset is shown in Listing 1: the line marked with '-' is the buggy code and the line marked with '+' is the corresponding fix. The dataset consists of two parts: a small dataset and a large dataset. We choose the large dataset, which includes 63,923 single-statement bug-fix changes, for our experiment. After removing duplicates that share the exact same prefix (i.e., the content before the bug location), bug, and fix, Jesse et al. conduct their evaluation on 16,899 examples. We reuse their filtered dataset and split it into training, validation, and test sets with a ratio of 8:1:1. The training set is used to hotfix the models, and the test set is used to evaluate the number of generated fixed and buggy code snippets before and after hotfixing.

In Jesse et al.'s experiment, the content before the buggy code is sent to a code generator as the prompt. Given a prompt, the model output falls into three possible cases: (1) it contains the buggy code, (2) it contains the fixed code, or (3) it matches neither. By sending many prompts to a model, Jesse et al. count the numbers of cases (1) and (2) and compute their ratio to show that models tend to generate buggy code. As these models are non-deterministic, the same prompt may lead to different outputs; we therefore let the model generate ten outputs for each prompt and count the numbers of cases (1) and (2).
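The counting protocol can be sketched as follows; generate_completion is a hypothetical helper that wraps the model's sampling call, and the string-matching criterion mirrors the evaluation described above.

    def count_outcomes(examples, generate_completion, n_samples=10):
        """examples: list of dicts with 'prompt', 'buggy', and 'fixed' strings."""
        n_buggy = n_fixed = 0
        for ex in examples:
            for _ in range(n_samples):                    # models are non-deterministic
                completion = generate_completion(ex["prompt"])
                if ex["buggy"] in completion:
                    n_buggy += 1                          # case (1): known bug reproduced
                elif ex["fixed"] in completion:
                    n_fixed += 1                          # case (2): known fix generated
                # case (3): neither matches; not counted
        return n_buggy, n_fixed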

3.3. Learning Objectives

Recall how LLM4Code generate outputs: given a sequence of tokens as input, the model generates a distribution over the next token. Given a training example with input $x$ and target output $y$, the model computes the probability of outputting $y$ given $x$, denoted by $P(y \mid x)$. The model learns this training example by maximizing this probability. In practice, this process is carried out by minimizing the negative log-likelihood (NLL) loss function, which is defined as:

(2) $L_{vanilla} = -\frac{1}{N}\sum_{i=1}^{N} \log P(y_i \mid x_i)$

where $N$ is the number of training samples, $x_i$ is the input of the $i$-th sample, and $y_i$ is the corresponding target output. This loss function is commonly used to tune LLM4Code (Weyssow et al., 2024). By minimizing it, the model adjusts its parameters to increase the probabilities of the target tokens, improving its ability to predict the next token in similar contexts. We call this the Vanilla loss function.

However, in the context of hotfixing LLM4Code, directly training models on the entire updated examples may not be appropriate, as highlighted by He and Vechev (He and Vechev, 2023). In hotfixing, not every token in the updated example is equally important: the model should pay more attention to the added fixed code. Moreover, the model should also learn not to generate the original buggy code, and it should cause minimal impact on outputs that are not relevant to the bug-fixing updates. We explain the three objectives and the corresponding loss function designs as follows.

3.3.1. Learning to Generate the Desired Code

In Listing 1, the line marked with '+' (the added fixed code) is the output that we want the model to generate. To ensure that such learning happens only on the desired code, we compute the loss on the desired part only, adapting Equation (2):

(3) $L_{guided} = -\frac{1}{N}\sum_t w_t^{+} \log P(x_t \mid x_{<t})$

In the above equation, $w_t^{+}$ is the weight of token $x_t$ in the final loss, where $w_t^{+}=1$ if $x_t$ belongs to the desired code and $w_t^{+}=0$ otherwise. We call this setting the Guided loss.
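A possible PyTorch realization of this token-weighted loss is sketched below; weights is 1 on tokens of the added fixed code and 0 elsewhere, and the normalization over the selected tokens is our own choice rather than a detail specified in the paper.

    import torch
    import torch.nn.functional as F

    def weighted_nll(logits, target_ids, weights):
        """logits: (seq_len, vocab); target_ids: (seq_len,); weights: (seq_len,) floats.

        Computes -sum_t w_t * log P(x_t | x_<t) / sum_t w_t, so only the tokens
        selected by the weight mask (e.g., the added fixed code) contribute.
        logits are assumed to be already shifted so that logits[t] predicts target_ids[t]."""
        log_probs = F.log_softmax(logits, dim=-1)
        token_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
        return -(weights * token_ll).sum() / weights.sum().clamp(min=1)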

3.3.2. Learning to Avoid Undesired Code

In Listing 1, the line marked with '-' (i.e., the buggy code) is the output that we want the model to avoid. Similarly, the log-likelihood loss on the undesired code can be computed as follows:

(4) $L_{unlearn} = -\frac{1}{N}\sum_t w_t^{-} \log P(x_t \mid x_{<t})$

In the above equation, $w_t^{-}$ is the weight of token $x_t$ in the loss, where $w_t^{-}=1$ if $x_t$ belongs to the undesired code and $w_t^{-}=0$ otherwise. Unlike for the desired code, we want to increase the model's loss on the undesired code so that it is not generated. However, simply maximizing this loss is ineffective: our experiment shows that doing so makes the model lose its ability to generate meaningful code (the hotfixed models generate zero buggy code and zero fixed code), which is not desired in hotfixing. Instead of directly maximizing this loss, He and Vechev (He and Vechev, 2023) propose the loss $\frac{L_{guided}}{L_{guided}+L_{unlearn}}$, which guides the optimizer to balance minimizing the loss on the desired code $L_{guided}$ against maximizing the loss on the undesired code $L_{unlearn}$. Ideally, to minimize this term, the optimizer reduces $L_{guided}$, which encourages fixed code generation, and increases $L_{unlearn}$, which discourages the generation of buggy code. To make the model learn the desired code and avoid the undesired code, we define the Dual loss as follows:

(5) $L_{dual} = \left(L_{guided} + \frac{L_{guided}}{L_{guided}+L_{unlearn}}\right)/2$
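Building on the weighted_nll sketch above, the Dual objective of Equation (5) can be sketched as follows; here the guided term is computed on the fixed version of the example (masking the added tokens) and the unlearn term on the buggy version (masking the deleted tokens), and the small epsilon is an assumption added for numerical stability.

    def dual_loss(fixed_logits, fixed_ids, w_plus, buggy_logits, buggy_ids, w_minus):
        """Equation (5), sketched: average the guided loss with the ratio term that
        trades off minimizing L_guided against maximizing L_unlearn."""
        l_guided = weighted_nll(fixed_logits, fixed_ids, w_plus)    # loss on added fixed code
        l_unlearn = weighted_nll(buggy_logits, buggy_ids, w_minus)  # loss on deleted buggy code
        ratio = l_guided / (l_guided + l_unlearn + 1e-8)            # epsilon avoids division by zero
        return (l_guided + ratio) / 2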

3.3.3. Learning to Retain Knowledge of Other Code

The third objective is to let the model preserve its knowledge of the code that is relevant to neither the desired nor the undesired code. Although the loss functions designed above do not touch this other code, the model may forget its knowledge of it during training, which may introduce other undesired behaviors (e.g., harming the correctness of generated code). As an analogy, a commit that fixes one bug may introduce another; we want to avoid such a situation when hotfixing.

The fact that tuning language models on a new dataset can degrade performance on other aspects has been observed in previous studies. For example, Liu et al. (Liu et al., 2016) point out that optimizing models on a new dataset can improve average performance (e.g., BLEU score) but forgo coherence and fluency. This degeneration is often diagnosed as an effect of deviating too much from the original pre-trained model during optimization. Consequently, we want the model to behave similarly to the original model on code that is not relevant to the undesired behavior. This is similar to the goal of knowledge distillation (Tang et al., 2024; Shi et al., 2024), which trains a student model to mimic the behavior of a teacher model. In knowledge distillation, the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) is widely used to measure the behavioral difference between the two models, and the student model learns to minimize this difference (Shi et al., 2023; Kim and Rush, 2016; Kim et al., 2024). He and Vechev also use it to mitigate the potential decrease in functional correctness. The KL divergence loss is defined as follows:

(6) $L_{KL} = \sum_t D_{KL}\big(P(x_t \mid x_{<t}) \,\|\, P_0(x_t \mid x_{<t})\big)$

In the above equation, $P_0$ is the probability distribution produced by the original model, and $D_{KL}$ denotes the KL divergence between the hotfixed model's token distribution and the original model's token distribution. KL divergence measures how one probability distribution diverges from a second, expected probability distribution; a small KL divergence indicates that the two distributions are similar. In our experiment, we also add $L_{KL}$ to each of the loss functions designed above, denoted by Vanilla+KL, Guided+KL, and Dual+KL. For example, the loss function of Guided+KL is $(L_{guided} + L_{KL})/2$ and the loss function of Dual+KL is $(L_{guided} + \frac{L_{guided}}{L_{guided}+L_{unlearn}} + L_{KL})/3$.
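Assuming a frozen copy of the original model is kept alongside the hotfixed one, the KL retention term of Equation (6) can be sketched as below; the choice of sequences on which the divergence is computed and the reduction are our assumptions.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def original_log_probs(original_model, input_ids):
        return F.log_softmax(original_model(input_ids).logits, dim=-1)

    def kl_retention_loss(hotfixed_model, original_model, input_ids):
        """Equation (6): sum_t KL(P(. | x_<t) || P_0(. | x_<t)) on the given sequence,
        where P is the hotfixed model and P_0 the frozen original model."""
        log_p = F.log_softmax(hotfixed_model(input_ids).logits, dim=-1)   # hotfixed model P
        log_p0 = original_log_probs(original_model, input_ids)            # original model P_0
        # F.kl_div(input, target, log_target=True) computes KL(target || exp(input)),
        # so passing log_p0 as input and log_p as target yields KL(P || P_0).
        return F.kl_div(log_p0, log_p, log_target=True, reduction="sum")

    # Dual+KL then averages three terms, e.g., (l_guided + ratio + l_kl) / 3.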

3.4. Fine-tuning LLM4Code

Another important requirement for hotfixing LLM4Code is to make the process efficient and easy to deploy. Considering this efficiency requirement, we employ various parameter-efficient fine-tuning (PEFT) techniques to update the models, which reduce the number of trainable parameters and thus the computational cost. Following a recent study on evaluating PEFT for LLM4Code (Weyssow et al., 2024), we employ four PEFT techniques: LoRA (Hu et al., 2022b), IA3 (Liu et al., 2022), prefix-tuning (Li and Liang, 2021), and QLoRA (Dettmers et al., 2023). For QLoRA, we quantize the model to 8-bit and 4-bit floating-point data types. In the remainder of the paper, we use the acronyms FT, LoRA, IA3, Prefix, QLoRA-8bit, and QLoRA-4bit to denote traditional fine-tuning, LoRA, IA3, prefix-tuning, QLoRA with 8-bit quantization, and QLoRA with 4-bit quantization, respectively. Note that the hotfix takes different forms depending on the PEFT method used; for example, the hotfix obtained with prefix-tuning is a set of virtual tokens added to the input tokens of the LLM.
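With the HuggingFace peft library, on which the replication package of Weyssow et al. (Weyssow et al., 2024) builds, the four PEFT configurations can be instantiated roughly as follows; the LoRA rank and alpha are illustrative, while the 20 virtual tokens for prefix-tuning follow Section 4.4.

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import (LoraConfig, IA3Config, PrefixTuningConfig, TaskType,
                      get_peft_model, prepare_model_for_kbit_training)

    base = "Salesforce/codegen-350M-multi"

    # LoRA: low-rank matrices injected into the attention layers (r and alpha illustrative).
    lora_cfg = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16)
    # IA3: learned rescaling vectors, with even fewer trainable parameters than LoRA.
    ia3_cfg = IA3Config(task_type=TaskType.CAUSAL_LM)
    # Prefix-tuning: 20 trainable virtual tokens, as stated in Section 4.4.
    prefix_cfg = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

    model = AutoModelForCausalLM.from_pretrained(base)
    hotfix_model = get_peft_model(model, lora_cfg)    # swap in ia3_cfg / prefix_cfg as needed
    hotfix_model.print_trainable_parameters()

    # QLoRA: load the frozen base model in 8-bit (or 4-bit) and train LoRA on top of it.
    quant_model = AutoModelForCausalLM.from_pretrained(
        base, quantization_config=BitsAndBytesConfig(load_in_8bit=True))
    quant_model = prepare_model_for_kbit_training(quant_model)
    qlora_model = get_peft_model(quant_model, lora_cfg)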

4. Experiment Settings

This section explains the investigated models, baselines, potential side effects brought by hotfixing, and model training details.

4.1. Investigated Models

Considering the popularity and high performance of decoder-only models for code generation (Hou et al., 2023), we focus on decoder-only models in our evaluation. Following the study that exposes the undesired behaviors of code generators (Jesse et al., 2023), we choose CodeGen (Nijkamp et al., 2023) models as our experiment subjects; they are open-source and have been widely used when evaluating code generation models. The CodeGen architecture follows a standard Transformer decoder with left-to-right causal masking. Nijkamp et al. (Nijkamp et al., 2023) train multiple CodeGen models using different configurations and datasets. Although CodeGen-mono achieves the best performance on the HumanEval benchmark, it only supports the generation of Python code. We therefore choose CodeGen-multi, which supports multiple programming languages, as our experiment subject. We use two CodeGen-multi models of different sizes: 350M and 2B parameters (in this paper, CodeGen-350M and CodeGen-2B refer to CodeGen-350M-multi and CodeGen-2B-multi). Considering that we need to train multiple models with different fine-tuning methods, we exclude the larger versions with 6 and 16 billion parameters due to the limits of our computational resources.

4.2. Baseline

To encourage the models to generate fixed code, Jesse et al. (Jesse et al., 2023) add a natural language comment before the bug location to pre-condition the model. They use a CodeTrans model (Elnaggar et al., 2021) fine-tuned on the DeepCom dataset (Hu et al., 2018) to generate comments for each buggy statement and its corresponding fix, which we call BugComment and FixComment, respectively. Their results (Jesse et al., 2023) show that FixComment can condition the model to generate more fixed code; for example, after adding FixComment, CodeGen-350M (Nijkamp et al., 2023) generates 7.58% less buggy code and 13.12% more fixed code. We use this strategy as the baseline against which we compare hotfixing. Note that to generate the comment, the baseline assumes knowledge of the fix every time it mitigates the undesired behavior, which is not always available in practice; in contrast, hotfixing does not require such knowledge at inference time.

4.3. Quantifying Side Effects

4.3.1. Functional Correctness

We use HumanEval (Chen et al., 2021), which is widely used in many studies (Xu et al., 2022; Chen et al., 2021; Li et al., 2023; Allal et al., 2023; Fried et al., 2023), as the benchmark to evaluate the correctness of code generated by the models. It contains 164 Python programming problems; each problem is accompanied by a Python function signature, a docstring, and a test suite. A model generates solutions for each problem, and we execute the test suite to evaluate the correctness of the generated solutions. Following previous studies (Chen et al., 2021; Nijkamp et al., 2023), we allow models to generate 100 solutions for each problem and report $pass@k$ values for each model, which is a widely used metric in code generation tasks (Nijkamp et al., 2023; Fried et al., 2023).
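For reference, pass@k is commonly computed with the unbiased estimator of Chen et al. (2021): with n samples per problem and c of them passing the tests, it estimates 1 - C(n-c, k)/C(n, k). A numerically stable sketch is:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
        where n samples are drawn per problem and c of them pass the tests."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 100 samples for one HumanEval problem, 7 of which pass all tests.
    print(pass_at_k(n=100, c=7, k=10))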

4.3.2. Perplexity

Perplexity is the exponentiated average negative log-likelihood of a sequence of words. For a given sequence of words $w_1, w_2, \ldots, w_N$, the perplexity $PPL$ is defined as:

$PPL = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, w_2, \ldots, w_{i-1})\right)$

Here, $N$ is the total number of words in the sequence, and $P(w_i \mid w_1, \ldots, w_{i-1})$ is the probability assigned by the model to the word $w_i$ given the preceding words $w_1, \ldots, w_{i-1}$. A lower perplexity score indicates that the model's knowledge is more consistent with the data used to compute the perplexity. We aim to evaluate how hotfixing may affect the model's knowledge of its original training data. However, the training data of CodeGen-350M and CodeGen-2B is not publicly available. Given the insight from previous studies that larger models memorize much of their training data (Yang et al., 2024b; Carlini et al., 2021), we sample 10,000 outputs from the CodeGen-16B model, which are expected to be similar to the training data of CodeGen-350M and CodeGen-2B and are thus appropriate for evaluating how hotfixing affects the models' knowledge of the original training data. We measure the average perplexity of CodeGen-350M, CodeGen-2B, and their hotfixed versions on these 10,000 sampled outputs.
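Perplexity on a sampled sequence can be obtained by exponentiating the model's average token-level negative log-likelihood; a minimal sketch with a HuggingFace causal language model is shown below.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model, tokenizer, text: str) -> float:
        """exp of the average negative log-likelihood the model assigns to `text`."""
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # With labels provided, the HuggingFace causal LM returns the mean
            # token-level cross-entropy, i.e., the average negative log-likelihood.
            out = model(**enc, labels=enc["input_ids"])
        return torch.exp(out.loss).item()

    # In the paper, this value is averaged over the 10,000 samples drawn from CodeGen-16B.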

Table 1. Number of buggy code and fixed code snippets generated by the original models, the comment-based baseline, and the models hotfixed with different fine-tuning methods (all three learning objectives). The best result in each row is marked with *.

Type         | Model | Original | Baseline         | LoRA               | IA3               | Prefix            | QLoRA-8bit        | QLoRA-4bit
No. of Bugs  | 350M  | 2,943    | 2,720 (7.58% ↓)  | 1,639 (44.32% ↓)*  | 2,005 (31.85% ↓)  | 1,842 (37.42% ↓)  | 1,724 (41.42% ↓)  | 1,731 (41.18% ↓)
No. of Bugs  | 2B    | 3,393    | 3,027 (10.79% ↓) | 1,753 (48.35% ↓)   | 2,145 (36.80% ↓)  | 2,156 (36.45% ↓)  | 1,712 (49.55% ↓)  | 1,681 (50.47% ↓)*
No. of Fixes | 350M  | 1,829    | 2,069 (13.12% ↑) | 3,811 (108.42% ↑)* | 3,076 (68.14% ↑)  | 1,475 (19.34% ↓)  | 3,804 (107.95% ↑) | 3,787 (107.05% ↑)
No. of Fixes | 2B    | 2,555    | 2,718 (6.38% ↑)  | 5,196 (103.38% ↑)* | 4,410 (72.62% ↑)  | 2,693 (5.40% ↑)   | 5,035 (97.06% ↑)  | 5,152 (101.68% ↑)

4.4. Experiment Settings

We download the CodeGen models from HuggingFace and run them on two NVIDIA A6000 GPUs with 48 GB of memory. We obtain the dataset from Jesse et al.'s replication package (Jesse et al., 2023). We also conduct a case study on mitigating privacy leakage in LLM4Code, for which we download the bigcode/bigcode-pii-dataset from HuggingFace (https://huggingface.co/datasets/bigcode/bigcode-pii-dataset) and obtain the personally identifiable information (PII) detector bigcode/starpii from HuggingFace (https://huggingface.co/bigcode/starpii). Note that neither bigcode-pii-dataset nor starpii is publicly accessible; both require approval from the dataset owner. We extend the open-source repository provided by Weyssow et al. (Weyssow et al., 2024) to implement the different fine-tuning methods with our proposed loss functions. Following their settings, we set the learning rate to 3e-4. For prefix-tuning, the prefix is set to 20 trainable continuous tokens. The Adafactor optimizer (Shazeer and Stern, 2018) with 16-bit floating-point precision is used for training.

5. Results

This section presents three research questions and the corresponding results regarding hotfixing LLM4Code.

RQ1. How do different PEFT methods affect hotfixing results?

RQ2. How do learning objectives affect hotfixing results?

RQ3. What side effects does hotfixing cause to LLM4Code, and how can we mitigate them?

The first question tries different PEFT strategies while keeping the same learning objectives. The second question uses the same fine-tuning method (LoRA) and tries different learning objectives to understand how they affect the hotfixing results. We also examine what side effects hotfixing causes and how to mitigate them.

5.1. RQ1. How do different PEFT methods affect hotfixing results?

In this research question, we aim to understand which PEFT methods perform better in hotfixing LLM4Code. Performance here has two aspects: (1) effectiveness: to what extent the hotfix makes the model generate less buggy code and more fixed code; and (2) efficiency: how much computational resource is required to hotfix the model.

Table 1 shows the results for this research question. First, the third column ('Original') shows the number of buggy and fixed code snippets generated by the original CodeGen-350M and CodeGen-2B models. We observe that both models tend to generate more buggy code than fixed code, confirming the findings of Jesse et al.'s study (Jesse et al., 2023). We also observe that the larger model generates more buggy code and more fixed code than the smaller model, indicating that larger models have a stronger capacity to memorize training data (Yang et al., 2024b). We then use the mitigation method proposed in (Jesse et al., 2023) as a baseline and find that adding comments to the model prompt mitigates the issue only to a limited extent: it reduces buggy code by just 7.58% and 10.79%, and increases fixed code by 13.12% and 6.38%, for CodeGen-350M and CodeGen-2B, respectively.

Table 2. Hotfixing results with LoRA under different loss settings. Left: without KL divergence (percentages are changes relative to the original models). Right: with KL divergence (percentages are changes relative to the corresponding setting without KL); the last column reports the average change introduced by the KL term.

Type         | Model | Vanilla          | Guided            | Dual              | Vanilla+KL       | Guided+KL        | Dual+KL          | Changes by KL
No. of Bugs  | 350M  | 2,608 (11.38% ↓) | 1,830 (37.81% ↓)  | 1,731 (41.18% ↓)  | 2,687 (3.03% ↑)  | 1,927 (5.30% ↑)  | 1,764 (1.91% ↑)  | 3.41% ↑
No. of Bugs  | 2B    | 3,098 (8.69% ↓)  | 1,972 (41.88% ↓)  | 1,753 (48.33% ↓)  | 3,119 (0.68% ↑)  | 2,016 (2.23% ↑)  | 1,723 (0.17% ↓)  | 2.74% ↑
No. of Fixes | 350M  | 3,016 (64.89% ↑) | 3,532 (93.11% ↑)  | 3,787 (107.05% ↑) | 2,939 (2.55% ↓)  | 3,534 (0.01% ↑)  | 3,712 (1.98% ↓)  | 1.51% ↓
No. of Fixes | 2B    | 3,782 (48.02% ↑) | 5,133 (100.90% ↑) | 5,165 (102.15% ↑) | 3,711 (1.88% ↓)  | 5,057 (1.48% ↓)  | 5,095 (1.36% ↓)  | 1.57% ↓

We then evaluate the performance of hotfixing under multiple settings. In this RQ, we use the combined loss function (all three objectives) to fine-tune the models so that they learn to avoid generating the undesired code, to generate the desired code, and to remain unchanged on other code. We find that this loss function reduces buggy code by 44.32% (CodeGen-350M) and 50.47% (CodeGen-2B) and increases fixed code by 108.42% (CodeGen-350M) and 103.38% (CodeGen-2B), demonstrating the effectiveness of hotfixing. We then compare the different PEFT methods. In Table 1, we mark a cell with * if the corresponding PEFT method achieves the best performance in that row. LoRA accounts for three of these cells, and the remaining one belongs to QLoRA-4bit, indicating that LoRA-based methods are more effective in hotfixing LLM4Code. In contrast, IA3 and prefix-tuning are less effective; for example, prefix-tuning leads to only 5.40% more fixed code for CodeGen-2B, far less than the 103.38% increase achieved by LoRA.

We also analyze the efficiency of the different PEFT methods. We set the batch size to 1 for each method and record the average time required to conduct one epoch of training on the same A6000 GPU, repeating the experiment 5 times and reporting the average. On CodeGen-350M, LoRA and prefix-tuning show the best time efficiency, taking 5.6 and 5.4 minutes per epoch, respectively. IA3 takes 6.6 minutes, faster than QLoRA-4bit and QLoRA-8bit, which take 7 and 9.6 minutes per epoch, respectively.

5.2. RQ2. How do learning objectives affect hotfixing results?

Recall the objectives when hotfixing LLM4Code: (1) avoid generating the undesired code, (2) generate the desired code, and (3) remain the same on other code. In RQ1, we consider all three objectives and train models with different PEFT methods, identifying LoRA as the best-performing training method for hotfixing. In RQ2, we use LoRA to train the models and analyze how each learning objective affects the hotfixing results.

In Section 3.3, we define three loss settings, named Vanilla, Guided, and Dual. In the Vanilla setting, we minimize the loss on the entire example after the fix, i.e., the setting commonly used in fine-tuning LLM4Code (Weyssow et al., 2024), defined in Equation (2). In the Guided setting, we guide the model to minimize the loss only on the added fixed code, as defined in Equation (3). In the Dual setting, we additionally penalize the loss on the deleted code on top of the Guided setting, as defined in Equation (5). The results of each loss setting are shown on the left side of Table 2. All three settings mitigate the undesired behaviors of LLM4Code. The Vanilla setting reduces buggy code by 11.38% and 8.69% and increases fixed code by 64.89% and 48.02% for CodeGen-350M and CodeGen-2B, respectively. By guiding the model to learn specifically on the added fixed code, the Guided setting reduces buggy code by 37.81% and 41.88% and increases fixed code by 93.11% and 100.90% for the two models. Additionally penalizing the model on the deleted code in the Dual setting further improves performance: for CodeGen-350M, the Dual setting reduces buggy code by 41.18% and increases fixed code by 107.05%, outperforming the Guided setting (37.81% and 93.11%). The results suggest that guiding the model to learn the differences between the two versions of the code is more effective for hotfixing.

We also investigate the impact of adding the KL divergence term to the loss function. The goal of adding KL divergence is to reduce the potential side effects caused by hotfixing (analyzed in detail in RQ3), but it may also affect the hotfixing results; we show that such impacts are limited. The right half of Table 2 shows the results when KL divergence is added to each loss function, in the columns with the suffix '+KL.' The percentages in these columns are the changes compared to the corresponding results without KL divergence. We find that adding the KL loss does not have a significant impact on the hotfixing results. The last column of Table 2 reports the average changes caused by the KL loss: on average, it increases buggy code by only 3.41% and reduces fixed code by 1.51% for CodeGen-350M, and increases buggy code by 2.74% and reduces fixed code by 1.57% for CodeGen-2B. However, as we show in the next research question, adding the KL loss helps mitigate the undesired side effects caused by hotfixing LLM4Code.

5.3. RQ3: How does hotfixing affect functional correctness and how can we mitigate the undesired side effects?

A previous study (Chen et al., 2018) highlights that hotfixes need to remediate the unwanted symptoms of a critical issue, with less emphasis on full correctness. We evaluate the undesired side effects that hotfixing introduces and how to mitigate them, considering two properties: functional correctness and the model's perplexity, whose definitions are given in Section 4.3. Based on the results of RQ1 and RQ2, we focus on the best hotfixing setting, i.e., using the Dual loss function and LoRA to fine-tune the models. We additionally consider Dual+KL to examine whether adding KL divergence to the loss function helps mitigate the side effects.

To quantify functional correctness, we compute the pass@100 metric on the HumanEval benchmark (Chen et al., 2021). The results are shown in Table 3. We observe that hotfixing reduces the pass@100 value of both models: with the Dual loss function, pass@100 decreases from 16.1 to 14.3 for CodeGen-350M and from 32.9 to 30.4 for CodeGen-2B. However, this side effect can be mitigated by adding KL divergence to the loss function: Dual+KL increases pass@100 from 14.3 to 14.9 for CodeGen-350M and from 30.4 to 31.1 for CodeGen-2B, compared to the results achieved by the Dual loss alone. We also conduct a statistical analysis of the difference in functional correctness between the original models and the models hotfixed with Dual+KL. The HumanEval benchmark has 164 questions, and we compute the pass@100 value for each question and each model. Given the two vectors of pass@100 values (from the original model and the hotfixed model using Dual+KL), we use the Wilcoxon signed-rank test (Wilcoxon, 1945) to compare them. The resulting $p$-value is larger than 0.05, meaning that although hotfixing decreases the functional correctness, the difference is not statistically significant.
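The statistical comparison can be reproduced with SciPy's Wilcoxon signed-rank test on the two per-question pass@100 vectors; the sketch below uses placeholder data in place of the actual measurements.

    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    # Placeholder per-question pass@100 vectors for the 164 HumanEval problems; in the
    # paper these come from the original model and the hotfixed (Dual+KL) model.
    original_pass100 = rng.uniform(0, 1, 164)
    hotfixed_pass100 = np.clip(original_pass100 - rng.uniform(0, 0.05, 164), 0, 1)

    stat, p_value = wilcoxon(original_pass100, hotfixed_pass100)
    print(p_value)   # p > 0.05 would indicate no statistically significant difference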

We then analyze how the models' perplexity changes. A smaller perplexity value indicates that the model is more confident in predicting each token in the sequence. We follow the settings in Section 4.3 to compute the model perplexity and present the results in Table 3. We make similar observations for perplexity: hotfixing LLM4Code increases model perplexity, but adding KL divergence to the loss function helps mitigate this side effect.

Table 3. pass@100 (on HumanEval) and perplexity (PPL) of the original and hotfixed models.

Methods  | CodeGen-350M pass@100 | CodeGen-350M PPL | CodeGen-2B pass@100 | CodeGen-2B PPL
Original | 16.1                  | 29.09            | 32.9                | 2.903
Dual     | 14.3                  | 41.37            | 30.4                | 3.941
Dual+KL  | 14.9                  | 37.19            | 31.1                | 3.831

6. Discussion

We present some discussion in this section, including (1) ethical consideration of the research, (2) potential applications of the hotfix, and (3) threats to validity.

6.1. Ethical Consideration

The dataset for privacy leakage mitigation is obtained from HuggingFace (https://huggingface.co/datasets/bigcode/bigcode-pii-dataset-training). To access the dataset, one must agree to its terms of usage. As explicitly stated by BigCode, the owner of this dataset, users must agree not to share the PII dataset or any modified versions of it for any purpose. To comply with these terms, we do not share the dataset used in our experiments on privacy leakage mitigation. Throughout the research process, we handle data and results in an ethical manner; for example, in Listing 2, we obfuscate identifiers and mask sensitive information so that the contributors of the vulnerable code cannot be identified. Our study aims to use hotfixing to avoid generating undesired code. However, the method could be used in the opposite way, i.e., to generate undesired code, which may cause new security and privacy issues. We leave the mitigation of this risk as future work.

6.2. Potential Applications

In a nutshell, this paper evaluates the potential of mitigating the undesired behaviors of code generation models without heavy retraining. We posit that hotfixing has the potential for broad applicability across numerous contexts.

Tasks. Apart from the two tasks explored in this paper, our proposed method shows promise in facilitating additional model maintenance activities. For instance, frequent updates to software packages and APIs often lead to the issue of API deprecation (Haryono et al., 2020, 2022). A model trained on outdated datasets containing deprecated APIs might not generate code that employs newer APIs. To address this, we can train a hotfix to identify and modify specific areas for adapting to these API changes. Furthermore, our method holds potential for user-specific customization. Different users may have unique requirements, desiring models that align more closely with their specific needs. For example, a company might use a tailored version of a software package rather than a generic, publicly available one. In such cases, users within the company would benefit from a model that generates code compatible with their customized software. By training a hotfix, we can direct the model to prefer the company's customized version over the standard public version.

Deployment. A major advantage of the proposed method is its ease of deployment and distribution. Consider two typical deployment scenarios: (1) cloud-based deployment (e.g., Copilot) and (2) client-side deployment (e.g., CodeGeeX). In the case of cloud deployment, service providers can apply the hotfix by simply adding the LoRA matrices to certain layers of the network. For client-side deployment, it is sufficient to distribute only the LoRA matrices, which are relatively small (13.9 MB for CodeGen-2B), to the client, eliminating the need to download the entire model, which is significantly larger (5.69 GB for CodeGen-2B). This is akin to distributing a software patch rather than the entire software package, greatly simplifying the deployment process of the proposed method.

6.3. Threats to Validity

6.3.1. Threats to Internal Validity

Internal validity refers to the extent to which a study's results are not biased or derived incorrectly. When evaluating the SStuBs-fixing task, we follow the setting used by Jesse et al. (Jesse et al., 2023), which evaluates whether a model generates the fixed code via string matching. However, it is possible that the model generates code that is semantically equivalent to the fixed code but not textually identical. For the privacy leakage mitigation task, we need to detect whether private information appears in the generated code, and existing privacy detection tools are not perfect. To mitigate this threat, we use a state-of-the-art privacy detection tool (bigcode/starpii) and focus only on email addresses, the type of PII that the tool identifies best (with a precision of 97.73%). Another threat is that deep learning models are non-deterministic, which means that the same model may generate different outputs for the same prompt. To mitigate this threat, we run each model 10 times on each prompt in the SStuBs-fixing task and 20,000 times in the privacy leakage mitigation task.

6.3.2. Threats to External Validity

External validity refers to how well the results of this study generalize to other settings; for example, the conclusions may not hold for other models. Our experiments investigate the CodeGen model family, which is widely used and based on the GPT-2 architecture; we leave the evaluation of encoder-decoder models (e.g., CodeT5) as future work. To mitigate the threat that the conclusions may vary across model sizes, we choose two models of different sizes. We mitigate one undesired behavior, namely that LLM4Code complete many known buggy code snippets, aiming to make LLM4Code complete the corresponding fixed code instead. To evaluate the generalizability of the proposed method, we conduct a case study on privacy leakage mitigation. Additionally, to evaluate the functional correctness of LLM4Code after hotfixing, we use the HumanEval benchmark (Chen et al., 2021), a widely used benchmark for evaluating code generation models. However, this benchmark only contains Python programming problems, so the conclusions may not hold for other programming languages; we leave the evaluation of other languages as future work.

7. Related Work

This is an extended replication study of the work by He and Vechev, who use contrastive prefix-tuning (Qian et al., 2022) to control the security level of code generators. We extend their work to a new hotfixing task (i.e., reducing the amount of generated buggy code and increasing the amount of generated fixed code) with more parameter-efficient fine-tuning methods. We discuss related work on other undesired behaviors of LLM4Code and on efficient tuning methods for models of code as follows.

7.1. LLM4Code and Behaviors

Recent advancements in Natural Language Processing (NLP) have been significantly influenced by large language models such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2019; Brown et al., 2020). This inspiration has led to the development of pre-trained models specifically tailored for code-related tasks. Among these, CodeBERT (Feng et al., 2020) stands out, alongside a series of similar models including GraphCodeBERT (Guo et al., 2021) and CuBERT (Kanade et al., 2020). A notable trend in these code models is the adoption of the GPT architecture (Radford et al., 2019; Brown et al., 2020), particularly for generation tasks. An example is CodeGPT, which adapts the GPT-2 architecture with training on the CodeSearchNet corpus (Husain et al., 2019). Moreover, the evolution of these code models has seen the emergence of larger and more powerful versions, such as InCoder (Fried et al., 2023) and CodeGen (Nijkamp et al., 2023), showcasing enhanced performance. These models have found practical applications in real-world scenarios. For instance, GitHub Copilot employs Codex (Chen et al., 2021) to assist in coding tasks. Empirical studies have been conducted to evaluate the effectiveness of these models. For example, Zeng et al. (Zeng et al., 2022) assess a range of models in the context of program understanding and generation tasks.

In addition to their numerous applications, researchers have identified key limitations and undesired behaviors in LLM4Code (Lo, 2023; Yang et al., 2024a). For example, Yang et al. (Yang et al., 2024b) and Al-Kaswan et al. (Al-Kaswan et al., 2023) find that code models can memorize training data, which can lead to privacy leakage. Similarly, researchers find that LLM4Code can generate software credentials such as API keys and passwords (Huang et al., 2023b; Niu et al., 2023). Jesse et al. (Jesse et al., 2023) find that code models tend to generate buggy code. A series of empirical studies show that code generators can produce biased (Huang et al., 2024; Liu et al., 2023) and vulnerable code (Pearce et al., 2022; Sandoval et al., 2022). To the best of our knowledge, we are the first to apply hotfixing to mitigate two of these undesired behaviors: generating buggy code and leaking private information. There are other undesired behaviors in LLM4Code. Huang et al. (Huang et al., 2024) find that LLM4Code exhibit social bias in code generation. Besides, a series of studies have shown that LLM4Code are not robust (Yang et al., 2022), are vulnerable to data poisoning (Ramakrishnan and Albarghouthi, 2022; Yang et al., [n. d.]; Wan et al., 2022; Li et al., 2022), and lack explainability (Liu et al., 2023), which is also undesirable. We leave the mitigation of these undesired behaviors as future work.

7.2. Efficient-tuning for Models of Code

Weyssow et al. (Weyssow et al., 2024) use five parameter-efficient methods to fine-tune models for code generation tasks. We reuse their replication package to implement the first hotfixing approach for mitigating undesired behaviors. Prefix-tuning falls into the category of efficient-tuning methods for pre-trained models, which aim to use computationally friendly techniques to efficiently adapt a model to a new dataset or task. Typical methods include in-context learning, prefix-tuning, and adapter-tuning. Here we briefly review related work on applying these methods to models of code. In-context learning (ICL) presents a small number of examples in the prompt to guide language models to conduct specific tasks. Gao et al. (Gao et al., 2023) evaluate ICL on code summarization, bug fixing, and program synthesis. They show that strategically designed context can lead to substantial improvements over widely used demonstration construction methods. Huang et al. (Huang et al., 2023a) show that ICL can mitigate the social bias of code generated by language models. Wang et al. (Wang et al., 2022) apply prompt-tuning to three code intelligence tasks and show that prompt-tuning consistently outperforms fine-tuning in all three investigated tasks. Wang et al. (Wang et al., 2023) evaluate adapter tuning for code search and summarization, showing that adapter tuning significantly outperforms full-model fine-tuning and effectively overcomes catastrophic forgetting. Choi and Lee (Choi and Lee, 2023) propose CodePrompt, which uses task-agnostic prefix tuning for program and language generation.
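As a simple illustration of how ICL demonstrations are assembled, the sketch below constructs a few-shot bug-fixing prompt; the demonstration pairs are illustrative only and are not taken from the studies discussed above.

    # A minimal, self-contained sketch of assembling an in-context-learning (ICL)
    # prompt for bug fixing; the demonstration pairs are illustrative only.
    DEMONSTRATIONS = [
        ("if (x = 0) { return; }", "if (x == 0) { return; }"),
        ("for (int i = 0; i <= n; i++)", "for (int i = 0; i < n; i++)"),
    ]

    def build_icl_prompt(buggy_snippet):
        parts = []
        for buggy, fixed in DEMONSTRATIONS:
            parts.append(f"// Buggy:\n{buggy}\n// Fixed:\n{fixed}\n")
        parts.append(f"// Buggy:\n{buggy_snippet}\n// Fixed:\n")
        return "\n".join(parts)

    # The resulting prompt is fed to the code model, which is expected to
    # continue it with the fixed version of the final snippet.
    print(build_icl_prompt("while (queue.size() > 0);"))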

8. Conclusion and Future Work

In this study, we explore to what extent parameter-efficient fine-tuning (PEFT) techniques can be used to hotfix LLM4Code. We extend the work by He and Vechev (He and Vechev, 2023) by evaluating four different PEFT techniques on the task of reducing the amount of generated buggy code and increasing the amount of generated fixed code. We conduct our experiments on models from the CodeGen family. The empirical results demonstrate the effectiveness of hotfixing: the best combination leads to a significant increase in the generation of fixed code (up to 108.42%) and a notable reduction in the generation of buggy code (up to 50.47%). We also conduct statistical analysis to confirm that hotfixing does not adversely impact the functional performance of the models on the HumanEval benchmark. We discuss the potential applications of the proposed method in various tasks, as well as its advantages for efficient deployment.

In the future, we plan to extend hotfixing to more undesired behaviors and LLM4Code of various sizes and architectures.

References

  • cod ([n. d.])[n. d.].Ai code generator - amazon CodeWhisperer faqs - AWS.https://aws.amazon.com/codewhisperer/faqs/
  • git ([n. d.])[n. d.].GitHub copilot · your AI pair programmer.https://github.com/features/copilot
  • Ahmad etal. (2023)Aakash Ahmad, Muhammad Waseem, Peng Liang, Mahdi Fahmideh, MstShamima Aktar, and Tommi Mikkonen. 2023.Towards Human-Bot Collaborative Software Architecting with ChatGPT. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering (Oulu, Finland) (EASE ’23). Association for Computing Machinery, New York, NY, USA, 279–285.https://doi.org/10.1145/3593434.3593468
  • Al-Kaswan etal. (2023)Ali Al-Kaswan, Maliheh Izadi, and Arie van Deursen. 2023.Traces of Memorisation in Large Language Models for Code.arXiv:2312.11658[cs.CR]
  • Allal etal. (2023) Loubna Ben Allal, Raymond Li, Denis Kocetkov, et al. 2023. SantaCoder: don't reach for the stars! arXiv:2301.03988 [cs.SE]
  • Bencheikh and Höglund (2023)Leila Bencheikh and Niklas Höglund. 2023.Exploring the Efficacy of ChatGPT in Generating Requirements: An Experimental Study.(2023).
  • Brown etal. (2020) Tom Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901.
  • Carlini etal. (2021)Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2021.Extracting Training Data from Large Language Models. In USENIX Security Symposium.
  • Chen etal. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. CoRR (2021).
  • Chen etal. (2018)Yaohui Chen, Yuping Li, Long Lu, Yueh-Hsun Lin, Hayawardh Vijayakumar, Zhi Wang, and Xinming Ou. 2018.Instaguard: Instantly deployable hot-patches for vulnerable system programs on android. In 2018 Network and Distributed System Security Symposium (NDSS’18).
  • Choi and Lee (2023)YunSeok Choi and Jee-Hyong Lee. 2023.CodePrompt: Task-Agnostic Prefix Tuning for Program and Language Generation. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 5282–5297.https://doi.org/10.18653/v1/2023.findings-acl.325
  • Dettmers etal. (2023)Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023.QLoRA: Efficient Finetuning of Quantized LLMs.
  • Devlin etal. (2019)Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186.https://doi.org/10.18653/v1/N19-1423
  • Elnaggar etal. (2021)Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. 2021.CodeTrans: Towards Cracking the Language of Silicon’s Code Through Self-Supervised Deep Learning and High Performance Computing.arXiv:2104.02443[cs.SE]
  • Fan etal. (2023)Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and JieM. Zhang. 2023.Large Language Models for Software Engineering: Survey and Open Problems.arXiv:2310.03533[cs.SE]
  • Feng etal. (2020)Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020.CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 1536–1547.
  • Fried etal. (2023)Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. 2023.InCoder: A Generative Model for Code Infilling and Synthesis. In The Eleventh International Conference on Learning Representations.https://openreview.net/forum?id=hQwb-lbM6EL
  • Gao etal. (2023)Shuzheng Gao, Xin-Cheng Wen, Cuiyun Gao, Wenxuan Wang, and MichaelR Lyu. 2023.Constructing Effective In-Context Demonstration for Code Intelligence Tasks: An Empirical Study.arXiv preprint arXiv:2304.07575 (2023).
  • Guo etal. (2021) Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
  • Hanna etal. (2024)Carol Hanna, David Clark, Federica Sarro, and Justyna Petke. 2024.Hot Fixing Software: A Comprehensive Review of Terminology, Techniques, and Applications.arXiv preprint arXiv:2401.09275 (2024).
  • Haryono etal. (2020)StefanusA. Haryono, Ferdian Thung, HongJin Kang, Lucas Serrano, Gilles Muller, Julia Lawall, David Lo, and Lingxiao Jiang. 2020.Automatic Android Deprecated-API Usage Update by Learning from Single Updated Example. In Proceedings of the 28th International Conference on Program Comprehension (Seoul, Republic of Korea) (ICPC ’20). Association for Computing Machinery, New York, NY, USA, 401–405.https://doi.org/10.1145/3387904.3389285
  • Haryono etal. (2022)StefanusA Haryono, Ferdian Thung, David Lo, Lingxiao Jiang, Julia Lawall, HongJin Kang, Lucas Serrano, and Gilles Muller. 2022.AndroEvolve: automated Android API update with data flow analysis and variable denormalization.Empirical Software Engineering 27, 3 (2022), 73.
  • He and Vechev (2023)Jingxuan He and Martin Vechev. 2023.Large Language Models for Code: Security Hardening and Adversarial Testing.arXiv:2302.05319[cs.CR]
  • He etal. (2022)Junda He, Bowen Xu, Zhou Yang, DongGyun Han, Chengran Yang, and David Lo. 2022.PTM4Tag: Sharpening Tag Recommendation of Stack Overflow Posts with Pre-Trained Models (ICPC ’22). Association for Computing Machinery, New York, NY, USA, 1–11.https://doi.org/10.1145/3524610.3527897
  • Hou etal. (2023)Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2023.Large Language Models for Software Engineering: A Systematic Literature Review.arXiv:2308.10620[cs.SE]
  • Hu etal. (2022a)EdwardJ Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022a.LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.https://openreview.net/forum?id=nZeVKeeFYf9
  • Hu etal. (2022b)EdwardJ Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022b.LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.https://openreview.net/forum?id=nZeVKeeFYf9
  • Hu etal. (2018)Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018.Deep Code Comment Generation. In Proceedings of the 26th Conference on Program Comprehension (Gothenburg, Sweden) (ICPC ’18). Association for Computing Machinery, New York, NY, USA, 200–210.https://doi.org/10.1145/3196321.3196334
  • Huang etal. (2023a)Dong Huang, Qingwen Bu, Jie Zhang, Xiaofei Xie, Junjie Chen, and Heming Cui. 2023a.Bias Assessment and Mitigation in LLM-based Code Generation.arXiv:2309.14345[cs.SE]
  • Huang etal. (2024)Dong Huang, Qingwen Bu, Jie Zhang, Xiaofei Xie, Junjie Chen, and Heming Cui. 2024.Bias Testing and Mitigation in LLM-based Code Generation.
  • Huang etal. (2023b)Yizhan Huang, Yichen Li, Weibin Wu, Jianping Zhang, and MichaelR. Lyu. 2023b.Do Not Give Away My Secrets: Uncovering the Privacy Issue of Neural Code Completion Tools.arXiv:2309.07639[cs.CR]
  • Husain etal. (2019)Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019.CodeSearchNet challenge: Evaluating the state of semantic code search.arXiv preprint arXiv:1909.09436 (2019).
  • Jesse etal. (2023)Kevin Jesse, Toufique Ahmed, PremkumarT. Devanbu, and Emily Morgan. 2023.Large Language Models and Simple, Stupid Bugs. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). IEEE Computer Society, Los Alamitos, CA, USA, 563–575.https://doi.org/10.1109/MSR59073.2023.00082
  • Kanade etal. (2020)Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020.Learning and evaluating contextual embedding of source code. In International Conference on Machine Learning. PMLR, 5110–5121.
  • Karampatsis and Sutton (2020)Rafael-Michael Karampatsis and Charles Sutton. 2020.How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset. In Proceedings of the 17th International Conference on Mining Software Repositories (Seoul, Republic of Korea) (MSR ’20). Association for Computing Machinery, New York, NY, USA, 573–577.https://doi.org/10.1145/3379597.3387491
  • Kim etal. (2024) Minsoo Kim, Sihwa Lee, Janghwan Lee, Sukjin Hong, Du-Seong Chang, Wonyong Sung, and Jungwook Choi. 2024. Token-scaled logit distillation for ternary weight generative language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, Article 1824, 22 pages.
  • Kim and Rush (2016)Yoon Kim and AlexanderM. Rush. 2016.Sequence-Level Knowledge Distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Jian Su, Kevin Duh, and Xavier Carreras (Eds.). Association for Computational Linguistics, Austin, Texas, 1317–1327.https://doi.org/10.18653/v1/D16-1139
  • Kullback and Leibler (1951)Solomon Kullback and RichardA Leibler. 1951.On information and sufficiency.The annals of mathematical statistics 22, 1 (1951), 79–86.
  • Li etal. (2022)Jia Li, Zhuo Li, Huangzhao Zhang, Ge Li, Zhi Jin, Xing Hu, and Xin Xia. 2022.Poison Attack and Defense on Deep Source Code Processing Models.https://doi.org/10.48550/ARXIV.2210.17029
  • Li etal. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, et al. 2023. StarCoder: may the source be with you! arXiv:2305.06161 [cs.CL]
  • Li and Liang (2021)XiangLisa Li and Percy Liang. 2021.Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 4582–4597.https://doi.org/10.18653/v1/2021.acl-long.353
  • Liu etal. (2016)Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016.How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Jian Su, Kevin Duh, and Xavier Carreras (Eds.). Association for Computational Linguistics, Austin, Texas, 2122–2132.https://doi.org/10.18653/v1/D16-1230
  • Liu etal. (2022)Haokun Liu, Derek Tam, Muqeeth Mohammed, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. 2022.Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. In Advances in Neural Information Processing Systems, AliceH. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.).https://openreview.net/forum?id=rBCvMG-JsPd
  • Liu etal. (2023)Yan Liu, Xiaokang Chen, Yan Gao, Zhe Su, Fengji Zhang, Daoguang Zan, Jian-Guang Lou, Pin-Yu Chen, and Tsung-Yi Ho. 2023.Uncovering and Quantifying Social Biases in Code Generation.arXiv:2305.15377[cs.CL]
  • Lo (2023)David Lo. 2023.Trustworthy and Synergistic Artificial Intelligence for Software Engineering: Vision and Roadmaps.arXiv:2309.04142[cs.SE]
  • Nijkamp etal. (2023)Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023.CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In The Eleventh International Conference on Learning Representations.
  • Niu etal. (2023)Liang Niu, Shujaat Mirza, Zayd Maradni, and Christina Pöpper. 2023.CodexLeaks: Privacy Leaks from Code Generation Language Models in GitHub Copilot. In 32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 2133–2150.
  • Pearce etal. (2022)Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022.Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. In 43rd IEEE Symposium on Security and Privacy, SP 2022, San Francisco, CA, USA, May 22-26, 2022. IEEE, 754–768.https://doi.org/10.1109/SP46214.2022.9833571
  • Qian etal. (2022)Jing Qian, Li Dong, Yelong Shen, Furu Wei, and Weizhu Chen. 2022.Controllable Natural Language Generation with Contrastive Prefixes. In Findings of the Association for Computational Linguistics: ACL 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland, 2912–2924.https://doi.org/10.18653/v1/2022.findings-acl.229
  • Radford etal. (2019)Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.Language Models are Unsupervised Multitask Learners.(2019).
  • Ramakrishnan and Albarghouthi (2022)Goutham Ramakrishnan and Aws Albarghouthi. 2022.Backdoors in Neural Models of Source Code. In 2022 26th International Conference on Pattern Recognition (ICPR). IEEE Computer Society, Los Alamitos, CA, USA, 2892–2899.https://doi.org/10.1109/ICPR56361.2022.9956690
  • Sandoval etal. (2022)Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Brendan Dolan-Gavitt, and Siddharth Garg. 2022.Security implications of large language model code assistants: A user study.arXiv preprint arXiv:2208.09727 (2022).
  • Shazeer and Stern (2018)Noam Shazeer and Mitchell Stern. 2018.Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning. PMLR, 4596–4604.
  • Shi etal. (2024)Jieke Shi, Zhou Yang, and David Lo. 2024.Efficient and Green Large Language Models for Software Engineering: Vision and the Road Ahead.arXiv:2404.04566[cs.SE]
  • Shi etal. (2023)Jieke Shi, Zhou Yang, Bowen Xu, HongJin Kang, and David Lo. 2023.Compressing Pre-Trained Models of Code into 3 MB (ASE ’22). Association for Computing Machinery, New York, NY, USA, Article 24, 12pages.https://doi.org/10.1145/3551349.3556964
  • Tang etal. (2024)Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, and Dacheng Tao. 2024.A Survey on Transformer Compression.arXiv:2402.05964[cs.LG]
  • Vaswani etal. (2017)Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.Attention is all you need.Advances in neural information processing systems 30 (2017).
  • Wan etal. (2022)Yao Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Dezhong Yao, Hai Jin, and Lichao Sun. 2022.You See What I Want You to See: Poisoning Vulnerabilities in Neural Code Search. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Singapore, Singapore) (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA, 1233–1245.https://doi.org/10.1145/3540250.3549153
  • Wang etal. (2022) Chaozheng Wang, Yuanhang Yang, Cuiyun Gao, Yun Peng, Hongyu Zhang, and Michael R. Lyu. 2022. No More Fine-Tuning? An Experimental Evaluation of Prompt Tuning in Code Intelligence. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Singapore, Singapore) (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA, 382–394. https://doi.org/10.1145/3540250.3549113
  • Wang etal. (2023)Deze Wang, Boxing Chen, Shanshan Li, Wei Luo, Shaoliang Peng, Wei Dong, and Xiangke Liao. 2023.One Adapter for All Programming Languages? Adapter Tuning for Code Search and Summarization.arXiv:2303.15822[cs.SE]
  • Weyssow etal. (2024)Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, and Houari Sahraoui. 2024.Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models.arXiv:2308.10462[cs.SE]
  • Wilcoxon (1945)Frank Wilcoxon. 1945.Individual Comparisons by Ranking Methods.Biometrics Bulletin 1, 6 (1945), 80–83.http://www.jstor.org/stable/3001968
  • Xia and Zhang (2023)ChunqiuSteven Xia and Lingming Zhang. 2023.Keep the Conversation Going: Fixing 162 out of 337 bugs for 0.42 each using ChatGPT.arXiv:2304.00385[cs.SE]
  • Xu etal. (2022)FrankF. Xu, Uri Alon, Graham Neubig, and VincentJosua Hellendoorn. 2022.A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (San Diego, CA, USA) (MAPS 2022). Association for Computing Machinery, New York, NY, USA, 1–10.https://doi.org/10.1145/3520312.3534862
  • Yang etal. (2022)Zhou Yang, Jieke Shi, Junda He, and David Lo. 2022.Natural Attack for Pre-Trained Models of Code. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1482–1493.https://doi.org/10.1145/3510003.3510146
  • Yang etal. (2024a)Zhou Yang, Zhensu Sun, TerryZhuo Yue, Premkumar Devanbu, and David Lo. 2024a.Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code.arXiv:2403.07506[cs.SE]
  • Yang etal. ([n. d.])Zhou Yang, Bowen Xu, JieM. Zhang, HongJin Kang, Jieke Shi, Junda He, and David Lo. [n. d.].Stealthy Backdoor Attack for Code Models.IEEE Transactions on Software Engineering 01 (feb [n. d.]), 1–21.https://doi.org/10.1109/TSE.2024.3361661
  • Yang etal. (2024b)Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, Donggyun Han, and David Lo. 2024b.Unveiling Memorization in Code Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 72, 13pages.https://doi.org/10.1145/3597503.3639074
  • Zeng etal. (2022)Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, and Lingming Zhang. 2022.An Extensive Study on Pre-Trained Models for Program Understanding and Generation. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (Virtual, South Korea) (ISSTA 2022). Association for Computing Machinery, New York, NY, USA, 39–51.https://doi.org/10.1145/3533767.3534390
  • Zheng etal. (2023)Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2023.A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends.arXiv:2311.10372[cs.SE]
  • Zhou etal. (2023)X. Zhou, B. Xu, D. Han, Z. Yang, J. He, and D. Lo. 2023.CCBERT: Self-Supervised Code Change Representation Learning. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, Los Alamitos, CA, USA, 182–193.https://doi.org/10.1109/ICSME58846.2023.00028