Improved Program Repair Methods using Refactoring with GPT Models

Teachers often utilize automatic program repair methods to provide feedback on submitted student code using model answer code. A state-of-the-art tool is Refactory, which achieves a high repair success rate and a small patch size (i.e., fewer changes to the student's code) by refactoring code to expand the variety of correct code samples that can be referenced. However, Refactory has two major limitations. First, it cannot fix code with syntax errors. Second, it has difficulty fixing code when there are few correct submissions. Herein, we propose a new method that combines Refactory and OpenAI's GPT models to address these issues, and we conduct a performance measurement experiment. The experiment uses a dataset consisting of 5 programming assignment problems and almost 1,800 real-life incorrect Python program submissions from 361 students in an introductory programming course at a large public university. The proposed method improves the repair success rate by 1-21% when the set of correct code samples is sufficient, and the patch size is smaller than with Refactory alone in 16-45% of the cases. When there were no correct code samples at all (only the model answer code was used as a reference for repair), the proposed method improves the repair success rate by 1-43%, and the patch size is smaller than with Refactory alone in 42-68% of the cases.


INTRODUCTION
Providing feedback on programming assignments is a time-consuming and labor-intensive task for teachers, particularly when dealing with a large class of students. The conventional approach of manually identifying problems with the code and suggesting ways to correct them requires a deep understanding of the student's source code, and it can be difficult and time-consuming for both students and teachers.
Several attempts have been made to automate this process. For example, Gulwani et al. developed a tool called CLARA [6], which effectively employs automatic program repair techniques using correct student submissions in Massive Open Online Courses. Hu et al. developed a more efficient tool called Refactory [7], which outperformed CLARA by repairing 30% of the programs faster and with a smaller patch size, even with smaller datasets.
Despite these advancements, Refactory has two major limitations. First, it cannot repair programs with syntax errors (denoted as L1). Second, although Refactory has a high repair rate of about 90% on average, the rate drops below 80% on datasets with fewer correct submissions (denoted as L2).
Herein, we propose an improved program repair method that combines Refactory with OpenAI's GPT models to address the aforementioned challenges. We employ GPT models to repair programs with syntax errors (L1) and those that Refactory fails to repair due to a lack of correct submissions (L2). Additionally, to enhance performance, we leverage GPT models to create smaller patches based on the programs that were repaired by Refactory (or a teacher-provided reference program if Refactory fails to repair), subsequently selecting a more optimized solution from those generated by both Refactory and the GPT models.
The novel contribution of this paper is the achievement of performance improvements by combining GPT models with Refactory for more effective code repair.

RELATED WORK
To explain the context of our study, we divide earlier studies into three main categories: automated feedback, automatic program repair, and integration with large language models (LLMs). Automated feedback: Various studies have explored automatic feedback [2] and developed systems called online judges for providing feedback on programming assignments [17]. Online judges execute test cases to assess whether submitted code meets specifications. For example, Tillmann et al. developed Pex4Fun, which employs automated test generation technology for effective scoring [16]. However, novice students often struggle to repair their code after failed test cases, leading to the application of more direct automatic program repair techniques.
Automatic program repair in the software engineering field: There are many approaches to automatic program repair that help software engineers [5]. Kim et al. proposed a patch generation approach, Pattern-based Automatic program Repair (Par), using fix patterns learned from existing human-written patches [9]. Nguyen et al. presented an automated repair method based on symbolic execution, constraint resolution, and patch synthesis with specifications obtained from tests [12]. D'Antoni et al. proposed an approach to program repair based on program distance that can quantify changes not only to program syntax but also to program semantics [4]. Li et al. proposed an approach for multi-hunk, multi-statement fixes that combines traditional spectrum-based fault localization with deep learning and data-flow analysis [11].
Automatic program repair in the programming education field: Several studies have employed program repair techniques to provide feedback on programming assignments. Rolim et al. introduced REFAZER, an approach that learns program transformations and has been proven effective in correcting student-submitted assignments [13]. Gulwani et al. developed CLARA, a tool that utilizes correct student submissions to repair incorrect programs within Massive Open Online Courses, demonstrating its effectiveness with large student populations [6]. Bhatia et al. proposed a neuro-symbolic approach that combines neural networks and constraint-based reasoning to repair syntax errors [1]. Hu et al. proposed Refactory, a state-of-the-art program repair tool for generating student program repairs from correct submissions in real time [7]. Li et al. introduced another cutting-edge tool named AssignmentMender, targeting newly released assignments that may lack sufficient correct submissions [10]. We selected Refactory as our baseline because it operates effectively with or without an abundance of correct submissions. Furthermore, the availability of Refactory's source code, scripts, and datasets for replication enhances transparency and aligns seamlessly with our experimental design.
Integration with LLMs: Since 2023, LLMs such as GPT models have been explored for source code repair. Tian et al. evaluated the performance of source code repair using GPT and other models on the LeetCode and Refactory datasets [15]. Sobania et al. evaluated ChatGPT on QuixBugs, a set of standard bug-fixing benchmarks, and compared its performance with several other approaches in the literature. They found that ChatGPT's bug-fixing performance is on par with common deep learning approaches and significantly outperforms traditional program repair methods [14]. Joshi et al. developed RING, a multilingual repair engine powered by an LLM trained on code; RING outperformed language-specific repair engines for three of six programming languages [8]. These existing studies have shown that combining Refactory with GPT models has the potential to further improve program repair capabilities.

LIMITATIONS WITH REFACTORY
Refactory repairs incorrect programs through a three-phase process:
Phase 1. Refactoring: The correct answer programs are refactored according to predefined rules to increase the variation of correct programs.
Phase 2. Structure Alignment: The syntax of the incorrect program is analyzed to find a structural match with the correct programs refactored in Phase 1. If no match is identified, the structure of the incorrect program is edited (mutated) and the search continues.
Phase 3. Block Repair: Among the correct programs with the same control flow as the incorrect program, the top-k closest are sought for patch construction (k=5 in the experimental evaluation). A mapping between basic blocks of the correct and incorrect programs is created as a patch. If a program is successfully patched and passes the test suite, it is output as a successful result.
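To make the three phases concrete, the following is a deliberately simplified, illustrative sketch in Python; the helper logic is ours and is far cruder than Refactory's actual refactoring rules, structure mutation, and block mapping.

import ast

def same_control_flow(a: str, b: str) -> bool:
    # Crude structural proxy: compare the sequences of AST node types.
    def kinds(src):
        return [type(n).__name__ for n in ast.walk(ast.parse(src))]
    return kinds(a) == kinds(b)

def repair(buggy, correct_programs):
    # Phase 1 (refactoring) would expand correct_programs with rule-based
    # variants; it is omitted in this sketch.
    # Phase 2 (structure alignment): look for correct programs whose structure
    # matches the buggy one (Refactory additionally mutates the buggy
    # program's structure when no match is found).
    matches = [p for p in correct_programs if same_control_flow(buggy, p)]
    # Phase 3 (block repair): Refactory maps basic blocks of the top-k closest
    # matches onto the buggy program and validates the result against the test
    # suite; this sketch simply returns the first structural match, if any.
    return matches[0] if matches else None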
Although Refactory achieves a high repair rate, it has two issues. The first issue is its inability to repair source code containing syntax errors (L1). This limitation is significant because Phase 2 of Refactory explicitly analyzes the syntax of incorrect code, thereby excluding programs with syntax errors from being repaired. Consequently, students receive no feedback in these instances, hindering their learning process. The second issue concerns Refactory's diminished ability to fix code when few correct answer variations are available (L2). Refactory attempts to find a program with a structure similar to the incorrect code among the correct programs in Phase 2. Although it incorporates advanced features such as refactoring and structure mutation, the repair rate and patch size are worse for problems lacking sufficient correct programs.
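For illustration, the snippet below (our own toy example, not from the dataset) shows why L1 arises: Python's ast module cannot build a syntax tree for code with a syntax error, so any structural analysis is impossible from the start.

import ast

buggy = """
def search(x, seq)
    for i, v in enumerate(seq):
        if x <= v:
            return i
    return len(seq)
"""

try:
    ast.parse(buggy)  # structural analysis needs a parseable AST
except SyntaxError as e:
    # fails on the missing ':' after the def header
    print("cannot build an AST:", e.msg, "at line", e.lineno)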
Table 1 shows Refactory's experimental results. For Q2, which has a large number of lines of code (28) and fewer correct programs (291), the repair success rate (78.16%) is lower and the relative patch size (RPS; the tree edit distance between the buggy and the repaired program, normalized by the size of the buggy program) (0.56) is larger than for the other questions [7].

OVERVIEW OF OUR METHOD
Here, we propose a method to address the two problems discussed above and to achieve higher-performance code repair. Our method combines Refactory with OpenAI's GPT models. This study employs the two most recent general-purpose models as of August 2023: GPT-3.5-Turbo and GPT-4.

Leveraging GPT Models in Refactory
We added an implementation to Refactory's source code for improving program repair using GPT models. If the existing Refactory functionality can repair the program, then the input to GPT is the repaired program, and a smaller patch size repair is requested. However, if the existing Refactory functionality cannot repair the program, then the input to GPT is the teacher-provided reference code for the repair, and the smallest possible patch size repair is requested. Thus, our method can consider cases where Refactory does not find correct code with the same structure in the Phase 2 process, and those where Refactory is given code that contains syntax errors or unsupported language features (such as lambda expressions).
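The control flow of the combination can be sketched as follows; the function names and injected callables are ours, for illustration only, not the actual implementation.

def combined_repair(buggy_code, teacher_reference,
                    run_refactory,  # callable: returns repaired code or None
                    gpt_repair,     # callable: (buggy, reference) -> code or None
                    patch_size):    # callable: (buggy, fixed) -> numeric patch size
    repaired = run_refactory(buggy_code)
    if repaired is not None:
        # Refactory succeeded: ask GPT for a repair with an even smaller patch,
        # using Refactory's output as the reference code.
        candidate = gpt_repair(buggy_code, repaired)
        if candidate is not None:
            # Keep whichever solution implies the smaller patch.
            return min(candidate, repaired,
                       key=lambda fix: patch_size(buggy_code, fix))
        return repaired
    # Refactory failed (syntax error, unsupported features, or no structural
    # match): fall back to the teacher-provided reference code.
    return gpt_repair(buggy_code, teacher_reference)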

How to Interact with GPT Models
We interact with the GPT models via OpenAI's network API, sending requests and receiving synchronous responses. In our study, we modified Refactory to interact with the GPT models using the OpenAI Python library, version 0.27.8. One critical parameter controlling a GPT model's response is the temperature, which influences the randomness of answers. In the context of program repair, controlling randomness is vital for consistent and accurate results. A value of 0 makes responses deterministic, whereas a value of 2 makes them more random. We set the temperature to 0 to aid in refining our prompts during development and to improve the replicability of our study, enabling other researchers to replicate our experiment more easily. We made two prompts for the GPT models. These prompts are tailored to specific programming assignments through the use of template variables. Below is the first prompt:

As a Python programming expert, your objective is to correct the incorrect code provided. Follow these guidelines:
- Ensure your corrected code produces the same output and logic as the provided model solution.
- Make only essential modifications to the incorrect code, preserving its essence.
- Start by listing all user-defined identifiers in the incorrect code. Use as many of these identifiers as possible in your corrected code.
- Your corrected code's semantics should mirror the model solution.
- Ensure the syntax of the corrected code closely resembles the incorrect original, more so than the model solution.
- Retain variable and function names, comments, whitespaces, line break characters, parentheses, `pass`, `break`, `continue`, and any redundant expressions.

The first prompt is designed with stronger restrictions to minimize the patch size. However, in some cases it may fail to repair the code. If this failure occurs, our implementation replaces the first half of the prompt with one having the following looser restrictions, and then re-requests the repair:

Please amend the provided Python program code to align with the described functionality. When suggesting code changes, adhere to these guidelines:
- Follow the output format.
- Ensure it's executable using Python's exec function.
- Retain as much of the original code's character count as possible.
- Maintain existing comments unchanged.
- Maintain the original line breaks.
- Keep variable and function names intact.
- Ensure the amended code remains closer in structure to the original, rather than resembling the model solution provided.

# Problem Description {description}
The prompts are dynamically customized at runtime by replacing the template variables with the relevant content. The variables are defined as follows:
• {incorrect_code}: The full text of the incorrect source code.
• {reference_code}: The full source code of Refactory's repair result or the teacher-provided reference code.
• {description}: The problem statement of the programming assignment.
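A minimal sketch of the substitution, assuming standard str.format-style placeholders (the exact mechanism in our implementation may differ):

def build_prompt(template: str, incorrect_code: str,
                 reference_code: str, description: str) -> str:
    # Replace the three template variables listed above with runtime content.
    return template.format(incorrect_code=incorrect_code,
                           reference_code=reference_code,
                           description=description)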
If the response from GPT does not include the repaired code, or if the repaired code leads to a runtime error, the repair is requested again with additional instructions to address the error. This retry mechanism is performed up to a maximum of three times, balancing efficiency against the probability of a successful repair.
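Below is a minimal sketch of the request-retry loop, assuming the legacy OpenAI Python library (0.27.x) named above; the helper functions and the exact retry wording are ours, not the paper's verbatim implementation.

import openai

openai.api_key = "YOUR_API_KEY"

def extract_code(reply: str) -> str:
    # Naive extraction: take the first fenced block if present, else the whole reply.
    if "```" in reply:
        body = reply.split("```")[1]
        return body[len("python"):] if body.startswith("python") else body
    return reply

def run_check(code: str):
    try:
        exec(code, {})  # the real implementation also runs the assignment's test suite
        return None
    except Exception as error:
        return error

def request_gpt_repair(prompt: str, model: str = "gpt-3.5-turbo", max_retries: int = 3):
    for _ in range(max_retries):
        response = openai.ChatCompletion.create(
            model=model,
            temperature=0,  # deterministic answers for replicable repairs
            messages=[{"role": "user", "content": prompt}],
        )
        code = extract_code(response["choices"][0]["message"]["content"])
        error = run_check(code)
        if error is None:
            return code
        # Re-request with the error appended, as described above.
        prompt += f"\nYour previous code raised an error: {error!r}. Please fix it."
    return None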

EXPERIMENT
We investigated the following research questions (RQs):
• RQ1: Does combining Refactory with GPT models increase the repair rate?
• RQ2: Does combining Refactory with GPT models reduce the patch size?
• RQ3: Does combining Refactory with GPT models increase the processing time?

Experimental Setting
We conducted an experiment using the same dataset as Refactory to answer RQ1-3. The dataset is from an introductory programming course at a large public university. It contains 5 programming assignment problems and almost 1,800 real-life incorrect Python program submissions from 361 students. Table 1 shows the number of correct programs, the number of incorrect programs, and the average lines of code for each question. Additionally, each question includes teacher-provided reference code and 5-17 test suites. Table 2 shows the problem descriptions for the questions (Q1-5).
All experiments were performed on Google Cloud Platform's Google Compute Engine (machine type: general-purpose N2 series, 2 vCPUs (1 core), 8 GB of memory). In this experiment, the online refactoring phase, the structure mutation phase, and the block repair phase of Refactory were enabled. Two sampling rates were used: a 0% option, indicating that buggy programs were repaired using only the teacher-provided reference program, and a 100% option, indicating that 100% of correct student programs were used in the repair process.

Experimental Results
Tables 3 and 4 show the repair success rates at sampling rates of 100% and 0%, respectively. Figures 1 and 2 show stacked bar graphs of the repair success rates at 100% and 0% sampling rates, respectively. Figures 3 and 4 show box plots of the RPS for each question at sampling rates of 100% and 0%, respectively. In the legend, "Refactory" means a repair by Refactory without GPT models. Figure 5 shows a box plot of the time to repair each question at a 100% sampling rate, while Figure 6 shows the same at a 0% sampling rate. Tables 3 and 4 compare the repair rates of these methods. Repairs of incorrect programs using GPT models typically achieved a higher repair success rate than Refactory alone; the improvement was modest at a sampling rate of 100%, but reached 1-35% at a sampling rate of 0%. In addition, the repair rate of GPT-4 was 1-8% higher than that of GPT-3.5-Turbo for relatively complex problems such as Q2.
Figure 1 shows that GPT was successful for almost all repairs, including problems where Refactory's repair success rate was relatively low (e.g., Q2 and Q4). Additionally, in some cases where Refactory's repair success rate was very high (e.g., Q1 and Q3), GPT-3.5-Turbo could produce corrections with smaller patch sizes. This effect was more pronounced with GPT-4; for example, the number of smaller-patch-size repairs for Q1 using GPT-4 was double that of GPT-3.5-Turbo. Figure 2 shows that repair by GPT is more effective at lower sampling rates, particularly for Q2, which had relatively few correct programs and a large LOC. These results demonstrate that combining Refactory with GPT models can increase the repair rate (RQ1).
The use of GPT had a negligible effect on the RPS in Figures 3 and 4. By contrast, the use of GPT often resulted in smaller patch sizes than Refactory alone in Figures 1 and 2. Overall, repairs with GPT can generate a slightly smaller patch size than Refactory. These results show that combining GPT models with Refactory can reduce the patch size (RQ2).
Figures 5 and 6 show that the use of GPT has a negligible impact on the repair time (in seconds). The repair process with GPT takes only a few hundred milliseconds, including retries. For Q2, the repair time increased slightly when using GPT, but the repair was unsuccessful with Refactory alone. Thus, combining Refactory with GPT models does not increase the processing time significantly (RQ3).
In this study, we optimized the prompts to enhance the program repair capabilities of the GPT models. Deng et al. reported that LLMs can interpret and respond more effectively to prompts refined by the LLMs themselves, thereby boosting performance [3]. Implementing this strategy, we utilized ChatGPT to refine our initial prompts, which significantly improved the repair efficiency. Notably, the frequency with which GPT produced repairs with smaller patch sizes than traditional refactoring increased by a factor of 2.4 after the prompt optimization. Although an analysis of which sentences LLMs interpret more readily is beyond the scope of this study, it remains a relevant topic. Details on the original prompts, the corresponding experimental outcomes depicted in Figures 1 and 2, and the comprehensive prompt-tuning process can be found on our web page.

Threats to Validity
Threats to internal validity: The experiments were conducted on Google Cloud Platform's Google Compute Engine, where the instances did not have exclusive access to the underlying hardware. If an unusually demanding process were executed simultaneously on the same hardware, it could introduce variability in the measured execution times, thus affecting the experiment.
Threats to external validity: In this study, the same dataset utilized for the performance measurements of Refactory was employed for experimentation. While this provides consistency with previous work, it also implies that a different dataset could yield different results.
Future work could consider employing multiple datasets and running experiments on isolated hardware to validate the consistency of the results.

Limitations
OpenAI sets a rate limit on API calls for GPT models. Therefore, our method should be incorporated carefully into online systems that require a large amount of feedback in a very short period, such as several thousand requests per minute.
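One possible mitigation, sketched here under the same 0.27.x library (and not part of our implementation), is exponential backoff around each API call:

import time
import openai

def call_with_backoff(make_request, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return make_request()
        except openai.error.RateLimitError:
            time.sleep(2 ** attempt)  # wait 1, 2, 4, 8, ... seconds between attempts
    raise RuntimeError("rate limit persisted after retries")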

CONCLUSIONS AND FUTURE WORK
Herein, an automatic program repair method is proposed for feedback on programming assignments. The proposed method combines the existing state-of-the-art tool Refactory with OpenAI's GPT models and enhances performance by several tens of percent over the existing method. Our experiment shows that combining Refactory with GPT models can increase the repair rate (RQ1) and reduce the patch size (RQ2) without significantly increasing the processing time (RQ3).
In the future, we plan to investigate the effectiveness of GPT in repairing other datasets. This includes repairing programs in languages other than Python, as well as targeting more complex algorithmic programming tasks and GUI programming tasks.

Table 1: Results of the performance measurement experiments reported by Refactory, including the repair success rate and relative patch size (RPS) for Q1-5.

Table 2: Problem descriptions (excerpt).
Q4 (Sorting Tuples): Can we sort items other than integers? For this question, you will be sorting tuples! We represent a person using a tuple (<gender>, <age>). Given a list of people, write a function sort_age that sorts the people and returns a list in an order such that the older people are at the front of the list. An example of the list of people is [("M", 23), ("F", 19), ("M", 30)]. The sorted list would look like [("M", 30), ("M", 23), ("F", 19)]. You may assume that no two members in the list of people are of the same age.
Q5 (Top-k Elements): Write a function top_k that accepts a list of integers as the input and returns the greatest k number of values as a list, with its elements sorted in descending order. You may use any sorting algorithm you wish, but you are not allowed to use sort and sorted.
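For concreteness, the following are possible solutions we wrote for Q4 and Q5 (they are not the dataset's teacher-provided reference code):

def sort_age(people):
    # Insertion sort on (<gender>, <age>) tuples, oldest first.
    result = []
    for person in people:
        i = 0
        while i < len(result) and result[i][1] > person[1]:
            i += 1
        result.insert(i, person)
    return result

def top_k(lst, k):
    # Repeatedly extract the maximum, since sort and sorted are not allowed.
    items = list(lst)
    result = []
    for _ in range(min(k, len(items))):
        largest = max(items)
        items.remove(largest)
        result.append(largest)
    return result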