PyDex: Repairing Bugs in Introductory Python Assignments using LLMs

Students often make mistakes in their introductory programming assignments as part of their learning process. Unfortunately, providing custom repairs for these mistakes can require a substantial amount of time and effort from class instructors. Automated program repair (APR) techniques can be used to synthesize such fixes. Prior work has explored the use of symbolic and neural techniques for APR in the education domain. Both types of approaches require either substantial engineering efforts or large amounts of data and training. We propose to use a large language model trained on code, such as Codex (a version of GPT), to build an APR system -- PyDex -- for introductory Python programming assignments. Our system can fix both syntactic and semantic mistakes by combining multi-modal prompts, iterative querying, test-case-based selection of few-shots, and program chunking. We evaluate PyDex on 286 real student programs and compare to three baselines, including one that combines a state-of-the-art Python syntax repair engine, BIFI, and a state-of-the-art Python semantic repair engine for student assignments, Refactory. We find that PyDex can fix more programs and produce smaller patches on average.


INTRODUCTION
Programming education has grown substantially in popularity in the past decade [Singer 2019]. A key challenge associated with this growth is the need to provide novice students with effective and efficient learning support. In an ideal world, teaching assistants would monitor students' learning process, and when students' code is not correct, they would then help them to derive a correct solution. However, this approach does not scale and educational institutions struggle to find teaching assistants. As a result, there is an interest in developing automated tools that students can use for feedback instead. These tools provide custom repairs for their programming mistakes. The field of automated program repair (APR), which has a long history in the software engineering community [Ahmed et al. 2022; Le Goues et al. 2012, 2019; Long et al. 2017; Long and Rinard 2016; Mechtaev et al. 2016], has introduced different approaches [Gulwani et al. 2018; Hu et al. 2019; Pu et al. 2016; Rolim et al. 2017] to produce such automated repairs for student mistakes in introductory assignments. Given a buggy student program, the APR system aims to produce a patch that satisfies a specification (typically the instructor-provided test cases). The patch must also minimize the number of changes made, with the goal of facilitating student learning [Hu et al. 2019].
Prior automated program repair systems for student programming assignments have generally been implemented using purely symbolic [Gulwani et al. 2018; Hu et al. 2019; Rolim et al. 2017; Wang et al. 2018b] or purely neural [Ahmed et al. 2018; Pu et al. 2016] techniques. Symbolic approaches require substantial engineering efforts to develop, typically requiring significant program analysis/repair experience, as well as custom repair strategies tailored to the language domain in which students implement their assignments. Neural approaches mitigate some of the engineering challenges but typically require substantial amounts of data, often leading to specialized use cases for Massive Open Online Courses (MOOCs). Furthermore, these systems are typically tailored to focus exclusively on syntax repair or exclusively on semantic repair. For the latter, the assumption is that the code to be repaired contains no syntactic errors.
In this paper, we introduce PyDex, a Python repair tool built on top of Codex, a version of the popular LLM GPT-3 [Brown et al. 2020] that was further trained on code. PyDex is a unified syntactic and semantic repair engine for introductory Python programming assignments. Using a large language model trained on code (LLMC) removes the need for custom symbolic repair logic or retraining of a new neural model, and it allows us to handle both syntactic and semantic mistakes. While LLMCs have been successfully applied to tasks such as code generation [cop 2024], their impact in the education domain remains controversial [Berger 2022]. Using an LLMC for repair provides an opportunity to produce a positive impact in this domain.
We follow the approach of recent work [Joshi et al. 2023; Xia and Zhang 2022] in framing program repair as a code generation task that can be tackled with an LLMC. However, using LLMCs to produce student repairs requires addressing three challenges. First, the system must be able to handle multi-modality: the instructor may provide test cases, a description of the task in natural language, and language tooling (e.g., a compiler) may provide further information. Second, APR patches in the education domain need to reduce the number of changes to support learning; this requires that we limit the extent to which the LLM can generate more code than necessary or make changes to parts of the program that are not incorrect. Third, incorporating the LLMC as a core (but black box) component in our design requires that we adapt traditional prompt engineering techniques to our setting.
PyDex ensembles multi-modal prompts to generate complementary repair candidates. It employs prompts in an iterative querying strategy that first uses syntax-targeted prompts and then semantics-targeted prompts. To reduce the number of changes induced by syntax errors that should have relatively simple fixes, PyDex uses the program's structure to extract a subprogram to give as input to the LLMC. By reducing the code surface exposed to the LLMC, PyDex biases repairs towards fewer edits. When fixing semantics, PyDex takes inspiration from existing symbolic repair literature [Gulwani et al. 2018; Ke et al. 2015; Wang et al. 2018b] and leverages few-shot learning, which adds task-related examples to the prompt, by retrieving other students' programs that have similar mistakes (and eventual corrections). To identify these programs, PyDex computes a similarity metric over test-suite outcomes.
Proc. ACM Program. Lang., Vol. 8, No. OOPSLA1, Article 133. Publication date: April 2024.
We evaluated PyDex on student programs from an introductory Python programming course at a major university in India. Our evaluation has 15 programming tasks, totalling 286 student programs. These student programs contain both syntactic and semantic mistakes. As there is currently no tool that can fix both kinds of errors simultaneously, we compare PyDex to three baselines built by composing: BIFI [Yasunaga and Liang 2021], a state-of-the-art syntax repair tool for Python; Refactory [Hu et al. 2019], a state-of-the-art semantic repair tool for educational Python programs; and GenProg, a canonical semantic repair tool based on genetic programming. Specifically, we compare PyDex to BIFI+Refactory, PyDex+Refactory, and PyDex+GenProg, where for the latter two baselines we use PyDex to produce syntactic fixes before applying the corresponding semantic repair tool.
Our results show that PyDex can effectively repair student programs in our benchmark set. PyDex without few-shot learning can repair 86.71% of the student programs. This repair rate climbs to 96.5% with few-shots. Meanwhile, BIFI+Refactory, PyDex+Refactory, and PyDex+GenProg repair 67.13%, 83.57%, and 49.30%, respectively. Our statistical analysis shows that the improvement over BIFI+Refactory and PyDex+GenProg is statistically significant.
The average token edit distance associated with PyDex patches is smaller (28.59 without few-shots and 29.68 with few-shots) compared to the patches produced by the baselines BIFI+Refactory (70.39) and PyDex+Refactory (73.53). We found that PyDex+GenProg (22.82) produces slightly smaller patches, but the difference is not statistically significant. Our statistical analysis shows that the improvement over BIFI+Refactory and PyDex+Refactory is statistically significant.
We carried out an ablation study to understand the impact of our design decisions. Our results indicate that by performing iterative querying the repair rate rises from 82.87% to 86.71%. Furthermore, adding few-shots raises the repair success rate to 96.5%. The evaluation also shows that our techniques are important for keeping the repaired program similar to the buggy input program. For example, removing the program chunker, which selects subprograms in the syntax repair phase, raises the average token edit distance from 5.46 to 9.38 in the syntax phase. We also show that different multi-modal prompts have varying performance, but if we combine their candidates as we do in PyDex, we obtain the best performance.
To summarize, we make the following contributions:
• We propose an approach to automatically repair mistakes in students' Python programming assignments using a large language model trained on code (LLMC). Our approach uses multi-modal prompts, iterative querying, test-case-based few-shot selection, and structure-based program chunking to repair student mistakes. In contrast to prior work, our approach uses the same underlying LLMC to repair both syntactic and semantic mistakes.
• We implement this approach in PyDex, which uses OpenAI's popular Codex as the LLMC. We evaluate PyDex on a dataset of 286 real student Python programs drawn from an introductory Python programming course in India. We compare performance to three baselines that leverage popular repair systems such as BIFI, Refactory, and GenProg. Our results show that PyDex yields a statistically significant higher repair rate than 2 of our 3 baselines, and a statistically significant smaller average token edit distance (i.e., smaller patches) than 2 of our 3 baselines.
The remainder of the paper is structured as follows. Section 2 walks through multiple examples of real student mistakes, as well as associated PyDex patches. Section 3 provides a brief background on concepts related to large language models. Section 4 describes our approach in detail. Section 5 provides experimental results on our dataset of student Python programs. Section 6 details further discussions and limitations. We discuss related work in Section 7. Finally, we conclude with takeaways in Section 8.

Understanding Challenges in Repairing Introductory-Level Programs
Consider Figure 1, which shows a student's incorrect program, along with a solution generated by PyDex. The student is solving the task of reading two numbers from stdin and printing different results depending on whether both, either, or neither are prime.
The student has made both syntactic and semantic mistakes. Lines 1 and 2 call input twice to read from stdin, and parse these values as integers using int. However, this constitutes a semantic mistake, as the assignment input format consists of two values on the same line separated by a comma. Furthermore, a traditional semantic repair engine would fail to fix this student's assignment as there is also a syntactic mistake at line 30. The student used a single = for comparison in the elif clause (the correct syntax would be a double equals).
The PyDex solution, shown alongside it, fixes the input processing (semantic mistake) by reading from stdin, splitting on the comma, and applying int (to parse as integer) using the map combinator. Line 23 fixes the syntax error by replacing single equals with double equals (for comparison). Interestingly, the underlying LLMC (Codex) also refactored the student's program. In this case, lines 8 through 17 correspond to a function to check if a number is prime. This function is called twice, at lines 18 and 19. This replaces the repeated code in the original program, which spanned lines 9-17 and lines 18-26.
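To make the input-processing fix concrete, the following is a minimal sketch of the pattern described above (split on the comma, then apply int via map). The helper name parse_pair is hypothetical and is not taken from Figure 1; the figure's actual code is not reproduced here.

```python
def parse_pair(line: str) -> tuple:
    """Parse two comma-separated integers that arrive on a single line."""
    # map applies int to each comma-separated field; int tolerates
    # surrounding whitespace, so "7, 12" and "7,12" both parse.
    first, second = map(int, line.split(","))
    return first, second


# In the assignment setting the line would come from stdin,
# for example: a, b = parse_pair(input())
print(parse_pair("7, 12"))
```

By contrast, calling input twice, as the student did, waits for a second line that never arrives when both values are supplied on one line.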
The edit distance between the PyDex repair and the original student program is 95, while the distance between the instructor's reference solution and the original student program is 188. A smaller edit distance is a key goal for APR in the educational domain, as this can help the student understand the repair with respect to their own mistakes.
Figure 2 presents another example of an incorrect student program and a solution generated by PyDex. In this assignment, the students need to check whether a string, read from stdin, is a palindrome or not, and print out a message accordingly to stdout. For this student's program, PyDex has to generate a complex repair that fixes four syntax mistakes and multiple semantic bugs.
The student has made syntax errors on lines 4, 8, 10, and 12, where they have left off the colon symbol necessary for control flow statements in Python. On line 2, the student called a non-existent function lower. The student has used standard division on lines 5, 6, 13, and 14 when they should have used integer division. The student has included two spurious print statements, at lines 7 and 15, which will interfere with the instructor's test-suite execution, as the suite checks values printed to stdout for correctness. Finally, the student has omitted the expected print statements (along with the equality check) for the case where the input string is of even length.
While the student's program has many mistakes, the overall structure and key concepts are there. Looking at the PyDex solution shown alongside, it resolves these mistakes but preserves the student's overall structure. In particular, PyDex replaces the non-existent lower function with a call to the string method with the same name. It replaces the division operator (/) throughout the program with the intended floor division operator (//), comments out the extra print statements, and adds the missing equality check and print statements in the case of even-length inputs. The edit distance between the PyDex repair and the original student program is 52, while the distance between the instructor's reference solution and the original student program is 97. The reference solution is a standard one-line program for palindrome. Once again, the PyDex repair is closer to the student submission than the instructor's reference solution.

Insights
Based on our observation of errors in students' introductory-level programs, we extract the following insights guiding our solution design. First, incorrect introductory-level programs often contain both syntactic and semantic errors at the same time, and this is an extremely challenging scenario for existing APR tools to handle alone as they are tailored to focus exclusively on syntax repair [Rolim et al. 2017; Yasunaga and Liang 2021] or exclusively on semantic repair [Mechtaev et al. 2016, 2018]. While combining a state-of-the-art syntactic fixer and semantic fixer to repair programs is possible, we detail the (lower) performance and challenges in Section 5 and Section 6.1.
Second, an introductory-level program can have many mistakes, which require complex repairs. Such cases are difficult to address with traditional APR techniques [Le Goues et al. 2012; Long and Rinard 2015; Mechtaev et al. 2016, 2018; Qi et al. 2014; Xuan et al. 2017], as they often focus on specific error types, are limited to a small number of edits, and target specific types of statements (such as conditionals). For example, the repairs (e.g., control-flow changes, in-lined function addition) shown in this section are out-of-scope for traditional APR tools. Third, because the eventual consumers of the generated patches are introductory-level programmers, we should minimize the cognitive load associated with many changes where possible. Finally, because students themselves may want to run the repair tool (enabling them to learn independently), the engineering efforts associated with running the APR tool should be minimized as much as possible.

BACKGROUND
We now provide a short background on concepts related to large language models.
Large language model. A large language model (LLM) can be viewed as a probability distribution over sequences of words. This distribution is learned using a deep neural network with a large number of parameters. These networks are typically trained on large amounts of text (or code) with objectives such as predicting particular masked-out tokens or autoregressive objectives such as predicting the next token given the preceding tokens. When the LLM has been trained on significant amounts of code, we refer to it as a large language model trained on code (LLMC). In practice, most LLMs are now trained on code as well, so the functional difference between the two categories has become increasingly less relevant.
Often, LLMs are pre-trained and then fine-tuned, meaning trained further on more specialized data or tasks. A particularly popular LLMC is OpenAI's Codex [Chen et al. 2021], a variant of GPT-3 [Brown et al. 2020] that is fine-tuned on code from more than 50 million GitHub repositories.
Few- (or zero-)shot learning. In contrast to traditional supervised machine learning, LLMs have been shown to be effective for few- and even zero-shot learning. This means that the LLM can perform tasks it was not explicitly trained for just by giving it a few examples of the task or even no examples, respectively, at inference time.
In this setting of few- (or zero-)shot learning, the LLM is typically employed using what is termed prompt-based learning [Liu et al. 2023]. A prompt is a textual template that can be given as input to the LLM to obtain a sequence of iteratively predicted next tokens, called a generation. A prompt typically consists of a query and possibly zero or more examples of the task, called shots. For example, the prompt below includes a specific query to fix a syntax error. One valid generation, that fixes the syntax error, would be print().
In practice, a prompt can incorporate anything that can be captured in textual format. In particular, multi-modal prompts are those that incorporate different modalities of inputs, such as natural language, code, and data. Different prompts may result in different LLM completions. Other factors may also affect the completions produced, such as the sampling strategy or hyperparameters for the sampling strategy. One important hyperparameter is temperature, which controls the extent to which we sample less likely completions.
LLM selection. While we use OpenAI's Codex in this work, other LLMs could be used such as Salesforce's CodeGen [Nijkamp et al. 2023] or BigScience's BLOOM [Laurençon et al. [n. d.]]. Even within OpenAI's Codex there are different underlying models offered, including Codex-Edit [Open AI 2022]. We found performance to be better with the standard Codex completion model. We now leverage these concepts to describe our approach.

METHODOLOGY
Figure 3 provides an overview of PyDex's architecture. The student's buggy program first enters a syntax repair phase. In this phase, we extract subprograms from the original program that have a syntax error. Each such subprogram is fed to a syntax prompt generator that produces multiple syntax-oriented prompts. The LLMC then generates repair candidates, which are validated by the syntax oracle. This process is repeated until all syntax errors are removed. Any candidate that has no syntax errors moves on to the semantic phase. In this phase, PyDex uses a semantic prompt generator to produce semantics-oriented prompts. If it has access to other students' assignment history, PyDex can also add few-shots to these prompts. These prompts are fed to the LLMC, which generates new program candidates. These are validated by the test-suite-based semantic oracle. If multiple candidates satisfy all tests, PyDex returns the one with the smallest token edit distance with respect to the student's original program. We now describe each step in detail.

Syntax Phase
Students typically first resolve syntax errors in their assignments, and then move on to resolve semantic errors (such as test case failures). PyDex takes inspiration from this approach and similarly splits its repair into syntax and semantic phases.
In the first phase, PyDex receives the student's buggy program. A syntax oracle, for example, the underlying Python parser, is used to determine if there is a syntactic mistake. If there is no such mistake, the program can move into the semantic phase. However, if there is a mistake, PyDex must produce a patch that resolves it, before moving to the semantic phase.
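A syntax oracle of this kind can be sketched with Python's built-in ast module. This is a minimal illustration under the assumption that parsing with ast.parse is an acceptable stand-in for "the underlying Python parser"; the function name syntax_error is hypothetical and not part of PyDex.

```python
import ast
from typing import Optional


def syntax_error(source: str) -> Optional[SyntaxError]:
    """Return the first SyntaxError reported by the parser, or None if it parses."""
    try:
        ast.parse(source)
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return err
    return None


# A single = used for comparison, as in the example from Figure 1.
buggy = "x = 3\nif x = 3:\n    print(x)\n"
err = syntax_error(buggy)
if err is not None:
    # lineno gives the reported error location; msg is the parser's message,
    # whose exact wording differs between Python versions.
    print(err.lineno, err.msg)
```

The reported line number is what a chunking step would use as the error location, and the message is the kind of text that could be included in an error-aware prompt.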
While our syntax prompt generator could directly include the original program in its entirety in the prompt, we have found that doing so can result in spurious edits that are not actually necessary to resolve the syntax error.Existing work has also observed similar phenomena in the related area of natural language to code generation [Poesia et al. 2022].As a result, we introduced a component we call the program chunker to mitigate this challenge by reducing the amount of code included in the prompt.

Program Chunking.
For each syntax mistake in the original buggy program, the program chunker extracts a subset of lines that contains (1) the oracle-reported syntax error location and (2) the nearest encompassing control-flow statement. These chunks are a heuristic approximation of a basic block, and allow us to restrict the code input given to the LLMC. Note that we perform this heuristic approximation because a standard analysis to extract basic blocks typically requires a syntactically correct input program.
PyDex extracts the program chunk for the first (top-down) syntax error reported. Algorithm 1 outlines the procedure used to produce this program chunk. It takes advantage of both control-flow structure (based on Python keywords) and indentation, which are meaningful in the Python language. The program chunker first identifies the adjacent code that has the same or larger indentation level as the line with the syntax error. Then, if the code chunk contains control-flow related keywords, such as if and elif, PyDex makes sure the associated keywords (such as elif or else) for the same control flow statement are also in the chunk. This code chunk is then provided to the syntax prompt generator.
In the example of Figure 4, the algorithm moves up from the error line, whose indentation level is errIndent, and stops upon encountering the first line with an indentation level smaller than errIndent. The algorithm sets this as the starting line of the code chunk and then marks its indentation level as startIndent, which in this example is 0. At this starting line, if the line starts with a control-flow keyword (such as the if at line 2), the process moves down until reaching the first unmatched control-flow statement at an indentation level less than or equal to startIndent. Otherwise, if the starting line does not start with a control-flow keyword, the algorithm simply moves down to higher-indexed code lines, including any consecutive line with an indentation level greater than or equal to startIndent until it finds a line with less indentation. In the provided example, the algorithm stops at line 7, resulting in a final code chunk spanning from line 2 to line 6. This example shows the algorithm's ability to selectively extract a code chunk based on both indentation levels and control-flow structures, as depicted in Figure 4.
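The chunking heuristic described above can be sketched as follows. This is an illustrative approximation, not a reproduction of Algorithm 1: the keyword lists, the function names, the treatment of blank lines, and the fallback of starting at the error line when no less-indented line exists are all assumptions made for this sketch.

```python
import re
from typing import List, Tuple

# Assumed keyword sets; the paper only names if, elif, and else explicitly.
OPENERS = ("if", "for", "while", "try", "with", "def", "class")
CONTINUATIONS = ("elif", "else", "except", "finally")


def indent_of(line: str) -> int:
    return len(line) - len(line.lstrip(" "))


def first_word(line: str) -> str:
    match = re.match(r"\s*(\w+)", line)
    return match.group(1) if match else ""


def chunk_bounds(lines: List[str], err_idx: int) -> Tuple[int, int]:
    """Return the chunk as 0-based (start, end) with end exclusive."""
    err_indent = indent_of(lines[err_idx])
    start = err_idx
    # Move up to the first line that is indented less than the error line.
    for i in range(err_idx - 1, -1, -1):
        if lines[i].strip() and indent_of(lines[i]) < err_indent:
            start = i
            break
    start_indent = indent_of(lines[start])
    # If the chunk starts at elif/else, walk up to the statement it belongs to.
    while first_word(lines[start]) in CONTINUATIONS:
        j = start - 1
        while j >= 0 and (not lines[j].strip() or indent_of(lines[j]) > start_indent):
            j -= 1
        if j < 0 or indent_of(lines[j]) != start_indent:
            break
        start = j
    is_control = first_word(lines[start]) in OPENERS + CONTINUATIONS
    end = start + 1
    while end < len(lines):
        line = lines[end]
        if line.strip():  # blank lines never end a chunk in this sketch
            indent = indent_of(line)
            if is_control:
                continues = indent == start_indent and first_word(line) in CONTINUATIONS
                if indent <= start_indent and not continues:
                    break
            elif indent < start_indent:
                break
        end += 1
    return start, end


program = [
    "x = int(input())",
    "if x > 0:",
    "    print('pos'",  # unbalanced parenthesis: the syntax error
    "elif x == 0:",
    "    print('zero')",
    "else:",
    "    print('neg')",
    "print('done')",
]
start, end = chunk_bounds(program, err_idx=2)
print("\n".join(program[start:end]))
```

On this toy program the chunk covers the whole if/elif/else statement and excludes the surrounding top-level lines, which is the behavior the example in the text describes.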

Syntax Prompt Generator.
The syntax prompt generator produces two (multi-modal) prompts, one with and one without the syntax error message reported by the syntax oracle. An example of both is shown in Figure 5. Because the syntax oracle is available, we do not need to choose a single prompt template for all programs, but instead we query the LLMC with both prompts, extract the code portion from each generation, merge it into the original program by replacing the lines corresponding to the current program chunk, and then rely on the syntax oracle to filter out invalid repairs.
If a program candidate has no syntax errors, it can move on to the semantic phase. If any syntax errors remain, the syntax phase is repeated. This iteration allows the repair of multiple, spatially independent syntax errors. For our evaluation, we allow this procedure to iterate at most two times to limit repair times.
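The merge-and-filter step can be illustrated with a short sketch. It assumes the LLMC generations have already been reduced to plain code strings and uses ast.parse as the syntax oracle; the function names and the hard-coded generations are hypothetical, and no model is actually queried.

```python
import ast
from typing import List


def merge_chunk(lines: List[str], start: int, end: int, repaired_chunk: str) -> str:
    """Replace lines[start:end] with the repaired chunk and return the full program."""
    merged = lines[:start] + repaired_chunk.splitlines() + lines[end:]
    return "\n".join(merged) + "\n"


def parses(source: str) -> bool:
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def syntactically_valid(lines: List[str], start: int, end: int,
                        generations: List[str]) -> List[str]:
    """Merge every generation into the program and keep the ones that parse."""
    candidates = (merge_chunk(lines, start, end, g) for g in generations)
    return [c for c in candidates if parses(c)]


original = ["x = 3", "if x = 3:", "    print(x)"]
# Stand-ins for code extracted from the generations of the two prompts.
generations = ["if x == 3:\n    print(x)", "if x = = 3:\n    print(x)"]
valid = syntactically_valid(original, 1, 3, generations)
print(len(valid))
```

Because invalid candidates are simply discarded by the oracle, there is no need to decide in advance which of the two prompts is better suited to a given program.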

Semantic Phase
After PyDex has generated syntactically valid candidate programs, the repair procedure moves to a semantic repair phase. Intuitively, this phase incorporates information that allows the LLMC to generate candidate programs that satisfy the programming assignment task, as determined by a semantic oracle. Following the approach of existing work in automated repair for programming assignments [Gulwani et al. 2018; Hu et al. 2019], we use the instructor's test suite (inputs and expected outputs) as the semantic oracle. We say a program is repaired if it produces the expected outputs for the given inputs.
Fig. 6. An example multimodal prompt (in zero-shot setting for brevity) produced by the semantic prompt generator. This prompt includes code, natural language, and test cases. Lines starting with the double brackets are shown only for clarity; they are not part of the prompt itself.

Semantic Prompt Generator.
The semantic prompt generator takes advantage of the rich set of signals available in the education domain. In particular, we exploit the fact that programming assignments typically have available: (1) a natural language description of the task, (2) a set of test cases, and (3) peers' programming solutions.
The semantic prompt generator takes as input a syntactically valid program, the task description in natural language, and the set of instructor-provided test cases. The generator then produces prompts with different combinations of this information. Figure 6 shows an example of such a multimodal prompt. This prompt includes the student's buggy code, the natural language description of the assignment, as well as the input-output-based test cases.
If PyDex has access to other students' assignment solution history, then it can also employ few-shot learning, described in the following Section 4.2.2, in each of these prompts.
Similarly to the syntax phase, rather than picking a single prompt template, we use all prompts generated and rely on the semantic oracle to identify viable repair candidates. Each prompt given to the LLMC can generate up to ten candidates, a limit we set heuristically to balance the exploration of candidates with search space explosion. Each of these candidates is given to the semantic oracle, which executes that candidate on the test suite. We remove any candidate programs that result in a runtime exception or fail to satisfy any test cases.
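A test-suite-based semantic oracle for stdin/stdout assignments can be sketched as below. This is an assumption-laden illustration rather than PyDex's implementation: the function name, the whitespace-insensitive output comparison, and the 5-second timeout are choices made for this sketch, and the candidate program and tests are invented.

```python
import subprocess
import sys
from typing import List, Tuple


def passes_all_tests(source: str, tests: List[Tuple[str, str]],
                     timeout: float = 5.0) -> bool:
    """Run the candidate on each (stdin, expected stdout) pair in a fresh interpreter."""
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", source],
                input=stdin_text, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        # A non-zero exit code covers uncaught runtime exceptions.
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True


candidate = "a, b = map(int, input().split(','))\nprint(a + b)\n"
tests = [("1,2\n", "3"), ("10,5\n", "15")]
print(passes_all_tests(candidate, tests))
```

Running each candidate in a separate process keeps a crashing or non-terminating candidate from affecting the oracle itself.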
If there are multiple valid candidate programs after the semantic phase, we return the one with the smallest token-based edit distance [Yasunaga and Liang 2021] to the student's submission as the repaired program.
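The selection step can be sketched with a Levenshtein distance over token sequences. The regular-expression tokenizer below is an assumption chosen because it also works on programs that do not parse; it is not the tokenizer used by BIFI or PyDex, and the function names are hypothetical.

```python
import re
from typing import List


def tokens(source: str) -> List[str]:
    """Split into identifier/number runs and single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", source)


def token_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute) over token sequences."""
    ta, tb = tokens(a), tokens(b)
    prev = list(range(len(tb) + 1))
    for i, x in enumerate(ta, 1):
        cur = [i]
        for j, y in enumerate(tb, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def closest_candidate(original: str, candidates: List[str]) -> str:
    """Return the candidate with the smallest token edit distance to the original."""
    return min(candidates, key=lambda c: token_edit_distance(original, c))


student = "if a = b:\n    print('same')\n"
repairs = [
    "if a == b and True:\n    print('same')\n",
    "if a == b:\n    print('same')\n",
]
print(closest_candidate(student, repairs))
```

Under this tokenization, turning = into == costs a single inserted token, so the minimal repair is preferred over the one that also adds an unnecessary condition.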

Few-Shot Learning.
If PyDex has access to other students' programs it can employ few-shot learning.In contrast to other repair systems, such as Refactory [Hu et al. 2019], that typically employ only correct programs, PyDex's few-shots consist of both correct and incorrect programs.
In particular, PyDex's few-shot learning example bank consists of pairs of program versions $(p, p')$ where both $p$ and $p'$ satisfy the syntax oracle, $p'$ satisfies the semantic oracle but $p$ does not, and $p$ is a historical edit-version ancestor of $p'$. Given a candidate program produced by the syntax phase of PyDex, we retrieve the three most similar $p$ and their associated correct versions $p'$ to include as shots in the LLMC prompts produced by the semantic prompt generator.
We take inspiration from traditional automated program repair and say two programs are similar if they result in similar test suite executions [Perry et al. 2019]. We define a test suite execution vector for program $p$ that captures test failures as $T_p \in \{0, 1\}^n$, where $n$ is the number of test cases and $T_p^i$ is the boolean failure status of the $i$-th test. We define the similarity function between $p_1$ and $p_2$ as $1 - \mathrm{Hamming}(T_{p_1}, T_{p_2})$, where Hamming is the normalized Hamming distance [Hamming 1950] between the two vectors.
Figure 7 is an illustrative example (note this is not an actual student problem; we have created a simplified example) of a prompt structure for our few-shot learning setting. In this prompt example, we lay out the few-shots as a prefix, followed by the target buggy program, the test suite information, and then a prefix to prompt the model to return a corrected version of the buggy program. Note that if PyDex does not have access to peer programs, then it can still query the LLMC using a zero-shot approach. In our evaluation (Section 5) we show that this ablated strategy still performs competitively.
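The similarity function and the retrieval of the three most similar shots can be sketched as follows. The representation of the example bank as (failure vector, buggy program, corrected program) triples and the function names are assumptions for illustration; the programs in the toy bank are placeholders.

```python
from typing import List, Sequence, Tuple

# (test failure vector of the buggy version, buggy version, corrected version)
Shot = Tuple[Sequence[bool], str, str]


def similarity(t1: Sequence[bool], t2: Sequence[bool]) -> float:
    """1 minus the normalized Hamming distance between two failure vectors."""
    if len(t1) != len(t2) or not t1:
        raise ValueError("failure vectors must be non-empty and of equal length")
    mismatches = sum(x != y for x, y in zip(t1, t2))
    return 1 - mismatches / len(t1)


def most_similar(target: Sequence[bool], bank: List[Shot], k: int = 3) -> List[Shot]:
    """Return the k bank entries whose failure vectors are closest to the target's."""
    return sorted(bank, key=lambda shot: similarity(target, shot[0]), reverse=True)[:k]


target = [True, False, True, False]  # the candidate fails tests 1 and 3
bank = [
    ([True, True, True, False], "buggy_a", "fixed_a"),
    ([False, True, False, True], "buggy_b", "fixed_b"),
    ([True, False, True, False], "buggy_c", "fixed_c"),
    ([False, False, True, False], "buggy_d", "fixed_d"),
]
for vector, buggy, fixed in most_similar(target, bank):
    print(buggy, similarity(target, vector))
```

Two programs that fail exactly the same tests get similarity 1, and programs that fail complementary sets of tests get similarity 0.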

EVALUATION
We explore the following two research questions in our evaluation of PyDex:
• (RQ1) How does PyDex's overall performance compare to different baselines, which combine state-of-the-art syntactic and semantic repair approaches?
• (RQ2) What is the impact of the underlying design decisions in PyDex? Specifically, what is the impact of the structure-based program chunking, iterative querying, test-case-based few-shot selection, and multi-modal ensembled prompts?
Implementation. We have built a PyDex prototype using a mix of Python and open-source software libraries. The core of PyDex's implementation consists of approximately 600 lines of Python code, which is 5 to 10 times fewer than a typical symbolic repair system in the education domain [Gulwani et al. 2018; Hu et al. 2019; Rolim et al. 2017]. In addition to the reduced engineering efforts, PyDex can handle both syntactic and semantic bugs in one system, while most systems address one type. We selected the top 10 program candidates in each syntax and semantics phase based on the average token log probabilities produced by the LLMC. We used OpenAI's Codex as our LLMC. Specifically, we used the completion model. We found that other models, such as Codex Edit [Open AI 2022], did not perform as well. We set the temperature to 0.8 based on preliminary experiments. We ran experiments on a Windows VM (Intel i7 CPU, 32GB RAM).
Benchmarks. We derived a benchmark set by selecting programs from a collection of introductory Python assignments collected by third-party authors in a large Indian university [H. Padmanabha et al. 2023]. This dataset is a Python version of the dataset described in [Chhatbar et al. 2020].
The dataset contains 18 assignments, each with a problem description, the test suite, and students' authoring history. A student's history consists of an ordered collection of program versions, where each version can be an explicit submission to the testing server, or a periodic (passive) snapshot; the dataset does not have a way to distinguish between these. For each assignment, we selected the students that had an eventually correct program. For each such student, we followed the standard practice [Rolim et al. 2017] of collecting the latest (closest to the correct version in time) version that had a syntactic mistake as our repair target. This results in a total of 286 program pairs, each consisting of a buggy and a ground-truth correct program version. We make available our filtered evaluation dataset here: https://github.com/microsoft/prose-benchmarks/tree/main/PyDex.
We removed three assignments that required reading files that are not reported in the dataset or that asked students to generate a PDF plot, which makes assessing correctness difficult without extra manual inspection. We manually checked the students' submissions and we found their errors were diverse. The repaired syntax errors in PyDex benchmarks include incorrect indentation, illegal usage of an empty block, misspelling a keyword, undefined symbols, and unmatched delimiters such as parentheses, among others. Table 1 shows a summary of these errors.
Baselines. Most repair systems focus on either syntactic or semantic repairs. To create a state-of-the-art baseline that performs both, we combined BIFI, a state-of-the-art transformer-based Python syntax repair tool, and Refactory, a state-of-the-art semantic repair tool designed for introductory Python assignments.
To run this baseline, we gave BIFI the original student program with syntax errors and generated 50 candidate programs for each buggy program. For each candidate, we ran the syntax oracle and checked for syntactic correctness. For each candidate that passed the syntax check, we called Refactory along with the instructor's reference solution. If Refactory can repair any of the candidates, we say it has repaired the student's program. If there are multiple candidate programs that passed the test suite, we choose the one with the smallest token edit distance from the original.
We also consider two additional baselines. In the first, we use PyDex to produce syntax repairs and then apply Refactory to repair any remaining semantic errors, as described previously. We refer to this baseline as PyDex+Refactory.
Finally, we consider a baseline that uses a version of GenProg [Le Goues et al. 2012] for semantic repairs. Because there is no official implementation of GenProg for Python programs, we took a publicly available implementation [Zeller 2023] that adapts portions of the algorithm to better match Python syntax. This approach evolves a student's buggy submission and can also incorporate statements from the instructor's reference solution. Like Refactory, GenProg assumes the input program does not contain syntax errors. So, to run our comparison, we use PyDex to produce syntax repairs and then apply GenProg. We use GenProg to generate up to 10 candidates with a 30-second timeout for each repair attempt. We refer to this baseline as PyDex+GenProg.
Table 2. PyDex (without few-shots) repairs a larger fraction of programs (86.71%) compared to our baselines (67.13%, 83.57%, 49.3%). On average, PyDex repairs are closer in terms of token edit distance (TED) to the original student program compared to two of the three baselines. Adding few-shots based on other peers' programs raises PyDex's repair rate to 96.50% while keeping a comparable average token edit distance (29.68). To save space in the table, "ID" represents the problem ID in the dataset, "# Sub" means the number of submissions for this problem, and "RR" is short for repair rate.

The mean token edit distance between the buggy program and our repaired program is 28.59 (no few-shots) and 29.68 (with few-shots), compared to 70.39 for BIFI+Refactory, 73.53 for PyDex+Refactory, and 22.82 for PyDex+GenProg.
We carry out a statistical analysis to compare performance across these systems. We exclude PyDex without few-shots as this is effectively an ablation. We compare the repair rate and mean token edit distance across assignments and systems by using paired t-tests. We use paired tests as performance is paired at the assignment level. We carry out the paired t-tests using pairwise comparisons with a Bonferroni adjustment for repeated comparisons. For the repair rate, we consider a 1-sided test with an alternative hypothesis of performance being greater for PyDex. For the mean token edit distance (TED), we consider a 1-sided test with an alternative hypothesis of PyDex's TED being smaller. Because TED can be undefined if a system fails to repair any programs, we exclude assignments where any baseline has a repair rate of zero (i.e., assignments 2882, 2920, 2921).
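For readers unfamiliar with the procedure, the following sketch shows the two ingredients named above: the paired t statistic over per-assignment differences, and the Bonferroni adjustment, which multiplies each p-value by the number of comparisons and caps it at 1. The function names are illustrative, and converting the t statistic into a 1-sided p-value (a lookup in Student's t distribution with n - 1 degrees of freedom) is left to a statistics package.

```python
import math
import statistics


def paired_t_statistic(xs, ys):
    """Return (t, degrees_of_freedom) for paired samples xs and ys.

    t is positive when xs tends to exceed ys, which matches a 1-sided
    alternative of "xs is greater".
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean / (sd / math.sqrt(n)), n - 1


def bonferroni(p_values):
    """Adjust p-values for len(p_values) comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Pairing at the assignment level means `xs[i]` and `ys[i]` must be the two systems' scores on the same assignment.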
For repair rates, we find that the comparison between PyDex and BIFI+Refactory (and similarly between PyDex+Refactory and BIFI+Refactory) is statistically significant (at 0.01), and so we reject the null hypothesis. We find that the comparison between PyDex and PyDex+Refactory results in a p-value of 0.057 (after Bonferroni adjustment), so we do not reject the null hypothesis in this case (though if we reduce the number of pairwise comparisons, it is significant). Finally, the comparison between PyDex and PyDex+GenProg is statistically significant at 0.01.
For mean TED, we find that the comparison between PyDex and BIFI+Refactory (as well as between PyDex and PyDex+Refactory) is significant at p=0.01. We also find that the comparison between PyDex and PyDex+GenProg is not statistically significant.
From this analysis, we conclude that PyDex outperforms the baseline BIFI+Refactory on both repair rate (higher) and size of repair (smaller), PyDex+Refactory on the size of repair but not necessarily on repair rate, and PyDex+GenProg on repair rate but not on the size of repair.

Repairing semantic errors typically depends on first resolving any syntactic errors. Indeed, students often focus on resolving mistakes reported by the parser/compiler before they move on to debugging test cases. PyDex's architecture reflects this approach. As a result, we also want to understand syntax repair performance by comparing just PyDex and BIFI.
Table 3 summarizes the syntax repair rates across assignments and approaches. Our results show that PyDex repairs the syntax bugs in all of the 286 programs, with a 100% syntax repair rate. This outperforms the state-of-the-art BIFI, which has a syntax repair rate of 80.07%. In addition, PyDex's syntax repairs have a substantially lower mean token edit distance (5.46 versus 25.07), meaning our repairs on average introduce fewer changes to the original programs, which may facilitate understanding of the fixes.
We also observed that in 17 out of 286 cases, BIFI fails to handle the input program, potentially due to lexer issues. This highlights another advantage of using PyDex to repair programs: because of its prompt-based learning strategy, PyDex does not impose any constraints on the input.
BIFI is very effective at repairing small syntax mistakes in assignments of lower difficulty. For example, in assignment 2865, BIFI repairs all syntax errors and does so with a smaller average token edit distance (1.82 versus 2.18) compared to PyDex. One interesting direction for future work is to combine BIFI with PyDex, as the repairs can be complementary. In this case, PyDex could focus on generating more complex repairs and BIFI could focus on small edits for simpler tasks such as a missing quote in a string.

RQ2: Ablation Study
We now present the results of experiments to analyze different design choices in PyDex. PyDex uses multimodal prompts, iterative querying, test-case-based few-shot selection, and structure-based program chunking to repair student mistakes. The power of few-shot selection was already shown in Table 2. We will now present the results of the other three design choices.

Program Chunking.
The intuition is that the program chunks PyDex extracts in the syntax stage contain the syntax error we want to fix, along with the surrounding context, while excluding code lines that are not relevant to the fix. Our goal is to reduce the number of (spurious) edits produced by the LLMC by reducing the code surface in the prompt.
To evaluate the impact of program chunking on the syntax repair stage, we removed it from PyDex and compared syntax repair performance to the original approach. Table 4 shows the average token edit distance produced in the syntax phase with and without program chunking. We found that program chunking can reduce the average token edit distance by up to 56.32% (problem assignment 2878). Overall, the average token edit distance is reduced from 9.38 to 5.46 (41.79%) by adding program chunking.
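To make the idea of structure-based chunking concrete, here is a minimal sketch. It is not the paper's Algorithm 1: the keyword list, the function names, and the exact expansion rule are assumptions. It only illustrates how indentation and control-flow keywords can be used to cut out the block that surrounds the line flagged by the parser.

```python
BLOCK_KEYWORDS = ("def", "class", "if", "elif", "else", "for",
                  "while", "try", "except", "with")


def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def extract_chunk(lines, error_lineno):
    """Return 0-based inclusive (start, end) indices of the chunk.

    error_lineno is 1-based, as reported by the Python parser.
    """
    err = error_lineno - 1
    start = err
    # Walk up to the closest enclosing block header:
    # a less-indented line that begins with a block keyword.
    for i in range(err - 1, -1, -1):
        stripped = lines[i].strip()
        if (stripped
                and indent_of(lines[i]) < indent_of(lines[err])
                and stripped.split()[0].rstrip(":") in BLOCK_KEYWORDS):
            start = i
            break
    # Walk down while lines are still nested inside that header.
    base = indent_of(lines[start])
    end = err
    for j in range(err + 1, len(lines)):
        if lines[j].strip() and indent_of(lines[j]) <= base:
            break
        end = j
    # Drop trailing blank lines from the chunk.
    while end > err and not lines[end].strip():
        end -= 1
    return start, end
```

Only `lines[start:end + 1]` would then be placed in the syntax prompt, and the patched chunk spliced back into the full program afterwards.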

Iterative Querying.
Students typically resolve syntax errors first and then move on to resolving semantic mistakes. PyDex's architecture follows this same intuition. To evaluate the effectiveness of this iterative approach, we ran a variant of PyDex that addresses both syntax and semantic bugs in a single round. Table 5 shows the results of this ablated variant and full PyDex (without few-shots). We find that splitting concerns into two phases results in an increase in the overall repair rate from 82.87% to 86.71%. Using two phases increases the average TED slightly (26.79 to 28.59). However, for the majority of the problems (10 out of 15), PyDex (with iterative) has a smaller or equal mean TED than PyDex (no iterative). In the remaining 5 problems, we found that PyDex with iterative querying has a larger mean TED because it successfully generates repairs for challenging buggy submissions that PyDex (no iterative) is unable to repair.
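The two-phase control flow can be sketched as below. This is an illustration under stated assumptions: `query_llm` and `passes_tests` are hypothetical callbacks standing in for the LLMC and the test harness, and the sketch returns the first passing candidate rather than ranking all passing candidates by edit distance.

```python
import ast


def syntax_ok(source):
    """Syntax oracle: True if the Python parser accepts the source."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def repair_two_phase(buggy, query_llm, passes_tests):
    """query_llm(kind, source) -> list of candidate sources (hypothetical)."""
    # Phase 1 (syntax): keep only candidates that the parser accepts.
    if syntax_ok(buggy):
        parsed = [buggy]
    else:
        parsed = [c for c in query_llm("syntax", buggy) if syntax_ok(c)]
    # Phase 2 (semantics): test each parsed program, and query again
    # only when it still fails the test suite.
    for program in parsed:
        if passes_tests(program):
            return program
        for candidate in query_llm("semantic", program):
            if syntax_ok(candidate) and passes_tests(candidate):
                return candidate
    return None  # no repair found
```

The single-round ablation corresponds to issuing one query on the original buggy program and applying both oracles to its output.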

Multimodal Prompts.
PyDex combines different types of input (code, natural language, test cases) into its prompts. This richness of inputs is a particular advantage of the educational setting. PyDex ensembles these various prompts by querying the LLMC and then relying on the (syntax or semantics) oracle to filter out candidates. This approach is based on the idea that different prompts may produce complementary candidates. Figure 8 shows that different prompt structures result in different overall performance in terms of fix rate. If a single prompt structure needs to be chosen, the Program + Diagnostics + Description + Tests structure is the most effective in this experiment. However, the candidates produced by different structures are complementary, so ensembling them outperforms any single structure.
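A possible shape for such an ensemble is sketched below. The set of prompt structures (the program plus every subset of the optional parts), the `###` section markers, and the `query_llm` and `oracle` callbacks are all assumptions made for illustration; the section above does not specify the exact structures or prompt layout.

```python
from itertools import combinations

OPTIONAL_PARTS = ("diagnostics", "description", "tests")


def prompt_structures():
    """Yield each structure: the program plus a subset of optional parts."""
    for r in range(len(OPTIONAL_PARTS) + 1):
        for subset in combinations(OPTIONAL_PARTS, r):
            yield ("program",) + subset


def build_prompt(structure, parts):
    """Concatenate the chosen parts under simple (hypothetical) headers."""
    return "\n\n".join(f"### {name}\n{parts[name]}" for name in structure)


def ensemble(parts, query_llm, oracle):
    """Union of oracle-approved candidates across all prompt structures."""
    kept = []
    for structure in prompt_structures():
        for candidate in query_llm(build_prompt(structure, parts)):
            if oracle(candidate) and candidate not in kept:
                kept.append(candidate)
    return kept
```

Because the oracle filters every candidate, adding a weaker prompt structure to the ensemble can only add valid candidates; its cost is the extra queries.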

DISCUSSION
We now discuss two important points. First, we provide details on why simply combining a state-of-the-art syntax repair tool and a separate semantics repair tool is not as effective as using PyDex. Second, we discuss important limitations.

Why Not Combine a State-of-the-Art Syntactic Fixer and Semantics Fixer to Repair Programs?
We investigated why BIFI+Refactory, which combines two state-of-the-art repair systems, produces repairs that (on average) have a larger token edit distance compared to PyDex. We found that in some cases, BIFI produces repairs by deleting a portion of the code snippet that contains the syntax errors. Although this is an effective way to deal with syntax errors, it makes repairing semantic errors harder by deleting parts that may capture crucial logic.
Below is one such example from our evaluation.The code snippet contains a syntax mistake in the last line.The parser complains that the "Expression cannot contain an assignment =".In particular, the student has written an equal (highlighted below in red) when they should have used a plus operator (which corresponds to the repair produced by PyDex).
However, BIFI produced a different fix by removing the second for loop (lines 8-9) completely. This deletion introduced challenges for Refactory in the later semantic repair phase. Although Refactory in the end successfully repaired this program, the repair it generated is syntactically equivalent to the reference solution and is effectively completely rewritten with respect to the original incorrect program.
Overall, our comparison between PyDex and BIFI+Refactory highlights the challenges in combining state-of-the-art syntax and semantics tools to repair incorrect introductory programming assignments. BIFI and Refactory each focus on their targets, syntactic bug repair and semantic bug repair, respectively, and combining them may result in unexpected performance. Additionally, combining BIFI and Refactory required non-trivial engineering effort (approximately 3 weeks of effort from one Python expert). This further motivates the need for a unified approach that can handle both types of bugs for introductory Python programmers.

Limitations
PyDex validates candidate repairs by comparing their execution results on the test suite with those of the reference program given by instructors. Validating program correctness through tests is not as strong as formal verification. However, to the best of our knowledge, the use of tests as a proxy for correctness is standard in the educational domain [Gulwani et al. 2018; Singh et al. 2013].
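A minimal sketch of this kind of differential check follows. It assumes that each assignment exposes a single function to call (`func_name` is a hypothetical parameter; real assignments may instead read input and print output), and it omits the sandboxing and timeouts that executing untrusted student code would need in practice.

```python
def run_function(source, func_name, args):
    """Execute source in a fresh namespace and call func_name(*args)."""
    namespace = {}
    try:
        exec(source, namespace)  # no sandbox or timeout: illustration only
        return ("ok", namespace[func_name](*args))
    except Exception as exc:  # any crash counts as a distinct outcome
        return ("error", type(exc).__name__)


def outcome_vector(candidate, reference, func_name, test_inputs):
    """One boolean per test: does the candidate agree with the reference?"""
    results = []
    for args in test_inputs:
        expected = run_function(reference, func_name, args)
        actual = run_function(candidate, func_name, args)
        results.append(expected[0] == "ok" and actual == expected)
    return results


def is_valid_repair(candidate, reference, func_name, test_inputs):
    return all(outcome_vector(candidate, reference, func_name, test_inputs))
```

A candidate that agrees with the reference on every listed input is accepted, which is exactly why this check is weaker than verification: behavior on inputs outside the test suite is never examined.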
We carried out our evaluation on one particular set of 286 student programs. The size of the dataset is on par with the literature on state-of-the-art automated program repair [Ahmed et al. 2022; Li et al. 2022b], but increasing the size of the evaluation dataset may provide additional insights and presents an opportunity for future work.
PyDex relies on an LLM, so it inherits its limitations. PyDex (like the LLM it uses) does not have a soundness or completeness guarantee. We also acknowledge that randomness is another limitation of PyDex; we sampled and picked the top repair candidates to mitigate its effect. These limitations might be addressed by requesting further information from the students, which we leave to future work.
Language requirement. We scoped PyDex to introductory Python assignments, as that is the only domain where we have a suitable dataset and can carry out an evaluation. Other education tools [Bhatia et al. 2018; Wang et al. 2018a] share this same limitation of focusing on one programming language. However, the principles behind the design of PyDex apply to programs written in other imperative languages, as conceptually none of our prompt engineering methods are language-specific. Applying PyDex to education assignments in these other languages would require reimplementing the chunking procedure and the test-case execution harness, and swapping the syntax oracle for the corresponding domain. For example, if we were to apply PyDex to Java programs, the chunker could rely on control-flow keywords (if, for, while), but indentation may no longer be meaningful; the execution harness could be replaced with JUnit, and the syntax oracle could be replaced with javac.

Data leakage. Data leakage is also a threat to validity in PyDex. PyDex is built on top of Codex, and Codex is trained on public internet data. To have a fair comparison, we only use a non-public dataset as our evaluation target to mitigate the data leakage problem (all our results are on this dataset). Using a public dataset could otherwise inflate performance. This limitation is unfortunately shared by all existing work that uses LLMs. Using a non-public dataset is our best effort, but we agree that data leakage cannot be completely avoided at this stage. For example, "determine if a string is a palindrome" is a question used in our evaluation, and we found that it is also one of the questions in the HumanEval dataset (the human-written dataset used to evaluate the original Codex model).
Moreover, for introductory programming assignments, PyDex provides unique value in that it can craft and customize the solution to the student's errors (i.e., smaller edit distance patches, as shown in our evaluation). Students can always search for reference solutions as repairs, but we observe that this is not a good option in practice because the differences between the buggy program and the reference program can be large, as we show in Section 2.

Why can students benefit pedagogically from a tool that automatically repairs their buggy programs? Automatically fixing students' submissions is not the same as providing an explanation for their mistakes. However, human feedback, in the form of a student-tailored corrected solution, represents a substantial time investment [Keuning et al. 2016; Singh et al. 2013]. Absent such time investment, students typically must rely on a reference solution. This is the starting motivation for employing automated repair in this context. In this way, PyDex provides a preferable alternative to comparing against a standard answer key. Furthermore, a repaired solution is often a starting point for more meaningful feedback. For example, Phung et al. [2023] produced a syntax repair and then used it to generate a natural language explanation of the error and the needed changes.

Runtimes of the different tools. We use tools with substantially different environments. PyDex relies on an API (so network time plays a role), BIFI requires GPU-based computing for inference, and GenProg runs on a CPU. Therefore, we did not compare runtimes, as they would be hard to interpret. More importantly, from our analysis, the lower repair rate of the baselines is due to their repair capability, not tool timeouts.

RELATED WORK
Automated Program Repair. The programming languages and software engineering community has a long history of developing tools for automatically repairing errors in buggy programs. Existing approaches have applied a variety of technical ideas, including program analysis [Mechtaev et al. 2018, 2016; Shariffdeen et al. 2021; Zhang et al. 2021], search-based techniques [Wong et al. 2021] like genetic programming [Kim et al. 2013; Le Goues et al. 2012; Qi et al. 2014], machine learning [Ahmed et al. 2021; Bhatia et al. 2018; Long et al. 2017; Long and Rinard 2016; Santolucito et al. 2022; Wang et al. 2018a; Zhang et al. 2020], and more recently LLMs [Xia and Zhang 2023a,b]. A particularly popular approach to APR consists of generating many program candidates, typically derived by performing syntactic transformations of the original buggy program, and then validating these candidates using a test suite as an oracle [Le Goues et al. 2012]. Similarly, PyDex uses a syntax oracle (the Python parser) and a semantic oracle (test cases) to validate the candidate programs it produces. However, state-of-the-art APR tools are limited to repairing either syntax [Joshi et al. 2023] or semantic errors [Fan et al. 2023], but not both. PyDex differs significantly from these existing tools by automatically repairing both syntax and semantic errors in buggy programs.
In addition, PyDex employs a large language model (Codex) as the main program transformation module and uses an ensemble of multi-modal prompts to improve its success rate. Therefore, PyDex is able to generate complex repairs, which are difficult to produce with existing traditional APR techniques [Le Goues et al. 2012; Long and Rinard 2015; Mechtaev et al. 2018, 2016; Qi et al. 2014; Xuan et al. 2017], which often focus on specific error types, are limited to a small number of edits, and repair specific statements (such as conditionals) exclusively. Moreover, PyDex targets students' incorrect submissions, rather than professional developers' production bugs or LLM-generated bugs [Chen et al. 2024; Fan et al. 2023]. As a result, PyDex has two additional requirements: 1) minimizing the size of the change made to allow students to better learn from the repaired program, and 2) reducing the engineering effort needed to run the APR tool.
AI for Programming Education. AI has been extensively applied to the domain of education [Finnie-Ansley et al. 2022; Li et al. 2022a]. Past programming education research has explored topics including feedback generation [Gulwani et al. 2018; Hu et al. 2021, 2019; Phung et al. 2023; Rolim et al. 2017; Singh et al. 2013; Song et al. 2021; Wang et al. 2018b; Zhang et al. 2023] and program repair [Dinella et al. 2020; Lu et al. 2021; Wang et al. 2018a; Xin and Reiss 2017; Yasunaga and Liang 2021; Yi et al. 2017]. PyDex is complementary to this work, showing that the task of program repair in this domain can be successfully tackled using an LLMC.
LLMs for Code Intelligence. Large pre-trained language models, such as OpenAI's Codex, Salesforce CodeGen [Nijkamp et al. 2023], and BigScience's BLOOM [Laurençon et al. [n. d.]], have been shown to be effective for a range of code intelligence tasks. For example, Microsoft's Copilot [cop 2024] builds on Codex to produce more effective single-line and multi-line code completion suggestions. Prior work has shown that such LLMs can also be used for repairing programs outside of the educational context [Dinella et al. 2022; Lian et al. 2023; Mao et al. 2023; Rahmani et al. 2021; Su et al. 2023; Verbruggen et al. 2021; Xiang et al. 2023; Zhang et al. 2022]. Using these models to perform code generation from informal specifications, such as natural language, has also been a topic of active research [Li et al. 2022a]. Similarly to this work, PyDex uses an LLM but is designed to focus on student programming, and as such our design decisions (e.g., reducing token edit distance) may not apply to other domains such as professional development.

CONCLUSION
We introduced an approach to repair syntactic and semantic mistakes in introductory Python assignments. At the core of our approach sits a large language model trained on code. We leverage multi-modal prompts, iterative querying, test-case-based few-shot selection, and program chunking to produce repairs. We implement our approach using Codex in a system called PyDex and evaluate it on real student programs. Our results show that our unified system PyDex can effectively repair real student programs, while producing smaller patches.

Fig. 4. An illustrated example of program chunking. Lines 3 and 4 have an indentation level of four, line 6 has an indentation level of two, and the rest of the lines have an indentation level of zero. Line 3 has the initial syntax error flagged by the interpreter. PyDex uses such indentation (along with control flow keywords) to heuristically extract program chunks for syntax repair.
Fig. 5. The syntax prompt generator produces prompts that can include the buggy program or the error message. We elide portions of the code fragments for brevity.

Fig. 7. An illustrative example of few-shot learning in PyDex. The incorrect program in the shot and the target buggy program have the same test suite execution [pass, fail].
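Test-case-based selection of few-shots can be sketched as a nearest-neighbor lookup over pass/fail vectors. The peer record layout (`buggy`, `fixed`, `vector`), the default `k`, and the Hamming-distance ranking are assumptions for illustration; the caption only states that the chosen shot has the same test suite execution as the target.

```python
def hamming(u, v):
    """Number of tests on which two pass/fail vectors disagree."""
    return sum(a != b for a, b in zip(u, v))


def select_few_shots(target_vector, peers, k=2):
    """Pick the k peers whose buggy programs fail most like the target.

    peers: list of dicts with keys 'buggy', 'fixed', and 'vector'
    (a hypothetical schema). Exact matches rank first because their
    Hamming distance is zero; ties keep the original peer order.
    """
    ranked = sorted(peers, key=lambda p: hamming(p["vector"], target_vector))
    return [(p["buggy"], p["fixed"]) for p in ranked[:k]]
```

Each returned pair would be placed in the prompt as a worked example: a peer's incorrect program followed by that peer's eventually correct solution.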

Fig. 8. PyDex ensembles multiple prompts by querying the LLMC and then relying on the (syntax and semantic) oracles to rule out invalid candidates. Ensembling complementary prompts outperforms any particular prompt.

[Fig. 3 diagram labels: Buggy Program; Syntax error message; Task description and test cases; Peer programs; Syntax Phase; Semantic Phase; Test cases; LLMC; Repaired Program]

Fig. 3. PyDex architecture. A buggy program first enters a syntax repair phase. In this phase, PyDex transforms the program using a program chunker, which performs a structure-based subsetting of code lines to narrow the focus for the LLMC. Multiple syntax-oriented prompts are generated using this subprogram and fed to an LLMC, and any patches are integrated into the original program. If any candidate satisfies the syntax oracle, it can move on to the semantic phase. In the semantic phase, PyDex leverages both the natural language description of the assignment and the instructor-provided test cases to create various prompts. In addition, if available, PyDex can use other peers' solutions as few-shots by selecting them using test-case-based selection to identify failures that resemble the current student's program, along with eventually correct solutions. Prompts are fed to the LLMC to generate candidates. If multiple candidates satisfy the test suite, PyDex returns the one with the smallest edit distance with respect to the original student program.
Algorithm 1 Chunker: extracting the code chunk that contains the error message

Table 1. Statistics of 286 syntax errors reported in the datasets.

Table 3. The first stage in the repair process is to fix syntax errors. PyDex can produce a syntactically valid candidate for all programs in our benchmark, compared to 80.07% for BIFI. On average, PyDex's repairs are also closer to the original program (edit distance of 5.46 versus 25.07).

Table 4. Chunking reduces the average token edit distance across all assignments. PG is short for performance gain. In the syntax stage, PyDex first extracts program chunks from the original buggy program as detailed in Section 4.

Table 5. PyDex performs iterative querying, splitting the repair procedure into a syntactic and a semantic phase. We find that this iterative approach raises the overall repair rate (RR) from 82.87% to 86.71% (without few-shots).