Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models

Identifying and resolving logic errors can be one of the most frustrating challenges for novice programmers. Unlike syntax errors, for which a compiler or interpreter can issue a message, logic errors can be subtle. In certain conditions, buggy code may even exhibit correct behavior; in other cases, the issue might be about how a problem statement has been interpreted. Such errors can be hard to spot when reading the code, and they can also at times be missed by automated tests. There is great educational potential in automatically detecting logic errors, especially when paired with suitable feedback for novices. Large language models (LLMs) have recently demonstrated surprising performance for a range of computing tasks, including generating and explaining code. These capabilities are closely linked to code syntax, which aligns with the next token prediction behavior of LLMs. On the other hand, logic errors relate to the runtime performance of code and thus may not be as well suited to analysis by LLMs. To explore this, we investigate the performance of two popular LLMs, GPT-3 and GPT-4, for detecting and providing a novice-friendly explanation of logic errors. We compare LLM performance with a large cohort of introductory computing students (n = 964) solving the same error detection task. Through a mixed-methods analysis of student and model responses, we observe significant improvement in logic error identification between the previous and current generation of LLMs, and find that both LLM generations significantly outperform students. We outline how such models could be integrated into computing education tools, and discuss their potential for supporting students when learning programming.


INTRODUCTION
Learning to program involves navigating a landscape where mistakes are an inherent part of the journey. Novice programmers are bound to encounter numerous errors when writing code, ranging from logic flaws and syntactical inaccuracies to runtime glitches. These mistakes pose substantial hurdles to students as they strive to develop their programming skills. Despite extensive efforts by computing education researchers and practitioners to establish taxonomies and recognize patterns of common programming errors [1,6,36,37,60], the process of effectively detecting and resolving bugs remains a persistent challenge.
Simultaneously, the emergence of large language models (LLMs) has demonstrated remarkable capabilities in understanding and generating text that is highly similar to the text generated by people. These models, trained on vast amounts of textual data, have been used in a variety of computing education contexts including helping students to understand code [29,34] and programming error messages [30]. These use cases demonstrate the ability of LLMs to understand the syntax and structure of code. Still, it is unclear whether models can reason about runtime performance without explicitly running the code. Therefore, detecting runtime errors may present a challenge for LLMs, limiting their potential to help learners.
In this paper, we conduct a large-scale comparative study that investigates the abilities of two LLMs and students to detect bugs in faulty code. We recruited 964 students in a large introductory C programming class to identify bugs in three code examples. The selected code examples contained three types of bugs including an out-of-bounds error, an expression error, and an operator error. Students were selected because they are increasingly relying on LLMs as a legitimate help-seeking resource [19,24,61]. Our results suggest that LLMs outperform students in bug detection performance, especially for faulty code. However, in addition to detecting the pre-inserted bugs, the LLMs had a tendency to be overly proactive, also commenting on extremely minor 'bugs' such as naming conventions, and other considerations that might be overwhelming if used for learning purposes. GPT-4 was nearly perfect at identifying bugs in faulty code, but was much more likely than GPT-3 to identify these minor 'bugs' in the correct programs and therefore performed 'poorly' on correct code. Studying correct code was important because students may use these tools when their code is mostly correct, and a list of minor errors may be demoralizing or may lead them off-track. Based on our findings, we conclude that LLMs appear to be capable of identifying logic errors, outperforming students at this task. However, additional work is needed to extend this work toward more complex code examples and with more advanced computing students. Given that experts are more likely to 'chunk' code and see emergent structure, it is unclear whether they would be more or less able to identify bugs in the code without writing test cases.

RELATED WORK
2.1 Students and Bugs in Code
Bugs and errors are a common feature in student code, and understanding the encountered problems and errors has been a longstanding endeavour within Computing Education Research. Early research in this area often centered on specific problems such as the looping problem or the rainfall problem [23,47,50,51], leading also toward investigations into the design and features of programming languages (e.g., [50]). In general, there are differences in the frequency of programming errors [52] and the time that it takes to fix those errors [7,11,37,49]. The types of errors that students encounter also gradually change [2], and they can stem from multiple sources [2,13]. These sources include misinterpreting the programming problem and having flaws in programming knowledge [13], not to mention the role of the programming language used [25].
When students encounter a problem, they need to resolve it. Resolving programming problems, or debugging, can be done using multiple approaches, including tracing code, commenting out code, and adding print statements [16,39,58]. Simply looking at the code and trying to find places that do not look right (i.e., pattern matching) can also be a viable strategy in some cases [16]. Like programming, finding problems in code by tracing the code is a skill, and both of them have been highlighted as something that students can struggle with. As an example, an ITiCSE working group from 2001 highlighted a lack of programming skills at the end of the introductory programming course [38], and a subsequent ITiCSE working group from 2004 followed up on the results by looking into students' ability to read and trace code [31], also highlighting problems. These issues have in part led to national and international efforts in understanding the struggles that students face, such as the BRACElet project that started in 2004 [57].
These studies tend to highlight that students have difficulties with tracing code [31,55], which might in part be explainable by lack of expertise. A student might, when solving a tracing problem, even just guess a solution if they do not have a higher-level reasoning strategy [31], or might simply have misconceptions about how a program executes, which in turn leads to faulty conclusions [55]. This possibility of guessing code tracing outcomes has also in part led to the emergence of "explain in plain English" problems. For these problems, students are expected to provide a high-level overview of the program's functionality and purpose rather than simply outlining what the program does [32,59]. These problems can also be challenging, and any tools that would help students learn to understand and explain code would be of benefit.

2.2 Generative AI and Computing Education
Recently, computing education researchers have expressed both concern and excitement about the ways that generative models may affect the computing education landscape [28,33,35,40,41,61]. While a strong consensus about how we should adapt our pedagogical practice has yet to emerge, each of these discussions acknowledges that generative models are not likely a passing fad.
Numerous examples of the capabilities of generative models are emerging, such as their ability to both solve and create programming assignments [15,43], explain code [29,34], identify programming concepts [54], answer multiple choice questions [45,46], write code [42,56], solve visual problems [20], and enhance programming error messages [30]. These use cases are critical because without understanding the capabilities of generative models, it is extremely challenging to adapt to this rapidly changing landscape.
However, limited work has investigated the capabilities of generative models to identify bugs within code. Given that novice programmers often encounter bugs and may lack the ability to identify and fix these bugs, it is important to explore the capabilities of generative models to accomplish this task. Very recent papers focus on enhancing programming error messages [30] and automatically repairing bugs in code [14,22,27]. In this paper, we add to the growing set of use cases by exploring the potential for generative models to identify potential bugs and errors.

METHOD
3.1 Research Questions
Previous research has demonstrated many impressive capabilities of large language models. However, many of these examples, such as generating explanations and identifying programming concepts, are closely linked to code syntax, which aligns with the next token prediction behavior of LLMs. To better explore the potential limits of LLMs, this study focuses on identifying logic errors in code, which relate to the runtime performance of code, and thus may not be as well suited to analysis by LLMs as they are unable to execute code. If large language models perform well in this task, there is an exciting opportunity to use these models to help students to debug their code. Based on these goals, we investigated the following research questions:
RQ 1: How do students and large language models compare in their ability to correctly identify logic errors in faulty code?
RQ 2: Which types of logic errors are easiest for students and large language models to correctly identify?
RQ 3: How many bugs or issues do students and large language models identify when reviewing faulty and correct code?

3.2 Study Design
In this study, we seek to investigate the performance of large language models in detecting bugs in faulty code. We conducted a study that compared the performance of students with the two large language models GPT-3 and GPT-4. Performance was measured across three code examples with four variants. These variants included the correct code and three variants with bugs introduced: 1) an operator error, 2) an out-of-bounds error, and 3) an expression error. The study was designed with two between-subjects components: the source of the detection method (i.e., whether it was performed by the students, GPT-3, or GPT-4) and the bug variant.
The study also included a within-subjects component, which was the three code examples. By showing students multiple examples, we could partially control for participant error.
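To make the three bug categories concrete, the following minimal C sketch shows how each might look in a zero-counting loop. The function names, the loop body, and in particular our reading of what an "expression error" looks like are illustrative assumptions; the code actually used in the study (Figure 1) may differ.

```c
/* Hypothetical reference version: counts zeros in values[0..length-1]. */
int count_zeros_correct(const int *values, int length) {
    int count = 0;
    for (int i = 0; i < length; i++) {
        if (values[i] == 0) {
            count++;
        }
    }
    return count;
}

/* Operator error (assumed form): '!=' used where '==' was intended. */
int count_zeros_operator_bug(const int *values, int length) {
    int count = 0;
    for (int i = 0; i < length; i++) {
        if (values[i] != 0) {
            count++;
        }
    }
    return count;
}

/* Out-of-bounds error (assumed form): '<=' also reads values[length].
   Only call this when the underlying array has at least length + 1 elements. */
int count_zeros_out_of_bounds_bug(const int *values, int length) {
    int count = 0;
    for (int i = 0; i <= length; i++) {
        if (values[i] == 0) {
            count++;
        }
    }
    return count;
}

/* Expression error (assumed form): adds the element instead of 1. */
int count_zeros_expression_bug(const int *values, int length) {
    int count = 0;
    for (int i = 0; i < length; i++) {
        if (values[i] == 0) {
            count = count + values[i];
        }
    }
    return count;
}
```

Note that for an input such as {0, 3, 0, 7} the operator variant happens to return the same value as the correct version, which illustrates why logic errors can slip past a small set of tests.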

3.2.1 Participants, Data Collection, and Ethics.
The data used in this study were collected from a first-year C programming course at The University of Anonymous. The data were collected during a single lab session that ran over a one-week period. Leading up to this lab, the course covered the concepts of arithmetic, types, functions, loops, and arrays. We collected 964 total complete responses from students. The data collection followed the ethical guidelines of the university and was approved by the ethics review board (see footnote 1).

3.2.2 Study Tasks.
As part of the lab, students were shown three code examples. Figure 1 shows the three examples that were shown to students during the lab. Each example contains a function with a single loop that processes elements of an array. The task for the students was to identify any bugs that might exist within the code. The instructions said "Consider the following definition of a function called <Function Name>:" which was followed by the code without comments. They were then asked to come up with a short description of what they believe the intended purpose of the function to be. This was followed by having them "List all errors, if any, found in this code based on your explanation of the purpose of the function. It is possible that the code contains one or more small errors (however, this is not necessarily true and the code may be correct). If you can identify any errors in the implementation of the code, you should describe these errors."
1 IRB approval number anonymized for review.

3.2.3 Measures.
The data collection resulted in 2980 total responses from students. In addition, 30 LLM responses were generated for each code example and version pair by varying the temperature and prompt to account for variations that might affect performance. This resulted in 720 total additional responses from the two models.
A team of four researchers manually coded each student and model response. The coders evaluated the correctness of the identified bug as a dichotomous variable (i.e., correct or incorrect). The coders also evaluated the number of bugs that the response contained. The coding was mutually exclusive: a response correctly identifying a bug but also noting other incorrect bugs was coded as correct. When coding the example that did not contain bugs, we coded a blank response or an explicit statement that no bugs were contained as a correct response, and other responses were considered incorrect. This coding scheme did not allow for explicitly tracking false positives and false negatives, but it was necessary to obtain substantial inter-rater reliability (κ = 0.873, 30 ratings). Students often did not explicitly state the bug, so we coded their response as 'correct' even if they only provided a solution that would fix the expected bug.
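The decision rule the coders applied can be summarized in a few lines of C. This is only a sketch of the rule as described above; the coding itself was done manually, and the struct, its field names, and the function name are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical summary of one coded response. */
typedef struct {
    bool is_blank;              /* nothing was written */
    bool states_no_bugs;        /* explicitly says the code has no bugs */
    bool identifies_seeded_bug; /* describes the seeded bug or gives a fix for it */
    int  other_issues_reported; /* extra issues; these do not change the code */
} Response;

/* Returns true when the response would be coded as 'correct'. */
bool coded_as_correct(Response r, bool code_has_seeded_bug) {
    if (code_has_seeded_bug) {
        /* Extra, incorrect bug reports do not make the response incorrect. */
        return r.identifies_seeded_bug;
    }
    /* Correct code: only a blank or an explicit 'no bugs' counts as correct. */
    return r.is_blank || r.states_no_bugs;
}
```

Under this rule, a response that lists only stylistic issues for the correct variant is coded as incorrect, which is why false positives cannot be separated from other incorrect responses.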

3.2.4 Analysis for Conditional Differences.
We analyzed the dependent measures (e.g., number of bugs) using a linear mixed-effects model. The main fixed factors of interest were the "Source" (representing GPT-3, GPT-4, or Students) and the "Version" of the code example (representing different versions of the example). Additionally, an interaction term between "Source" and "Version" was included to examine potential differences in bug identification across sources and versions. To account for potential dependencies among observations from the same example, a random intercept term was included in the model specification. This random effect was nested within the "Code Example" factor, capturing the variability associated with different examples. Pairwise comparisons were made using the Tukey method with Holm's correction for multiple comparisons.

3.3 Models
3.3.1 Model Specification. To automatically identify the bugs in the study, we used two large language models [8] developed by OpenAI. The first model, text-davinci-003, had been widely used up to the time of running the study. Later, when GPT-4 was released, we included results using the gpt-4-0314 model to understand how state-of-the-art models perform at the same task.

3.3.2 Prompt Engineering. Prompt engineering is a process of developing instructions to guide the responses of an LLM. The specificity and phrasing of these prompts have the potential to strongly influence the content and quality of the responses [3,48,62]. Understanding the potential effects that prompts can have on performance, we used multiple prompting strategies to account for this aspect. In addition, the hyperparameters of an LLM, such as the temperature, can also affect the output. Lower temperatures tend to result in more deterministic responses while higher temperatures tend to provide more 'creative' responses. We chose to use the default temperature of 0.7 and a lower temperature of 0.3. The three prompts used for this study are listed below.
• # List all errors and bugs, if any, found in the following C code: <code>
• # List any issues, including bugs, errors, or potential problems that exist in the following C code: <code>
• # Assume the role of a highly intelligent computer scientist who is capable of easily finding bugs and errors by reading source code. List all errors and bugs, if any, found in the following C code: <code>
Between the variations in prompt and temperature, there were 6 possible combinations. For each combination, we issued 5 requests to the OpenAI API. The reason for issuing 5 requests was to account for the non-deterministic nature of LLM responses. This resulted in 30 responses for each combination of code example and bug type, and 360 requests per model.
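The response counts reported above follow from multiplying the factors of the design. The short C sketch below enumerates the request plan; the constant and function names are ours, and no API call is made (the numbers are taken from the text: 3 prompts, 2 temperatures, 5 repetitions, 3 code examples, and 4 versions).

```c
#include <stdio.h>

#define NUM_PROMPTS 3
#define NUM_TEMPERATURES 2
#define REQUESTS_PER_SETTING 5
#define NUM_EXAMPLES 3
#define NUM_VERSIONS 4 /* correct code plus three buggy variants */

/* Responses generated for one code example and version pair. */
int responses_per_example_version(void) {
    return NUM_PROMPTS * NUM_TEMPERATURES * REQUESTS_PER_SETTING;
}

/* Responses generated by one model across all examples and versions. */
int responses_per_model(void) {
    return responses_per_example_version() * NUM_EXAMPLES * NUM_VERSIONS;
}

/* Prints every (prompt, temperature, repetition) setting for one pair. */
void print_request_plan(void) {
    const double temperatures[NUM_TEMPERATURES] = {0.7, 0.3};
    for (int p = 0; p < NUM_PROMPTS; p++) {
        for (int t = 0; t < NUM_TEMPERATURES; t++) {
            for (int r = 0; r < REQUESTS_PER_SETTING; r++) {
                printf("prompt %d, temperature %.1f, repetition %d\n",
                       p + 1, temperatures[t], r + 1);
            }
        }
    }
}
```

With two models, the per-model total doubles to the 720 model responses reported in the measures.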

RESULTS
4.1 Bug Detection Performance
Performance in bug detection rates varied between the students and the models, as shown in Table 1. GPT-3 exhibited an overall correctness rate of 85.3%, while GPT-4 closely followed with a correctness rate of 85.0%. Notably, students had a much lower correctness rate of 49.1%. While both models detected bugs at nearly twice the rate of students, the gap was even larger when only considering performance on faulty code.
4.1.1 For faulty code, LLMs outperform students. When presented with incorrect code, GPT-3 exhibited a bug detection rate of 87.3%, demonstrating a substantial ability to identify coding errors. GPT-4 surpassed this performance with a bug detection rate of 99.2%, indicating a higher sensitivity to bugs within faulty code. Students, on the other hand, detected bugs at a rate of 34.5%, showing limited proficiency in detecting coding errors.
4.1.2 LLMs tended to identify bugs in correct code. When presented with correctly functioning code, GPT-3 achieved a correctness rate of 79.4% (i.e., it classified the code as bug-free). GPT-4, however, displayed a comparatively lower rate of 42.2% in correctly identifying bug-free code. In contrast, students demonstrated a notably high proficiency in identifying correct code, with a correctness rate of 92.8%.

4.3 Analyzing the Bug Reports
4.3.1 GPT-4 was more verbose, even when normalized by the number of bugs detected. We computed the average word count for responses made by students and each model. GPT-4 responses had on average 129.0 (σ = 44.7) words, followed by GPT-3 and students with 54.2 (σ = 19.5) and 38.9 (σ = 27.0) words respectively. This constitutes a 3.31-fold increase in the number of words GPT-4 produced compared to students. Given the differences in number of bugs identified by source, we normalized word count by the number of bugs reported. This resulted in 52.7 (σ = 25.7) words for GPT-4, and 23.5 (σ = 9.11) and 35.6 (σ = 24.5) words for GPT-3 and students. These results should be contextualized by the observation that GPT-4 had a tendency to provide partial and in a few cases complete solutions for the bugs that it identified. Moreover, the models exhibited a more concentrated distribution around their means, while student responses exhibited notably higher variability.

4.3.2 Qualitative analysis of responses.
To better understand the capabilities of large language models in detecting bugs in code and to draw distinctions between students' responses and these models, we did a qualitative exploration of the error messages. By analyzing selected bug reports generated by the models in our study, we could shed light on a variety of shortcomings of the models' approaches. As suggested by our analysis of word count, there appear to be differences in the ways that LLMs and students identify and describe bugs. Students tended to describe the bug, offer a solution, or both. We did not observe an instance where a model offered a solution without also describing the underlying bug. Models often combined a description of the bug with a partial solution; in some cases they provided a complete solution. GPT-4 appeared to be most likely to offer a complete solution. As noted in the analysis of bug detection performance, LLMs often identified bugs even for 'correct' code. Below is an example where GPT-4 identified 6 bugs in otherwise correct code for Example 2:
(1) No function prototype: There is no function prototype provided for the CountZeros function. It is a good practice to provide a function prototype before the function definition to avoid any potential issues with implicit declarations.
(2) No include statements: The code does not include any necessary header files, such as stdio.h or stdlib.h, which might be required for certain operations or functions used in the program.
It should be noted that most of these identified 'bugs' could be interpreted as areas for improvement. For example, the model suggests improvements to naming conventions or including comments. However, both the fourth and sixth bugs could actually result in runtime issues. The model identifies a lack of input sanitization checks for two potential corner cases. The qualitative investigation underscores the intricacies of LLMs' code interpretations. They exhibit detailed understanding but can occasionally miss the mark on context or offer feedback that, while technically correct, might feel misaligned with the coder's intent. Balancing LLM insights with human discernment could yield the most effective outcomes.
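The fourth item in the GPT-4 response (no error handling for NULL pointers or a negative array length) can be illustrated with a defensive variant of a zero-counting function. This is a sketch only: the function name, the parameter names, and the choice of -1 as an error value are assumptions, not the study's code or the model's proposed fix.

```c
#include <stddef.h>

/* Hypothetical defensive variant: returns -1 for invalid arguments,
   otherwise the number of zeros in values[0..length-1]. */
int count_zeros_checked(const int *values, int length) {
    if (values == NULL || length < 0) {
        return -1; /* assumed error convention */
    }
    int count = 0;
    for (int i = 0; i < length; i++) {
        if (values[i] == 0) {
            count++;
        }
    }
    return count;
}
```

Whether such checks belong in an introductory exercise is a design choice; for a novice whose loop is already correct, they may read as noise rather than as a bug report.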
Many student responses just contained the proposed change without any explanation or reasoning. They often did not explicitly define a bug but instead only described the solution. Some students also indicated errors that either did not fix the issue, introduced new problems, or focused too heavily on syntactical correctness without addressing the core problem. In the example below, a student highlighted changes that should be made to the code which do not fix the bug:
• i = 0; should change to i = 1; to avoid using the 0th value.
• Instead of count++, use count = count + 1;.
• There should be no space between for and the opening parenthesis (.
• Similarly, there should be no space between if and the opening parenthesis (.
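A small sketch shows why the first two suggestions do not help. The function below is a hypothetical zero-counting loop, not the code from the study; the start index is exposed as a parameter only so that the two loop initializations can be compared.

```c
/* Hypothetical loop: counts zeros in values[start..length-1]. */
int count_zeros_from(const int *values, int length, int start) {
    int count = 0;
    for (int i = start; i < length; i++) {
        if (values[i] == 0) {
            count = count + 1; /* has the same effect as count++ */
        }
    }
    return count;
}
```

Starting at index 1 silently skips the first element, so an array such as {0, 5, 0} yields 1 instead of 2, while replacing count++ with count = count + 1 changes nothing. The remaining two suggestions concern whitespace, which the C compiler ignores.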

DISCUSSION
Our results suggest that large language models are more capable than students at identifying bugs in code. There are multiple possible explanations for this. First, more expert programmers often do not need to read the code character by character or word by word when forming an understanding of the code; rather, they study features of the code that are relevant to the task at hand [18]. Consequently, a student may miss syntax errors or minor bugs if they are not in focus. This can also be explained by the happy path mentality where, because most of the code is correct, students may become complacent and fail to detect bugs; some bugs also take more time to identify and fix than others [11]. Participants were explicitly prompted to find errors, which puts them into an explicit debugging mindset. In practice, they might not critically examine their code with the same scrutiny, so the bug detection rate for students may actually be even lower in practice. Both LLMs performed extremely well, with GPT-4 performing near-perfectly when presented with buggy code. However, both models performed poorly in our analysis of correct code, as they identified very minor bugs and stylistic aspects such as naming conventions, contrary to our expectation that they would classify the code as bug-free. While the suggestions were largely correct, it might not be helpful to point out minor bugs and code conventions in otherwise correct code, especially considering students' preferences for concise bug reports [12].
One noticeable difference between GPT-3 and GPT-4 was that GPT-4 would point out these minor bugs more than GPT-3. One possible explanation for this is that the newer model has possibly had more instruction fine-tuning, where the model is trained to follow instructions from the user. This might cause the model to try to please the user by going above and beyond the ask, e.g., in our case not only pointing out the obvious bug, but also commenting on more minor issues. We also found that GPT-4 was more verbose, even when controlling for the number of bugs in the code. This aligns with prior findings where newer models often add superfluous textual content to responses [10] and may come up with non-existing bugs to fix when asked to help with buggy code [19].
The ability of LLMs to correctly identify bugs at a much higher rate than students has exciting implications for computing education. LLMs could be used to help novices (and more experienced programmers too) in detecting bugs in code, for example, by having LLMs integrated directly into the IDE that students use to work on their course exercises. Models could make suggestions for improvement as they did in cases with correct code or identify subtle logic errors in the code, potentially building on prior research on improving programming error messages, which has the promise of improving learning [5,12]. Despite the allure of the technological possibilities, there likely should be a mechanism that would control how often the suggestions would be shown, as not all errors require help [19]. Similarly, it is important to carefully curate educational content, especially with growing concerns about over-reliance on LLMs [9,28,33,56,61]. To mitigate potential issues, it is likely preferable to avoid directly presenting errors and solutions to students. Instead, pedagogical systems could detect when students are spinning their wheels trying to debug their code [4] and then use the LLM to scaffold students toward identifying the error themselves, thus providing learning opportunities that also mitigate stress associated with debugging.
Similarly, as LLMs are adept at detecting bugs and writing suggestions on how to fix them, they could be further integrated into teacher tools. As an example, tools such as OverCode [17] and CodeClusters [26] that are designed to provide feedback to masses of students could be integrated with LLMs so that LLMs would create draft feedback, which instructors could then adjust, when needed, and send out. The ability of LLMs to identify rare corner cases also has interesting implications for teaching testing, as feedback from LLMs could help with writing more comprehensive test suites. The good performance of the models could also lead to new, innovative exercise types. For example, we envision that an LLM could create buggy code where students would need to find and fix the bug; similarly, one activity could be trying to create bugs that LLMs fail to identify. Such activities could also provide additional data on learning, which then could be used to fine-tune LLMs.
As the educational landscape continues to adapt to LLMs [28,40,41,61], the bug detection capabilities of LLMs identified in this paper may further inform how students seek help in classroom settings [21].

5.1 Limitations
To make the task more ecologically valid, we provided students with an open-response question rather than a multiple-choice question. This had the advantage that students could not guess the right answer and was more similar to how students would encounter code in the wild; however, it became difficult to differentiate between a response that explicitly stated 'no bugs' and a blank response. To address this limitation, we evaluated the rates of default responses by variant and observed no statistically significant difference in the number of default responses across all four variants.
Participants were asked to identify any bugs that were present within the code, so in this case, a lack of an explicit response was treated as a default response (i.e., 'no bugs'). To assess the impact on our results, we recalculated percentages by excluding blanks. The revised student correctness rates are as follows: 79.6% (133 blanks removed), 89.8% (169 blanks removed), and 60.0% (181 blanks removed). These results represent a conservative estimate, considering only explicitly stated correct answers. The resulting rates remained higher than GPT-4, but closer to GPT-3 correctness rates.
Participants were also explicitly instructed to identify bugs as part of the lab activity. While prior research has demonstrated that debugging others' code can be challenging [58], it is possible that if students were studying their own code, it might have been easier for them. Relatedly, the code did not have comments that would explain what each line of code does. This may align with code students often encounter naturally, but could have affected the students' performance or required the model to infer too much from the code structure and function name.
The code examples used in this study only contained a single intended error. It is possible that the presence of multiple bugs in code might affect the performance of LLMs (and students) in detecting bugs. The goal for this paper was an initial tightly scoped investigation of identifying a bug within code. Future work will investigate cases where multiple bugs are included.
In our study, we employed a robust approach by utilizing three distinct prompts, leveraging multiple models, including both GPT-3 and GPT-4, and exploring two temperatures (i.e., 0.3 and 0.7). Additionally, each prompt was issued multiple times to accommodate the inherent probabilistic nature of generative AI. While we acknowledge the potential impact of further prompt optimization on mitigating false positives in the correct code condition, it is essential to note the dynamic nature of these models, characterized by continuous changes in verbosity and performance [44,45,53]. Rather than providing a definitive characterization of performance, our primary objective was to delve into a novel capability of LLMs.

Figure 1: The three examples of code with the correct variant, and the three incorrect variants annotated. The incorrect variants were 1) operator error, 2) expression error, and 3) out-of-bounds error.

(3) Inconsistent naming convention: The function name COUNTZEROS is mentioned in the question, but the actual function name in the code is CountZeros. It is important to maintain a consistent naming convention throughout the code.
(4) No error handling: The function does not perform any error handling, such as checking for NULL pointers or negative array length. This could lead to undefined behavior if the function is called with invalid arguments.
(5) No comments: The code lacks comments explaining the purpose of the function and its parameters. Adding comments can make the code more readable and easier to understand for others.
(6) Potential integer overflow: If the length parameter is very large

Table 1: A summary of student and model performance in correctly identifying bugs. In instances where coders were unsure, they coded it as 'uncertain'. These are excluded from the table. For instance, if we consider the number of correct and incorrect responses for GPT-3 in Code Example 1 with Bug 2, their total does not sum to 30.
[Table body not recovered. Columns for each of the three code examples: Bug 1, Bug 2, Bug 3, Correct.]

Table 2: Number of bugs detected by condition