Validating AI-Generated Code with Live Programming

AI-powered programming assistants are increasingly gaining popularity, with GitHub Copilot alone used by over a million developers worldwide. These tools are far from perfect, however, producing code suggestions that may be incorrect in subtle ways. As a result, developers face a new challenge: validating AI’s suggestions. This paper explores whether Live Programming (LP), a continuous display of a program’s runtime values, can help address this challenge. To answer this question, we built a Python editor that combines an AI-powered programming assistant with an existing LP environment. Using this environment in a between-subjects study (N = 17), we found that by lowering the cost of validation by execution, LP can mitigate over- and under-reliance on AI-generated programs and reduce the cognitive load of validation for certain types of tasks.


INTRODUCTION
Recent advances in large language models have given rise to AI-powered code suggestion tools like GitHub Copilot [12], Amazon CodeWhisperer [1], and ChatGPT [23]. These AI programming assistants are changing the face of software development, automating many of the traditional programming tasks, but at the same time introducing new tasks into the developer's workflow, such as prompting the assistant and reviewing its suggestions [2,22]. Development environments have some catching up to do in order to provide adequate tool support for these new tasks.
In this paper, we focus on the task of validating AI-generated code, i.e., deciding whether it matches the programmer's intent. Recent studies show that validation is a bottleneck for AI-assisted programming: according to Mozannar et al. [22], it is the single most prevalent activity when using AI code assistants, and other studies [3,21,32,36] report programmers having trouble evaluating the correctness of AI-generated code. Faced with difficulties in validation, programmers tend either to under-rely on the assistant (i.e., lose trust in it) or to over-rely on it (i.e., blindly accept its suggestions) [27,30,34,37]; the former can cause them to abandon the assistant altogether [2], while the latter can introduce bugs and security vulnerabilities [26]. These findings motivate the need for better validation support in AI-assisted programming environments.
This paper investigates the use of Live Programming (LP) [13,31,35] as a way to support the validation of AI-generated code. LP environments, such as Projection Boxes [20], visualize runtime values of a program in real time without any extra effort on the part of the programmer. We hypothesize that these environments are a good fit for validation, since LP has been shown to encourage more frequent testing [4] and facilitate bug finding [41] and program comprehension [5,7,8]. On the other hand, validation of AI-generated code is a new and unexplored domain in program comprehension that comes with its own unique challenges, such as multiple AI suggestions for the programmer to choose from, and frequent context switches between prompting, validation, and code authoring [22], which cause additional cognitive load [36]. Hence, the application of LP to the validation setting warrants a separate investigation.
To this end, we constructed a Python environment that combines an existing LP environment [20] with an AI assistant similar to Copilot's multi-suggestion pane. Using this environment, we conducted a between-subjects experiment (N = 17) to evaluate how the availability of LP affects users' effectiveness and cognitive load in validating AI suggestions. Our study shows that Live Programming facilitates validation by lowering the cost of inspecting runtime values; as a result, participants were more successful in evaluating the correctness of AI suggestions and experienced lower cognitive load in certain types of tasks.

RELATED WORK
Validation of AI-Generated Code. A rapidly growing body of work analyzes how users interact with AI programming assistants. Studies show that programmers spend a significant proportion of their time validating AI suggestions [2,3,22]. Moreover, a large-scale survey [21] indicates that 23% of its respondents have trouble evaluating the correctness of generated code, which echoes the findings of lab studies [2,32] and a need-finding study [36], where participants report difficulties understanding AI suggestions and express a desire for better validation support. Barke et al. [2] and Liang et al. [21] find that programmers use an array of validation strategies, and that the prevalence of each strategy is closely related to its time cost.
Specifically, despite the execution facilities that IDEs provide for validating AI suggestions [30], execution is used less often than quick manual inspection or type checking because it is more time-consuming [2,21] and interrupts programmers' workflows [36]. The lack of validation support designed for AI-assisted programming, as Wang et al. [36] identify, leads to higher cognitive load when reviewing suggestions. The high cost of validating AI suggestions, according to some studies [27,34,37], can lead to both under-reliance (lack of trust) and over-reliance (uncritically accepting wrong code) on the part of the programmer.
Comparatively few existing papers explore interface designs to support the validation of AI-generated code: Ross et al. [27] investigate a conversational assistant that allows programmers to ask questions about the code, while Vasconcelos et al. [33] target over-reliance by highlighting parts of generated code that might need human intervention; our work is complementary to these efforts in that it focuses on facilitating validation by execution.
Validation in Program Synthesis. Another line of related work concerns the validation of code generated by search-based (non-AI-powered) program synthesizers. Several synthesizers help users validate generated code by proactively displaying its outputs [9,16,40] and intermediate trace values [25], although none of them use an LP environment. The only system we are aware of that combines LP and program synthesis is SnipPy [11], but it uses LP to help the user specify their intent rather than validate synthesized code.
Live Programming. Live Programming (LP) provides immediate feedback on code edits, often in the form of visualizations of the runtime state [13,31,35]. Some quantitative studies find that programmers with LP find more bugs [41], fix bugs faster [18], and test their programs more often [4]. Others find no effect on knowledge gain [15] or efficiency in code understanding [5]. Still, qualitative evidence points to the helpfulness of LP for program comprehension [5,7,8] and debugging [15,17]. In contrast to these studies, which evaluate the effectiveness of LP for comprehending and debugging human-written code, our work investigates its effectiveness for validating AI-generated code, a setting that comes with a number of previously unexplored challenges [22,36].

LEAP: THE TOOL USED IN THE STUDY
To study how Live Programming affects the validation of AI-generated code, we implemented Leap (Live Exploration of AI-Generated Programs), a Python environment that combines an AI assistant with LP. This section demonstrates Leap via a usage example and discusses its implementation.

Example Usage. Naomi, a biologist, is analyzing some genome sequencing data using Python. As part of her analysis, she needs to find the most common bigram (i.e., two-letter sequence) in a DNA strand. To this end, she creates a function dominant_bigram (line 3 in Fig. 1); she has a general idea of what this function might look like, but she decides to use Leap to help translate her idea into code.
Naomi adds a docstring (line 5), which conveys her intent in natural language, and a test case (line 24), which will help her validate the code. With the cursor positioned at line 7, she presses the keyboard shortcut to ask for suggestions.
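To make the setup concrete, the editor contents at this point might look roughly like the following sketch; the exact docstring wording, test input, and line positions shown in Fig. 1 may differ.

    # Sketch of Naomi's starting point: a function stub whose docstring serves
    # as the natural-language prompt, plus a test case for validation.
    # The DNA string below is a made-up example, not the one from the study.
    def dominant_bigram(dna):
        """Return the most common bigram (two-letter sequence) in dna."""
        # Naomi positions the cursor here and asks Leap for suggestions.

    print(dominant_bigram("agttagtt"))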
Within seconds, a panel opens on the right containing five AI-generated code suggestions; Naomi quickly skims through all of them. The overall shape of Suggestion 3 looks most similar to what she has in mind: it first collects the counts of all bigrams into a dictionary, and then iterates through the dictionary to pick a bigram with the maximum count. Naomi tries this suggestion by pressing its Preview button; Leap inserts the code into the editor and highlights it (lines 8-18). As soon as the suggestion is inserted, Projection Boxes [20] appear, showing runtime information at each line of the code.

Inspecting intermediate values helps Naomi understand what the code is doing step by step. When she gets to line 18, she realizes that the dictionary actually has two dominant bigrams with the same count, and the code returns the last one. This is not what she wants: instead, she wants to select the dominant bigram that comes first alphabetically (ag in this case).
One option Naomi has is to try other suggestions. She clicks on the Preview button for Suggestion 2; Leap then inserts Suggestion 2 into the editor, in place of the prior suggestion, and the Projection Boxes update instantly to show its behavior. Naomi immediately notices that Suggestion 2 throws an exception inside the second loop, so she abandons it and goes back to Suggestion 3, which got her closer to her goal.
To fix Suggestion 3, Naomi realizes that she can accumulate all dominant bigrams in a list, sort the list, and return the first element. She does not remember the exact Python syntax for sorting a list, so she tries different variations, including l = l.sort, l = l.sort(), l = sort(l), l = l.sorted(), and so on. Fortunately, Leap's support for LP gives her instant feedback on the behavior of each edit, so she iterates quickly to find a correct option: l = sorted(l). Note that Naomi's workflow for using Suggestion 3 (validation, finding bugs, and fixing bugs) relies on full LP support, and would not work in traditional environments like computational notebooks, which provide easy access to the final output of a snippet but not to intermediate values or immediate feedback on edits.

Implementation. To generate code suggestions, Leap uses the text-davinci-003 model [24], the largest publicly available code-generating model at the time of our study. To support the live display of runtime values (Fig. 1), we built Leap on top of Projection Boxes, a state-of-the-art LP environment for Python [20] capable of running in the browser. The code for Leap can be found at https://bit.ly/leap-code. As the control condition for our study, we also created a version of Leap where Projection Boxes are disabled; instead, the user can run the code explicitly by clicking a Run button and see the output in a terminal-like Output Panel.
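Returning to the running example: the repaired version of Suggestion 3 that Naomi arrives at might look roughly like the sketch below. This is an illustration only; the actual suggestion text and test input in the study materials differ.

    # A sketch of Suggestion 3 after Naomi's fix: collect all bigrams with the
    # maximum count, sort them, and return the alphabetically first one.
    def dominant_bigram(dna):
        """Return the most common bigram in dna, resolving ties alphabetically."""
        counts = {}
        for i in range(len(dna) - 1):
            bigram = dna[i:i + 2]
            counts[bigram] = counts.get(bigram, 0) + 1
        max_count = max(counts.values())
        # Naomi's edit: gather the tied bigrams into a list and sort it.
        l = [b for b, c in counts.items() if c == max_count]
        l = sorted(l)
        return l[0]

    # "agttagtt" (a made-up input) has three bigrams tied at two occurrences
    # each ("ag", "gt", "tt"); the alphabetically first one is returned.
    print(dominant_bigram("agttagtt"))  # prints "ag"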

USER STUDY
We conducted a between-subjects study to answer the following research questions: RQ1) How does Live Programming affect over- and under-reliance in validating AI-generated code? RQ2) How does Live Programming affect validation strategies? RQ3) How does Live Programming affect the cognitive load of validating AI-generated code?
Tasks. Our study incorporates two categories of programming tasks: Fixed-Prompt and Open-Prompt tasks.
In Fixed-Prompt tasks, we provide participants with a fixed set of five AI suggestions that are intended to solve the entire problem. We curated the suggestions by querying Copilot [12] and Leap with slight variations of the prompt. Fixed-Prompt tasks isolate the effects of Live Programming on validation behavior by controlling for the quality of suggestions. We created two Fixed-Prompt tasks, each with five suggestions: (T1) Bigram: find the most frequent bigram in a given string, resolving ties alphabetically (the same task as in Sec. 3); (T2) Pandas: given a pandas data frame with data on dogs of three size categories (small, medium, and large), compute various statistics, imputing missing values with the mean of the appropriate category. These tasks represent two distinct styles: Bigram is a purely algorithmic task, while Pandas focuses on using a complex API. Pandas has two correct AI suggestions (out of five), while Bigram has none, a realistic scenario that programmers encounter with imperfect models.
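To give a sense of the Pandas task, the following sketch shows the kind of group-mean imputation it calls for; the column names and values here are hypothetical, and the study's data frame and requested statistics may differ.

    import pandas as pd

    # Hypothetical data: dog weights with some missing values per size category.
    df = pd.DataFrame({
        "size":   ["small", "small", "medium", "medium", "large", "large"],
        "weight": [4.0, None, 12.0, 14.0, None, 30.0],
    })

    # Impute each missing weight with the mean of its size category...
    df["weight"] = df.groupby("size")["weight"].transform(
        lambda s: s.fillna(s.mean()))

    # ...and then compute per-category statistics.
    print(df.groupby("size")["weight"].agg(["mean", "min", "max"]))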
In Open-Prompt tasks, participants are free to invoke the AI assistant however they want. This task design is less controlled than Fixed-Prompt, but more realistic, thus increasing ecological validity. We used two Open-Prompt tasks: (T3) String Rewriting: parse a set of string transformation rules and apply them five times to a string; (T4) Box Plot: given a pandas data frame containing 10 experiment data records, create a matplotlib box plot of time values for each group, combined with a color-coded scatter plot. Both tasks are more complex than the Fixed-Prompt tasks and could not be solved with a single interaction with the AI assistant.

Participants and Groups. We recruited 17 participants; 5 self-identified as women, 10 as men, and 2 chose not to disclose. 6 were undergraduate students, 9 were graduate students, and 2 were professional engineers. Participants self-reported their experience with Python and AI assistants: 2 participants used Python 'occasionally', 8 'regularly', and 7 'almost every day'; 7 participants declared they had 'never' used AI assistants, and 8 had used such tools 'occasionally'.
There were two experimental groups: "LP" participants used Leap with Projection Boxes, as described in Fig. 1; "No-LP" participants used Leap without Projection Boxes, instead executing programs via a terminal-like Output Panel. Participants completed both Fixed-Prompt tasks and one Open-Prompt task. We used block randomization [10] to assign participants to groups while evenly distributing them across task order and selection, and balancing experience with Python and AI assistants across groups. The LP group had 8 participants, and the No-LP group had 9.

Procedure and Data. We conducted the study over Zoom, with each participant using Leap in their web browser. Each session was recorded and included two Fixed-Prompt tasks (10 minutes each), two post-task surveys, one Open-Prompt task (untimed), one post-study survey, and a semi-structured interview. A replication package shows the details of our procedure, tasks, and data collection.
For quantitative analysis, we performed closed-coding on video recordings of study sessions to determine each participant's subjective assessment of their success on the task; we matched this data against the objective correctness of their final code to establish whether they succeeded in accurately validating AI suggestions.
We also measured examination time, i.e., the proportion of task duration during which the Suggestion Panel (Fig. 1) was in focus, and participants' cognitive load (via five NASA Task Load Index (TLX) questions [14]). We used Mann-Whitney U tests to assess all differences except for validation success, which we analyzed via Fisher's exact tests.
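As an illustration of this analysis (with made-up numbers, not our study data), these two kinds of tests can be run with SciPy as follows:

    from scipy.stats import mannwhitneyu, fisher_exact

    # Hypothetical TLX ratings for the LP and No-LP groups.
    lp_ratings = [2, 3, 2, 4, 3, 2, 3, 4]
    no_lp_ratings = [5, 4, 6, 3, 5, 4, 6, 5, 4]
    u_stat, p_value = mannwhitneyu(lp_ratings, no_lp_ratings)
    print(f"Mann-Whitney U = {u_stat}, p = {p_value:.3f}")

    # Hypothetical 2x2 table of validation success
    # (rows: LP, No-LP; columns: successful, unsuccessful).
    _, p_value = fisher_exact([[8, 0], [6, 3]])
    print(f"Fisher's exact test: p = {p_value:.3f}")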
In addition, we collected qualitative data from both Fixed-Prompt and Open-Prompt tasks. We noted validation-related behavior and quotes, which we discussed in memoing meetings [6] after the study. Through reflexive interpretation, we used category analysis [39] to assemble the qualitative data into groups. We then revisited the notes and recordings to iteratively construct high-level categories.

RQ1: Over- and Under-Reliance on AI
To investigate whether Live Programming affects over- and under-reliance, we measured whether participants successfully validated the AI suggestions in the Fixed-Prompt tasks, as described below. We also compared task completion times and participants' confidence in their solutions (collected through post-task surveys); however, neither measure differed significantly between the two groups, so we only summarize them here. In median times, the LP group completed the Pandas task faster by 35 seconds (p = .664, U = 31). For Bigram, LP participants were slower by 3 minutes and 51 seconds (p = .583, U = 42), though this difference changes to 10 seconds faster if we exclude those who solved the task incorrectly. For Pandas, both groups had a median confidence-in-correctness rating of "Agree" on seen inputs (p = .784, U = 30) and "Neutral" on unseen inputs (p = .795, U = 33). For Bigram, the LP group's median confidence on seen inputs was "Agree", while the No-LP group's was "Strongly Agree" (p = .097, U = 19.5); on unseen inputs, the median for the LP group was "Neutral" and that for the No-LP group was "Agree" (p = .201, U = 22.5).

Figure 2: Success in validating AI suggestions across groups for Fixed-Prompt tasks. "Completed" means the participant submitted a solution they were satisfied with by the time limit, and "Timeout" means they did not. We deem the validation successful if a participant submitted a correct solution (dark blue) or timed out while attempting to fix the correctly identified bugs in their chosen suggestion (light blue).

We found six instances of unsuccessful validation, all from the No-LP group. As described in Sec. 4, we compared subjective and objective assessments of code correctness on the two Fixed-Prompt tasks, which resulted in four outcomes: (1) Complete and Accurate, where the participant submitted a correct solution within the task time limit; (2) Complete and Inaccurate, where the participant submitted an incorrect solution without recognizing the error; (3) Timeout after Validation, where the participant formed an accurate understanding of the correctness of the suggestions but reached the time limit before fixing the error in their chosen suggestion; and (4) Timeout during Validation, where the participant reached the time limit before they had finished validating the suggestions. We consider (1) and (3) to be instances of successful validation, (2) to be an instance of over-reliance on the AI suggestions, and (4) to be an instance of under-reliance, as the participant did not successfully validate the suggestions in the given time. As Fig. 2 shows, we found three instances of over-reliance in the Bigram task and three instances of under-reliance in the Pandas task, all from the No-LP group, though the overall between-group difference was not significant (p = .206 for both tasks).

Participants with over-reliance did not inspect enough runtime behavior. The three No-LP participants with over-reliance in Bigram (P5, P12, P15) made a similar mistake: they accepted one of the mostly-correct suggestions (similar to Suggestion 3 in Sec. 3) and failed to notice that ties were not resolved alphabetically. Among the three participants, P5 did not run their code at all. P12 and P15 both tested only one suggestion on the given input and
failed to notice the presence of two bigrams of the same count (and the fact that other suggestions returned different results). In addition, P15 cited "reading the comments on what it was doing" as a key factor for choosing the suggestion they did. That suggestion began with a comment stating that it resolved ties alphabetically, but the code that followed did not do so.
Participants with under-reliance lacked affordances for inspecting runtime behavior. The three No-LP participants who under-relied on AI suggestions (P7, P9, P15) tried to use runtime values for validation but struggled to do so. P9 previewed and ran multiple suggestions but did not add any print statements to the code, so they could only see the output of the one suggestion that ended in a print statement. P15 ran all suggestions and did add a print statement to each to inspect the final return value, but the need to change the print statement and re-run each time made this process difficult, and they lost track of which suggestions they considered most promising, saying "I forgot which ones looked decent." Finally, P7's strategy was to print the output of subexpressions from various suggestions in order to understand their behavior and combine them into a single solution, but this was time-consuming, so they did not finish.

RQ2: Validation Strategies
Our participants had access to two validation strategies: examination (reading the code) and execution (inspecting runtime values). The general pattern we observed was that participants first did some amount of examination inside the Suggestion Panel, ranging from a quick glance to thorough reading, and then proceeded to preview zero or more suggestions, performing further validation by execution inside the editor. To this end, No-LP participants in most tasks ran the code and added print statements for both final and intermediate values; LP participants in all tasks inspected both final and intermediate runtime values in Projection Boxes (by moving the cursor from line to line to bring different boxes into focus), and occasionally added print statements to see variables not shown by default. Below we discuss notable examples of validation behavior, as well as differences between the two groups and across tasks.

LP participants spent less time reading the code. We use the time the Suggestion Panel was in focus as a proxy for examination time; Fig. 3 shows this time as a percentage of the total task duration. [...] "did not really help with choosing between suggestions" (P15). In comparison, some in the LP group (P1, P16) specifically commented that Live Programming was helpful in distinguishing and choosing between multiple code suggestions; P1 said: "Being able to preview, edit, and look at the projection boxes before accepting a snippet was very helpful when choosing between multiple suggestions." As far as we are aware, this is a new application of Live Programming, specific to AI programming assistants and not previously explored in the Live Programming literature.

DISCUSSION
Live Programming lowers the cost of validation by execution. Although both LP and No-LP participants had access to runtime values as a validation mechanism, those without LP needed to examine the code to decide which values to print, add the print statements, run the code, and match each line in the output to the corresponding line in the code. If they wanted to inspect a different suggestion, they had to repeat this process from the start. Meanwhile, LP participants could simply preview a suggestion and get immediate access to all the relevant runtime information, easily switching between suggestions as necessary. In other words, LP lowers the cost, in terms of both time and mental effort, of access to runtime values. As a result, LP participants relied more on runtime values for validation: they spent less time examining the code in general (significantly so for the Pandas task) and more often used intermediate values to find bugs in Bigram (Sec. 5.2). Our findings are consistent with prior work [2,21], which demonstrated that programmers more often use validation strategies with lower time costs. Hence, by lowering the cost of access to runtime values, Live Programming promotes validation by execution.
The lower cost of validation by execution prevents over- and under-reliance. As discussed in Sec. 5.1, we found six instances of unsuccessful validation in our study, all from the No-LP group: over-relying on AI suggestions in the Bigram task and under-relying in Pandas. We attribute these failures to the high cost of validation by execution: those who over-relied did not inspect the runtime behavior of the suggestions in enough detail, while those who under-relied lacked the affordances to do so effectively, and so ran out of time before they could validate the suggestions. Our results echo prior findings [34] that relate the cost of a validation strategy to its effectiveness in reducing over-reliance on AI. Prior work has also shown [11,36] that programmers often struggle to form an appropriate level of trust in code synthesizers, whether AI-based or not; our results suggest an important new role for Live Programming in addressing this challenge. We conclude that the lower cost of validation by execution in Live Programming leads to more accurate judgments of the correctness of AI-generated code.
Validation strategies depend on the task. Sec. 5.2 shows that participants overall spent significantly more time examining the code in Bigram than in Pandas, and also paid more attention to code attributes in the former. Participants explained the difference in their validation strategies by two factors: (1) Pandas contained unfamiliar API calls, the meaning of which they could not infer from the code alone; and (2) they perceived Pandas as a one-off task, which only had to work on the given input. We conjecture that (1) is partly due to our participants being LP novices: as they get more used to the environment, they are likely to rely on previews more, even when they are not forced into it by an unfamiliar API (as P4 mentioned in Sec. 5.2). (2), though, is more fundamental: when dealing with a general task, correctness is not all that matters; code quality becomes important as well, and LP does not help with that.
In Open-Prompt tasks, code examination was less prevalent as a proportion of the overall task duration, because in these tasks participants spent a significant amount of time on activities besides validation (e.g., decomposing the problem and crafting prompts). It might seem surprising, however, that we did not see any difference in examination time between the two groups in Box Plot, which is an API-heavy, one-off task similar to Pandas. This might be because, in Box Plot, the cost of validation by execution was already low for No-LP participants: this task did not require inspecting intermediate values, because the effects of each line of code were reflected in the final plot in a compositional manner (i.e., it was easy to tell what each line of code was doing just by looking at the final plot).
In conclusion, Live Programming does not completely eliminate the need for code examination but reduces it in tasks amenable to validation by execution.
Live Programming lowers the cognitive load of validation by execution. In Pandas, LP participants experienced lower cognitive load in four out of five TLX categories (Sec. 5.3). This confirms our hypotheses that LP lowers the cost of validation by execution, and that Pandas is a task amenable to such validation. More specifically, we conjecture that, by automating away the process of writing print statements, LP reduces workflow interruptions, which were identified as one of the sources of increased cognitive load in reviewing AI-generated code [36].
In Bigram, however, we did not observe a similar reduction in cognitive load; in fact, LP participants reported higher cognitive load in the "performance" category (i.e., they perceived themselves as less successful). Our interpretation is that the cognitive load in this task was dominated by debugging rather than validation, and whereas all participants in the LP group engaged in debugging, only two-thirds of the No-LP group did so. Moreover, the higher "performance" ratings from the LP group came from those who ran out of time trying to fix the code, and hence were aware that they had failed. These findings show that Live Programming by itself does not necessarily help with debugging a faulty suggestion. As we saw in Sec. 5.2, it can be helpful when the user has a set of potential fixes in mind, which they can quickly try out and get immediate feedback on. But when the user does not have potential fixes in mind, they need to rely on other tools, such as searching the web or using chat-based AI assistants.
From these findings, we conclude that Live Programming lowers the cognitive load of validating AI suggestions when the task is amenable to validation by execution.

CONCLUSION AND FUTURE WORK
We investigated an application of Live Programming in the domain of AI-assisted programming, finding that LP can reduce over- and under-reliance on AI-generated code by lowering the cost of validation by execution. Our work highlights new benefits of LP specific to AI-assisted programming, such as building appropriate trust in the assistant and helping to choose between multiple suggestions. Our study is necessarily limited in scope: we focused on self-contained tasks due to LP's limited support for complex programs [20,31] and its need for small demonstrative inputs [28]. We hope that our findings inform future studies on code validation and motivate further research into AI-LP integration. To that end, we highlight key opportunities below.
To offer liveness, LP places several burdens on the user: the user must provide a complete executable program and a set of test cases, and then look through potentially large runtime traces for the relevant information. AI may alleviate these burdens by filling in missing runtime values for incomplete programs [29], generating test cases [19,38], and predicting the most relevant information to display at each program point. Looking beyond the validation of newly generated code, there are also opportunities for AI-LP integration in debugging and code repair [38]. In combination, AI and LP would tighten the feedback loop of querying and repairing AI-generated code: users could validate code via LP, request repairs using the runtime information from LP [11], and further validate the repairs in LP.

Figure 1: Leap is a Python environment that enables validating AI-generated code suggestions via Live Programming. Users prompt the AI assistant via comments and/or code context. The Suggestion Panel shows the AI-generated suggestions. Pressing a Preview button inserts the suggestion into the editor. Users can inspect the runtime behavior of the suggestion in Projection Boxes [20], which are updated continuously as the user edits the code.

Figure 4: NASA Task Load Index (TLX) results for the Fixed-Prompt tasks: Bigram on the left, and Pandas on the right. Higher scores indicate higher cognitive load (in the case of Performance, this means a higher failure rate).