How Beginning Programmers and Code LLMs (Mis)read Each Other

Generative AI models, specifically large language models (LLMs), have made strides towards the long-standing goal of text-to-code generation. This progress has invited numerous studies of user interaction. However, less is known about the struggles and strategies of non-experts, for whom each step of the text-to-code problem presents challenges: describing their intent in natural language, evaluating the correctness of generated code, and editing prompts when the generated code is incorrect. This paper presents a large-scale controlled study of how 120 beginning coders across three academic institutions approach writing and editing prompts. A novel experimental design allows us to target specific steps in the text-to-code process and reveals that beginners struggle with writing and editing prompts, even for problems at their skill level and when correctness is automatically determined. Our mixed-methods evaluation provides insight into student processes and perceptions with key implications for non-expert Code LLM use within and outside of education.


INTRODUCTION
Computer scientists have been working towards programming in natural language for decades [4,38,86], with the goal of making programming easier for a broader set of users. Recent advances in generative AI have brought us nearer to this goal. In programming, along with fields like digital art [79,81,83], creative writing [2,45,68], and digital music [1,63], generative AI has reduced the technical skills that users need by allowing them to prompt a model with a natural language description of their desired output. In many fields, experts have started to use generative AI to accelerate their work, including in software engineering, where large language models of code (Code LLMs) have enhanced expert programmer productivity [69,76,105]. However, to fulfill their potential of democratizing these fields, models must be usable without extensive technical training at each stage of creation: 1) writing prompts for the model, 2) evaluating model output for quality, and 3) iteratively refining prompts when generation is unsuccessful.
Figure 1: Visualization of the multi-step process of querying a large language model of code (Code LLM). The user starts by crafting their prompt in natural language (NL). They provide the prompt to the model, which produces code. The user then assesses the correctness of the generated code. If there are errors, they must identify how to resolve them and how to edit the prompt. This continues in an iterative fashion.
Programming presents a particularly challenging domain for non-experts. Like art, computer science has evolved an extensive technical vocabulary; since generative models are trained largely on professional code, they may not work as well if users lack this vocabulary. In visual art, music, and creative writing, a user can quickly determine whether they like the generated output even if they are not an expert (embodying the cliché "I don't know anything about art, but I know what I like"). However, this attitude does not extend to programming. It is very challenging for a non-expert to evaluate the quality of a generated program. Even when a user knows enough to determine a generated program is incorrect, they also need to understand it well enough to know what needs to change and how to update their prompt.
In order to use a Code LLM, non-experts must grapple with a multi-step process (Figure 1). First, they must have a clear understanding of what they want the code to do. This may seem trivial, but research on requirements engineering has shown that it can be challenging [75]. Next, the user must clearly articulate the intended behavior of the program in natural language to the model. Once the model generates code, the user must evaluate its correctness by reading it or writing tests. If the code is not correct, they must determine what has gone wrong and update their prompt accordingly. This requires understanding not only the generated code but also the model's generative process. These barriers mirror well-known challenges for non-experts with end-user programming [50] and classical AI systems [53].
There is a growing body of work studying how non-expert programmers use AI-assisted programming systems in naturalistic settings [48,78]. However, in open-ended tasks, it is difficult to decouple the steps of the code generation process, since they feed each other: if the user fails to identify incorrect code and moves on, their editing process cannot be observed. We present results from a carefully controlled experiment targeting two steps in the code generation process: prompt creation (How do users describe the intended program in natural language?) and prompt modification (How do users modify their prompts when a generated program is incorrect?).
One challenge in studying how non-experts use Code LLMs is selecting tasks that make sense to them. For example, replicating the insightful study of experienced programmers by Barke et al. [6] would not be appropriate for novices, because the tasks presuppose technical knowledge. Novices have diverse goals, backgrounds, and familiarity with mathematical and computational thinking. Our solution is to target a large population of near-novices with similar experience levels: university students who have completed a single introductory computer science course (CS1). This allows us to select tasks that are conceptually familiar to them.
Our Approach. We ask whether students who have completed CS1 can effectively prompt a Code LLM to solve tasks from their previous course. In order to isolate students' experiences in writing and editing prompts, our experiment presents tasks as input/output pairs and tests the generated code for correctness. This provides in-depth insight into the processes they develop for describing code in natural language and iteratively refining their prompts. We pose three main research questions:
• RQ1: Can students who have completed a CS1 course effectively prompt a Code LLM to generate code for questions from their previous courses?
• RQ2: What is the origin of student challenges with Code LLMs? Do these differ across different groups of students?
• RQ3: What are students' mental models of Code LLMs and how do they affect their interactions?
We find that students struggle significantly with this task, even though we pose problems tailored to their skill level and test code correctness for them. In essence, beginning programmers and current Code LLMs tend to misread each other: the Code LLM fails to generate working code based on student descriptions, and students have a hard time adapting their descriptions to the model. Our study has concerning implications for democratizing programming: if these students, who already have basic skills in code explanation and understanding, struggle with this simplified task, the full natural language-to-code task, where the user has to determine correctness themselves, must be very challenging indeed for true novices. This finding also has important implications for education. Code LLMs have sparked an intense debate over the future of computing education, including claims that traditional programming training is no longer necessary [65,100]. By contrast, our findings highlight the continuing importance of teaching students technical communication and code understanding.
Our work differentiates itself from previous work in three key ways: scale, population, and experimental design. First, we study 120 students solving 48 different programming problems. To our knowledge, no previous work has studied user interactions with Code LLMs at this scale. Second, we focus on a near-novice population with fairly uniform levels of experience, allowing us to carefully tailor tasks to their skill level. Finally, we use an experimental paradigm that allows us to isolate the prompt writing and editing aspects of the task.

RELATED WORK
Our work focuses on how programmers use LLMs to turn natural language into code. Programming with natural language is a decades-old proposition [67] and has led to several ideas about bringing programming closer to how users communicate [70]. For instance, Hindle et al. [39] imagined that future language models could be effective at turning natural language to code, a prediction that has been borne out with Code LLMs.
By exploring beginner interactions with Code LLMs, our study contributes to a growing body of work on how non-experts interact with emerging automated technologies [98], ranging from automated feedback [22,44,94] to augmented reality [42,80]. We situate our study within existing work on user interactions with Code LLMs below.
Experienced programmers and LLMs. We study how beginning programmers interact with a Code LLM, the same foundational technology that powers autocomplete tools such as GitHub Copilot and others [14,15,95]. These tools are promoted as productivity-boosting technology for experienced programmers. Recent in-the-wild studies and surveys indicate that these tools are popular with expert programmers, improve their self-perception of productivity, and shift their work from writing code to understanding LLM outputs [10,60,69]. In contrast, our study of beginners' interactions with a Code LLM reveals that (1) they have mixed success with writing natural language prompts, and (2) they often struggle to understand LLM-generated code.
Vaithilingam et al. [96] present the earliest academic study of GitHub Copilot with 24 students (undergraduate-PhD) and three tasks. Their main finding is that although participants enjoyed using it, Copilot did not help them code faster or write more correct code. We design our study for less experienced participants. For example, we developed a web interface that is much simpler than a professional IDE. The same study reports that their participants often struggled to validate LLM-generated code, and we avoid this by testing generated code for our participants automatically.
Since Copilot is a general autocomplete tool, one can use it in several ways: to produce code given code, to generate documentation from code, to turn natural language into code, and so on. Grounded Copilot [6] studies experienced programmers and reports that they prefer using it to turn natural language into code [6, Section 4.2.3]. Thus our study design focuses on the natural language to code task, but with beginning programmers.
Non-experts and LLMs. Like us, several researchers have considered the impact of using Code LLMs for the text-to-code task with non-experts, specifically in educational settings. Our work is larger in scale than prior work (120 students from 3 institutions and 48 problems in 8 categories), which allows us to perform statistical analyses that require large sample sizes to be reliable. Moreover, our experiment design allows us to investigate key research questions that prior work has not been able to ask, such as identifying the prompting strategies that beginners use, determining how they modify prompts that do not work, and studying several factors that affect their success.
Prather et al. [78] study 19 students using Copilot for a final project in a CS1 course: building the game Minesweeper. They found that students struggled to use Copilot, even over the course of a week. We reach similar conclusions in our study, with 48 problems that are much simpler than building a working video game.
Kazemitabaar et al. [48] develop CodingSteps, a web-based Python learning environment that allows users to query Codex. The paper compares 33 participants (10-17 years old) with access to Codex to 36 students programming independently, working on the same set of 45 programming problems over several weeks. Their findings indicate that Code LLMs may benefit student learning outcomes. However, because CodingSteps presents students with expert-written problem descriptions, their results do not shed light on whether beginners can write natural language prompts independently. They report that 32% of student prompts are verbatim copies of the expert-written problem descriptions. In contrast, our study is carefully designed to avoid this problem by showing students input/output examples instead of natural language descriptions. We also investigate the strategies that students use to understand model output and modify their prompts. Kazemitabaar et al. [48] do not address these kinds of questions, partly because their students received feedback from instructors throughout the experiment.
Promptly [19] studies 54 students writing prompts for three CS1 problems. Our substantially larger scale (120 students and 48 problems) allows us to explore research questions beyond what they study, such as how students change their prompting strategies and which demographic factors influence success rates. Our paper also presents a detailed analysis of LLM output, such as the kinds of errors that appear in LLM-generated code and the impact of non-determinism on participants' success.
Lau and Guo [52] interviewed 20 CS1/CS2 instructors in early 2023 about their perceptions of ChatGPT and LLM technologies. They report that instructors hold a diverse set of perspectives: some wanted to "ban it" and others felt urged to integrate these technologies into curricula to prepare students for future jobs that may require using LLM technology. The students in our study echo many of the concerns and desires raised by instructors in Lau and Guo [52].
It is also possible to use language models to assist students learning to program, without having the model write code for the student. For example, Geng et al. [31] use language models to localize type errors in OCaml, but not to correct them. Like our study, this work isolates the interaction mode in which students use Code LLMs; however, we study prompt writing and editing, while they study error detection and explanation.
Alternatives to inline code completion. Copilot and related tools suggest inline code completions, but there are other ways to interact with AI-assisted programming tools. Vaithilingam et al. [96] present new interfaces for Visual Studio that present code changes. Liu et al. [62] build a new interaction model, grounded abstraction matching, which targets spreadsheets and data frames, constraining the generated code to support grounding. These ideas are exciting parallel directions for Code LLM interaction in addition to the natural language prompting approach we study here.
Code LLMs beyond text-to-code. For a beginning programmer, feedback from an expert teacher or teaching assistant can be invaluable. However, access to expert feedback is limited. There is a long line of research that tries to address this shortage by developing systems that generate actionable feedback for students [37,40,82,90,94]. Phung et al. [77] show that LLMs can help build these systems and generate higher quality feedback than prior rule-based approaches. In contrast to our human experiment, they evaluate on benchmark problems. Moreover, their system is intended to help beginners write code directly, whereas our experiment focuses on prompt writing.
Another body of work focuses on automated program repair [34], which can be used to fix trivial mistakes that frustrate beginners. Traditional automated program repair systems have required significant engineering for each programming language and problem domain. Joshi et al. [47] show that an LLM trained to generate code can be employed to repair simple coding mistakes.
Similarly, Leinonen et al. [55] report that Code LLMs are better at explaining code than beginning students, and Leinonen et al. [56] show that an LLM's explanation of a program error can be better than default error messages. This is further evidence that LLM technology may help students learn to write code directly.
Recent additional efforts include Finnie-Ansley et al. [25], who report that Codex is remarkably good at generating code from natural language prompts from a CS1 class and several variations of the Rainfall Problem; Dakhel et al. [17], who compare the quality of Codex-generated code to student-written code; and Babe et al. [3], who use student-written prompts to benchmark Code LLMs. Finally, Code LLMs have applications that go beyond natural-language-to-code, and researchers are using them as building blocks for a variety of other tasks [5,12,23,26,47,57,69,71,77,84,87,101]. The aforementioned papers present new tools, benchmarks, and studies of LLM capabilities. However, they do not study users' abilities to prompt models, which is the focus of our work.
Using LLMs for non-programming tasks. Researchers are currently exploring a wide variety of applications for LLMs beyond computational tasks. While we do not survey the full range of such work, two recent papers are particularly relevant to our task. Zamfirescu-Pereira et al. [103] study non-experts prompting an LLM to produce recipes. Their participants actively avoided systematic testing, which we address by automating testing. Like them, we find that participants' mental models of LLMs are very different from how they actually work. Singh et al. [89] compare user interactions with a multimedia writing interface with LLM-generated audio, text, and image suggestions. Our post-study interview and survey were inspired by their exploration of participants' perceptions of AI.

STUDY DESIGN
Our work explores whether beginning programmers can effectively prompt Code LLMs. We investigate this question through a multi-institutional [24], lab-based study, asking 120 students who completed a CS1 course to describe 8 out of 48 possible problems presented via input/output examples.
In this section, we discuss three major aspects of our study design: (1) Why do we use a controlled experiment? (2) How do we successfully present problems to students? (3) How do we select problems that are appropriate for students?
We discuss the logistics of implementing the study in Section 4.

Experimental Environment: In the Lab vs. In the Classroom
Studies of student interactions with programming tools can be grouped into three main categories: studies within the context of a course during the term, post-hoc analyses of educational data, or controlled, lab-based experiments. Post-hoc analyses are not currently possible, since there is a lack of available educational Code LLM data. We discuss the decision between a course-based study and lab-based study below.
There are many benefits to real-world studies conducted in a course context, including ease of access to participants and normalized educational background [78]. It is easier to study how technology directly impacts learning by using it alongside instruction [48] or as an evaluative method [44]. At the same time, these studies cannot be as easily controlled: participation may be optional (only around 12% of students chose to participate in Denny et al. [19]); participants may explicitly be learning through the task, making it hard to compare their responses across problems [48]; and in-depth interviews are challenging to conduct.
Lab-based studies benefit from greater uniformity in observations, which facilitates statistical analysis, and longer experimental sessions. We chose a lab-based experiment because our research questions focus on the usability of Code LLMs for beginning programmers and on their processes, rather than their educational outcomes. Specifically, the process of working with a Code LLM requires multiple, interdependent steps: (1) forming an intent, (2) crafting a prompt to describe the intent, (3) evaluating the quality of the LLM-generated code, (4) editing the prompt when the code is wrong, (5) editing the code manually, or (6) giving up and writing code manually (Figure 1). Our goal was to isolate processes (2) and (4).
Our study limits user interactions in order to isolate prompt writing and editing strategies. One key feature of our paradigm is that we automatically test the generated code. In most observational studies, programmers determine on their own whether the generated code is correct. This is itself an interesting process. However, studying this aspect of Code LLM interaction comes at the cost of studying prompt editing: if a programmer mistakenly accepts incorrect code, they will move on to the next task without editing. Prather et al. [78] report that many of their participants mistakenly accepted incorrect code. Beginning students are particularly likely to err in this way: they may struggle to understand generated code, and their lack of confidence in their own abilities may make them trust the automated system over their own judgment (an example of automation bias [18,30,32,91]).
Finally, a key contribution of our work is its scale: we study 120 participants across 3 institutions and 48 programming tasks, while previous studies have had fewer participants and problems. We recruit participants from three U.S. institutions: an R1 university (Northeastern University), a small liberal arts college (Oberlin College), and a women's college (Wellesley College). This selection increases the likelihood that our findings will generalize across institutions. Our scale allows us to explore how diverse factors, such as prior non-curricular programming experience, first-generation status, and mathematics coursework, affect participant success. These kinds of statistical analyses require large sample sizes and work best with even observations of participants and problems, which are challenging to obtain in course settings.

How to Describe Problems to Students: Input/Output Examples vs. Written Descriptions
A key design decision for studies of Code LLM interactions is how to present the task. In classroom environments, students are usually given instructions for what to program via written descriptions. This makes sense, given that the student's goal is to write code. However, natural language presentation poses critical issues for our key research questions. In our study, the goal is to write natural language descriptions of problems, not to write code. A core goal is to understand how students approach the natural-language-to-code task. If the task is presented in natural language, students may simply reuse this text rather than putting the task into their own words; our results would no longer measure beginning programmer success, but instead expert description success. Prior work shows that this is a serious concern: in the study of K-12 students by Kazemitabaar et al. [48], up to 49% of submissions for challenging problem categories were copied from the expert-written task description. Even if participants do not directly copy a description, its wording could influence how participants describe the task. One challenge for beginning programmers is recalling and applying technical vocabulary; presenting them with a natural language description of the task might remind them of terminology that they would not have recalled on their own. This would endanger our goal of assessing beginning programmers' abilities to prompt code generation models, since in many natural settings, they would not have an expert description to rely on.
We therefore rely on a popular alternative for describing program behavior: input/output examples (Figure 3). Students could also reference the function name and parameter names. Our participants had taken CS1 classes where natural language descriptions are frequently accompanied by input/output examples (see Appendix A.1.2), making this a familiar way of communicating program behavior. Several CS1 courses, and some of the assignments used in our CS1 courses, go beyond this and require students to construct their own examples or even practice test-driven development [21,27]. However, our study does not require students to write their own tests.
Avoiding natural language presentation is critical in order to study how beginning programmers describe problems in their own words. However, it comes with two risks. First, the input/output paradigm may increase task difficulty, since participants must identify the key pattern on their own. Although understanding natural language descriptions of coding tasks is not always easy for beginning programmers, it is likely easier than our input/output paradigm. Second, input/output examples run the risk of underspecification [35,88]: there may be more than one program that performs the correct input-output mapping. To determine that the provided tests adequately described the problem, we confirmed that our provided test sets had 100% code coverage for a correct solution and performed mutation testing [46]. We also calculated participants' success using only the provided test cases: if the generated code passed the provided tests, it was deemed correct, ensuring that the problem presentation aligned directly with the feedback to the user.
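To make the success criterion and the underspecification risk concrete, the following is a minimal sketch in Python. It is not the study's actual test harness: the function names, the example problem, and the example sets are hypothetical and chosen only for illustration. A candidate program counts as correct only if it reproduces every provided input/output pair, and a test set that is too small can accept a program that does not compute the intended behavior.

```python
def passes_provided_tests(func, examples):
    """Return True only if func reproduces every (args, expected) pair."""
    for args, expected in examples:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failure, just like a wrong answer.
            return False
    return True


# Hypothetical intended behavior: square every number in a list.
def square_all(numbers):
    return [n * n for n in numbers]


# A hypothetical incorrect candidate that returns the list unchanged.
def identity(numbers):
    return list(numbers)


# Underspecified examples: 0 and 1 are their own squares, so the wrong
# program is indistinguishable from the right one.
weak_examples = [(([0, 1],), [0, 1])]

# Adding one more example separates the two programs.
strong_examples = weak_examples + [(([2, 3],), [4, 9])]

print(passes_provided_tests(identity, weak_examples))
print(passes_provided_tests(identity, strong_examples))
print(passes_provided_tests(square_all, strong_examples))
```

In this sketch, the wrong candidate survives the weak example set and is rejected by the stronger one, which is the kind of gap that coverage checks and mutation testing are meant to expose.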
We feel that these potential issues pose less of a risk to our key research questions than the copy/paste or word bias risks posed by a natural language presentation. Other researchers have also used an input/output presentation paradigm in studying beginner interactions with Code LLMs [19].

Problem Selection: Previously Seen Tasks vs. New Tasks
The natural language-to-code task requires participants to describe specific programming problems. Previous work exhibits varied approaches to problem selection, from a single challenging problem in Prather et al. [78] to three simple problems in Denny et al. [19] to a set of 45 problems in 5 categories in Kazemitabaar et al. [48].
Our main goal was to select problems at an appropriate level for students who had completed only CS1. Since our research questions focus on student prompting processes, not learning outcomes, we chose problems at a similar level to what participants might be able to code independently. Asking students to solve new or more complex problem types increases the likelihood that the Code LLM will generate unfamiliar or difficult-to-understand code, making the prompt editing process more difficult. We therefore adapted Python problems specifically from CS1 course materials at each institution. We made small changes to facilitate input/output testing or adjust problem difficulty. Appendix A.1 contains two examples of how source problems were adapted.
We selected 48 problems balanced across eight conceptual categories from CS1 (Figure 2), similar to Kazemitabaar et al. [48], but with more categories and problems. Each individual problem was assigned to 20 students; we balanced the experimental lists to control for ordering effects, so that each participant solved one problem in each category, and the average difficulty of each problem list was roughly the same. To facilitate difficulty and category coverage, previous CS1 instructors were asked to provide additional problems as needed. Problems such as exp (Figure 3), for instance, require students only to recognize that numbers in a list are being squared. Other problems ask students to remember complex data structures (e.g., lists, dictionaries), but not the specific Python syntax for them. We further discuss student understanding of the problems in Section 7.2 and Appendix B.2.
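The counts above are mutually consistent: 120 participants each solving 8 problems gives 960 problem attempts, which equals 48 problems each assigned to 20 students. The sketch below shows one simple rotation scheme that satisfies these two constraints. It is an illustrative assumption, not the procedure actually used to build the experimental lists: it assumes six problems per category (48 / 8) and does not balance presentation order or list difficulty, both of which the study also controlled.

```python
from collections import Counter

N_PARTICIPANTS = 120
N_CATEGORIES = 8
PROBLEMS_PER_CATEGORY = 6  # assumed: 48 problems spread evenly over 8 categories


def assign_problems(participant):
    """Give one problem per category, rotating by participant index.

    Each problem is identified by a (category, index_within_category) pair.
    """
    return [
        (category, (participant + category) % PROBLEMS_PER_CATEGORY)
        for category in range(N_CATEGORIES)
    ]


# Count how many participants receive each problem.
counts = Counter(
    problem
    for participant in range(N_PARTICIPANTS)
    for problem in assign_problems(participant)
)

print(len(counts))           # number of distinct problems assigned
print(set(counts.values()))  # distinct per-problem assignment counts
```

Because 120 is a multiple of 6, the rotation places every problem on the same number of lists; a real design would additionally shuffle order within each list and check average difficulty.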
In order to study interactions between Code LLMs and students, it is important to select problems that cannot be trivially solved by a Code LLM without any natural language description. Very common functions (for instance, shorten_url) can be solved from a function signature alone, regardless of the accompanying description. To validate our problems, we first checked that the model could not solve problems from their function/parameter names alone and, if it could, edited the names accordingly. We also solved each problem using the Code LLM to ensure that a working natural language description existed. Finally, to address the nondeterminism of Code LLMs, we ran each validation check multiple times to obtain a stable estimate of these results (§5.2).
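Repeating a validation check under sampling amounts to estimating a pass rate from several independent generations. A minimal sketch of that idea follows; the sample counts and the stand-in generator are assumptions for illustration only, and no real model API is called.

```python
import random


def estimate_pass_rate(generate, is_correct, n_samples=20):
    """Sample n_samples completions and return the fraction judged correct.

    generate:   zero-argument callable returning one sampled completion
    is_correct: callable that checks a completion, e.g. by running tests
    """
    passed = sum(1 for _ in range(n_samples) if is_correct(generate()))
    return passed / n_samples


# Hypothetical stand-in for a sampled model: it returns a "correct"
# completion most of the time and an "incorrect" one otherwise.
rng = random.Random(0)


def fake_model():
    return "correct" if rng.random() < 0.7 else "incorrect"


rate = estimate_pass_rate(
    fake_model, lambda completion: completion == "correct", n_samples=200
)
print(f"estimated pass rate: {rate:.2f}")
```

A signature-only prompt with an estimated rate near zero and a reference description with a rate well above zero would together indicate that the description, not the function name, is doing the work.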

STUDY LOGISTICS
The previous section (§3) described our multi-institutional experimental design. In this section, we discuss the logistics of participant recruitment and executing the study.

Charlie Interface
We built a web application for the experiment called Charlie the Coding Cow, or simply Charlie. Charlie presents one problem per page, displaying the function signature and several input/output examples (Figure 3a). Participants write natural language descriptions in a text box. When they submit a description, the Charlie server prompts Codex with the function signature and their description formatted as a docstring (Figure 4). After Codex responds, Charlie shows students the Codex-generated code and displays whether it works on the given input/output examples (Figure 3b).
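The prompt construction described above can be sketched as follows. This is an illustration under stated assumptions, not Charlie's actual code: the exact layout of the docstring (indentation, quote placement) is not reproduced in this section, and the function name build_prompt, the signature, and the description are hypothetical.

```python
def build_prompt(signature, description):
    """Combine a function signature and a description into a completion prompt.

    The description is placed in a docstring directly under the signature,
    so that a code completion model continues with the function body.
    Assumes the description contains no triple quotes.
    """
    lines = [signature.rstrip(), '    """']
    for line in description.strip().splitlines():
        lines.append("    " + line.strip())
    lines.append('    """')
    return "\n".join(lines) + "\n"


# Hypothetical signature and participant description.
prompt = build_prompt(
    "def square_all(numbers):",
    "Return a new list where every number in the input list is squared.",
)
print(prompt)
```

Ending the prompt right after the closing docstring quotes leaves the function body as the natural continuation for a completion model.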
Charlie does not permit participants to edit the generated code, since we are focused on natural-language-to-code interactions. If the code fails, they can retry the problem or move to the next problem. For retry attempts, we pre-fill the text box with their last prompt to make editing easier. Finally, after every final attempt at a problem, Charlie presents two forced-choice questions with thumbs-up / thumbs-down answers: "Did Charlie generate correct code?" and "Would you have written this code yourself?" We included these questions to gather information about student perceptions of code style, since the model may produce working code, but in a style that is unfamiliar to students. Each student worked with Codex to solve 3 tutorial problems and 8 main problems.
We used the Charlie character to provide distance from any AI system that students might already know. This suggested a representation that was not human and not robotic. Charlie also provides visual feedback: Charlie animates a "thinking" position while Codex generates a completion and appears in different forms when the code does or does not pass all tests. We made this design choice to mitigate frustration with waiting for the model to generate code, a source of annoyance in prior studies of Code LLM interactions [69].

Model Choice
When we began piloting in November 2022, the most capable Code LLM was the largest Codex model from OpenAI, code-davinci-002. Although code-davinci-002 was first released in 2021, on established Python programming benchmarks, it remains as good as gpt-3.5-turbo, which is the model presently used for GitHub Copilot's inline completions [104], the free version of ChatGPT, and several other commercial products. Specifically, gpt-3.5-turbo and code-davinci-002 score 48% and 46% respectively on the HumanEval Python programming benchmark [11,74], the most commonly used Python benchmark for Code LLMs. Since we started our study, several other LLMs have also appeared, including nonproprietary LLMs that are better for reproducibility (§9.5). The best open models perform comparably to code-davinci-002; for instance, CodeLlama (34B) achieves 48% on HumanEval [85]. This suggests that the model that we use is as capable at code completion as newer models used in practice.
There are larger models that are more capable, such as GPT-4, which achieves a HumanEval score of 67% [74]. However, GPT-4 is significantly slower than the alternatives, and low latency is essential for LLM code completion to be acceptable to users [69]; if participants have to wait more than a few seconds for the generated code, their frustration might lead them to move on rather than re-attempting the problem.
For consistency, we used the same Codex model throughout the study (code-davinci-002). It is important to note that Code LLMs perform best when their output is sampled; consequently, the model may produce different programs for the same prompt. We generated output using best practices for hyperparameter and sampler settings [13].

Participants
We recruited 40 participants from each institution (n = 120). Eligible participants were at least 18 years old, had taken CS1 at their institution between Fall 2021 and Spring 2023, and had not completed any subsequent CS courses. We recruited participants from March to July 2023 until reaching our sample size of 120. The pilot and main study received IRB approval.
Care for Participants. Our study design sought to balance obtaining accurate data with addressing potential discomforts and power dynamics. Potential discomforts for participants included frustration regarding their inability to complete a task, which could reinforce negative perceptions of self or CS. In the tutorial, we emphasized that our goal was not to evaluate their programming skills, but the collaboration with Charlie. Students were allowed to move on from a problem at any time, resulting in a variable number of attempts per problem.
We took several steps to address potential power dynamics between students and their professors. Recruitment was done through an interest form distributed by other faculty or staff. Scheduling was performed by a researcher at another institution. Finally, research sessions were never run by a professor at the same institution as the participant.
For each problem, the frontend provides the participant with the signature and tests and asks them to write a description (prompt). This is then relayed to the backend, where the signature and prompt are sent to Codex via the API. The code completion from Codex is then run on our pre-defined tests. Finally, the results of running the tests and the code completion are presented to the participant in the frontend interface.
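The request flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's actual implementation: `build_model_input`, `run_tests`, `run_attempt`, the docstring layout of the model input, and the test-case format are all hypothetical, and the Codex API call is replaced by a stub (`stub_model`) so that the example runs offline.

```python
from typing import Callable, List, Tuple

# Hypothetical test-case format: (argument tuple, expected return value).
TestCase = Tuple[tuple, object]


def build_model_input(signature: str, prompt: str) -> str:
    """Combine the signature with the participant's description.

    Placing the description in a docstring is an assumed layout.
    """
    return f'{signature}\n    """\n    {prompt}\n    """\n'


def run_tests(source: str, func_name: str, tests: List[TestCase]) -> List[bool]:
    """Execute the generated program and check each pre-defined test."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # a real system should sandbox this step
    except Exception:
        # Syntax errors and load-time failures fail every test.
        return [False] * len(tests)
    func = namespace.get(func_name)
    results = []
    for args, expected in tests:
        try:
            results.append(func(*args) == expected)
        except Exception:
            results.append(False)
    return results


def run_attempt(signature: str, prompt: str, func_name: str,
                tests: List[TestCase],
                query_model: Callable[[str], str]) -> dict:
    """One attempt: send signature + prompt to the model, then test the code."""
    model_input = build_model_input(signature, prompt)
    completion = query_model(model_input)
    source = model_input + completion
    results = run_tests(source, func_name, tests)
    return {"code": source, "results": results, "success": all(results)}


def stub_model(model_input: str) -> str:
    """Stand-in for the Code LLM call; always returns the same function body."""
    return "    return a + b\n"


result = run_attempt("def add(a, b):", "Add the two numbers.", "add",
                     [((1, 2), 3), ((0, 0), 0)], stub_model)
print(result["success"])  # True
```

The participant only ever sees the last dictionary: the generated code and the per-test outcomes.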

Study Execution
The study was conducted over Zoom with audio and video recording. Participants signed informed consent material ahead of the experiment and assented at its start. They were compensated with a $50 gift card for the estimated 75-minute study.
Main Task. Figure 2 (1) outlines the full study design. Students completed 3 tutorial problems to get familiar with the interface and see some possible Codex responses. We supplied participants with a working prompt for the first tutorial problem, then gave them a difficult problem so they could see a failure, and a final easy problem to solve independently.
The main experiment consisted of 8 problems in two blocks, the first untimed, the second timed. In the second block, students were limited to 5 minutes per problem. We included both timed and untimed blocks in order to balance the need to bound study duration with the desire to observe complete prompt editing cycles.
Participants were randomly assigned experimental lists, balanced by difficulty, using a modified Latin Square design. Four authors independently assessed the difficulty of writing prompts for each problem; we averaged these scores and developed six roughly equal lists (Figure 2).
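To illustrate the idea of counterbalancing problem order, the following sketch builds a plain cyclic Latin square and cycles participants through its rows. It is not the modified design used in the study, whose details are not given here; the function names, the block of four problems, and the 20 participants are illustrative assumptions.

```python
def cyclic_latin_square(items):
    """Return n orderings of n items; each item appears once per position."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(participants, items):
    """Cycle participants through the rows of the square."""
    square = cyclic_latin_square(items)
    return {p: square[i % len(square)] for i, p in enumerate(participants)}


# Illustrative numbers only: 20 participants, one block of four problems.
orders = assign_orders(range(20), ["p1", "p2", "p3", "p4"])
first_counts = {}
for order in orders.values():
    first_counts[order[0]] = first_counts.get(order[0], 0) + 1
print(first_counts)  # {'p1': 5, 'p2': 5, 'p3': 5, 'p4': 5}
```

With these numbers, every problem is seen first by the same number of participants, which is the property that makes position effects separable from problem effects.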
Post-task Interview and Survey. After the main study, students completed a two-part survey, a semi-structured interview, and an optional debriefing session (Figure 2 (1)). The semi-structured interview was interleaved between two survey blocks to mitigate question ordering and priming biases.
The first part of the survey was designed to study student perceptions of Charlie and of AI more broadly. We adapted validated scales from previous work to understand student perceptions of the usability, trustworthiness, and friendliness of Charlie [7,20,51,99] and the mental workload of the task [36]. We were also interested in whether students' ability to come up with effective prompting strategies might correlate with fixed versus growth mindsets about computing; we drew on Gorson and O'Rourke [33] to measure this.
The semi-structured interview asked 8 questions covering student editing processes, what they found hard or easy, how they envisioned their interactions with Charlie, and how they imagined Charlie worked. The specific questions were directly inspired by our overarching research questions. Researchers followed a standing script to ask each question; there are a total of 5 missing question responses across the possible 960 interview datapoints, likely due to researcher error or time considerations. In the optional debriefing, we explained the experiment and how Code LLMs work.
The second part of the survey focused on participants' backgrounds and demographics. These were the last questions of the study to mitigate possible stereotype threat [72]. For questions related to identity (e.g., gender, race, spoken language at home), we followed best practices and solicited responses via open text boxes [92]. We also asked questions about students' CS1 performance, experience with programming outside of CS1, high school and educational background, math background, major, and class year.
Pilot Study. In late 2022, we ran a pilot study with 19 participants to assess the study design and usability of the interface. Pilot participants were recruited from the same three institutions as in our main study, but were students who had taken more than one CS course. This small pilot allowed us to make sure the web platform was working correctly, identify any problems with specific tasks, refine our time estimates, and assess the quality of the automatic transcriptions of the interview recordings produced by otter.ai. During the pilot, we identified one problem with ambiguous test cases, which we changed before the main study. Pilot participants solved an average of 5.5 out of 8 problems (an Eventual Success Rate of 68.8% using the metric described in §5.2).
Because the average pilot participant took 53 minutes, we increased the time estimate and compensation from $30 for 60 minutes to $50 for 75 minutes for the main study. We also added a hidden time limit to the first block of questions in case participants spent more than 50 minutes on this portion of the study; this issue never arose in the main study.

ANALYSIS
This section presents the analysis framework for §6, §7, and §8. We take a mixed-methods approach to this work.

Evaluation Plan
Qualitative analysis. We collected three types of data which lend themselves to qualitative analysis: (1) information about student experience and demographics, (2) free-response questions about future use of Charlie, and (3) semi-structured interview responses. We employed both inductive and deductive open coding towards consensus. Our aim was to identify common themes present in this specific dataset, rather than to develop a theory. Two researchers with previous qualitative experience conducted the analysis; Section A.2 contains details of the coding methodology. We present selected quotes from the surveys and interviews throughout. Quotations have been lightly edited from the automatically generated transcripts. This includes addressing grammar/punctuation, removing speech errors or filler words, and avoiding the disclosure of any identifiable information. Each participant's quote is accompanied by a pseudonym assigned to them during data collection.
Statistical analysis. We perform statistical testing with a significance level of α = 0.05 in order to determine whether observed differences in response measures are statistically reliable. For comparisons between two groups, we use Student's t-test. For comparisons between multiple groups, we perform ANOVAs; in cases where there is no natural reference group, we use Tukey HSD tests to explore pairwise differences. We report Pearson's r for correlations between continuous variables and Kendall's τ for correlations between continuous and ordinal variables. Where we are interested in multiple potentially interacting variables, we fit linear mixed-effects models with maximal random effects for participants and problems using the lme4 package in R [8].
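As a concrete reference for the two-group comparison, the pooled-variance (Student's) t statistic can be computed as in the sketch below. The function name `student_t` and the sample values are illustrative assumptions, not study data; converting the statistic to a p-value additionally requires the t distribution, which a statistics package would normally supply.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for two independent groups,
    assuming equal variances (the classic Student's t-test)."""
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    # Pooled estimate of the common variance.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / dof
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, dof


# Made-up values for illustration only.
t_stat, dof = student_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(round(t_stat, 3), dof)  # -1.0 8
```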

Measures of Success
There are several ways to measure success when evaluating the natural-language-to-code task. The success rate is the fraction of all attempts on which the model generates a working program. Therefore, a participant who takes several attempts to solve a problem will have a lower success rate than another who succeeds in one try. We might also ask whether a participant is ever able to solve a problem; we refer to this as the eventual success rate. This metric considers only the participant's final attempt at each assigned problem. The eventual success rate metric is likely specific to this paper, as closely related work [19,48,78] studies different notions of success or does not permit controlled, repeated interactions.
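The difference between the two metrics is easiest to see on a small example. The sketch below assumes a hypothetical attempt log of `(problem_id, passed)` pairs in chronological order for one participant; the function names and data are illustrative.

```python
def success_rate(attempts):
    """Fraction of all attempts that produced a working program."""
    return sum(passed for _, passed in attempts) / len(attempts)


def eventual_success_rate(attempts):
    """Fraction of problems whose final attempt succeeded."""
    last = {}
    for problem_id, passed in attempts:
        last[problem_id] = passed  # later attempts overwrite earlier ones
    return sum(last.values()) / len(last)


# Hypothetical participant: solves problem "a" on the third try,
# gives up on problem "b" after two failures.
log = [("a", False), ("a", False), ("a", True), ("b", False), ("b", False)]
print(success_rate(log))           # 0.2
print(eventual_success_rate(log))  # 0.5
```

Every failed attempt lowers the success rate, while the eventual success rate only depends on how each problem ended.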
Although success rates measure the correctness of the code that students saw during the experiment, LLM generation is nondeterministic; greedy generation is significantly worse for coding tasks than non-deterministic generation [13]. Therefore, studying success rates can be misleading: a participant may have just been lucky with a bad prompt or unlucky with a good prompt. For this reason, we also employ an alternative metric called pass@1, which accounts for non-deterministic generation [13]. Since the debut of Codex, pass@1 has become the standard metric used to evaluate LLMs on the natural-language-to-code task, including GPT-4 [74], Code Llama [85], and other models [29,59,73].
Given a natural language prompt, pass@1 [13] is an estimate of the probability that the LLM will generate working code in one attempt. In the LLM development literature, the accepted best practice for computing pass@1 is to query the LLM 200 times for the same prompt and test every generated program [13,29,85,102]. Sampling 200 generations for all 2,000+ prompts generated as part of this study would be very expensive with the Codex API. Instead, we use a recently released open Code LLM called StarCoder [59] that is nearly as capable as the Codex model on Python benchmarks. Pass@1 with StarCoder will be slightly lower than Codex success rates because of model differences. However, pass@1 is a more stable measure of whether a prompt will succeed than success rate. We use pass@1 for the bulk of our analyses.
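For reference, the unbiased pass@k estimator introduced with Codex [13] can be written in a few lines. Given n sampled programs of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k); for k = 1 this reduces to c / n. The function name and the counts in the example are illustrative, not study data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n sampled programs passed all tests."""
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts: 50 of 200 samples pass.
print(pass_at_k(200, 50, k=1))  # 0.25
```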

Positionality
All authors were affiliated with the institutions from which participants were recruited (Oberlin, Wellesley, or Northeastern) at the time of the study; we range from undergraduate students to tenured faculty. We developed the problem lists, problem difficulty ratings, and other elements of the study design within a shared educational context. The last three authors are course instructors for CS1. As described in §4.3, significant care was taken to address power dynamics between participants and researchers. Some authors also contribute to the development and evaluation of open-source Code LLMs. Overall, the potential incentives for the research team are complex, as we approach this work as both educators and researchers. We aspire to a neutral perspective on Code LLMs, while attempting to center the student experience.
This research studies students at three selective higher education institutions in the United States. Therefore, while we are able to generalize beyond a single CS curriculum, the educational context is specific: our findings may not generalize to other settings (e.g., community colleges, K-12 education) or cultural contexts.

RQ1: DO STUDENTS SUCCEED AT PROMPTING CODE LLMS WITH NATURAL LANGUAGE?
In this section, we present how well students do on our Code LLM prompting task and address RQ1: do students succeed at prompting Code LLMs with natural language? We explore differences between students that are linked to their ability to successfully describe problems to Code LLMs.

Basic Findings
Figure 5 presents the distribution of participants' success rates and eventual success rates. The average participant solved 4.7 out of 8 assigned problems. The mean eventual success rate (57%) is not high, and the mean success rate (24%) is even lower, since it decreases with every failed attempt. We find no significant institutional difference for either measure of success.
Participants often submitted a large number of failing attempts (Figure 5d): 153 problems (aggregated across participants) required three or more attempts. In fact, one participant succeeded at a problem only after 32 attempts; another gave up after 26 attempts. These results suggest that low success rates are not due to a lack of participant effort. Participants struggled to write natural language prompts for the LLM, and often achieved success only after many repeated attempts. The challenging nature of this task is supported by comments from the students themselves (§7.1).
Figure 5: Basic measures of student success at the natural-language-to-code task. Success rate is the fraction of all attempts by a participant that succeed. Eventual success rate is the fraction of last attempts at a problem by a participant that succeed. Pass@1 resamples the LLM several times to estimate the probability of success. We present these measures by institution. Figure 5a presents the means. Figure 5b and Figure 5c show the distribution of (eventual) success rates. Eventual success rates are higher than success rates, which is to be expected: Figure 5d shows that many students make several attempts at a problem before an eventual success or give up.
Table 1: Self-reported high school, language, and family background.

Do Participants Find the Task Challenging?
In the post-survey, participants completed four items of the NASA TLX [36]. Overall, students found the task mentally demanding (Table 2). The questions about mental demand (Q1), time pressure (Q3), and their own performance (Q4) correlate inversely with success rate. Students whose success rates were lower generally rated the task as more demanding (Kendall's τ = -0.16; p = 0.02); were less likely to say they were successful (Kendall's τ = -0.4; p < 0.0001); and reported higher levels of stress and insecurity (Kendall's τ = -0.27; p < 0.0001).
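As a reminder of what these coefficients measure, the sketch below computes Kendall's τ between a continuous variable (success rate) and an ordinal rating by counting concordant and discordant pairs. It implements the tie-corrected τ-b variant, which is an assumption on our part since ordinal ratings contain many ties; the function name and the data are made up for illustration.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    total_pairs = n * (n - 1) // 2
    denom = sqrt((total_pairs - ties_x) * (total_pairs - ties_y))
    return (concordant - discordant) / denom


# Made-up data: higher success rates paired with lower demand ratings.
success = [0.1, 0.2, 0.3, 0.4]
demand = [5, 4, 4, 2]
print(round(kendall_tau_b(success, demand), 3))  # -0.913
```

A negative value means that participants with higher success rates tended to give lower ratings on the item.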

Who Succeeds at the Task?
Using data from the post-survey, we analyze the relationship between pass@1 rates and previous knowledge, prior programming experience, and demographics (see Table 1 for a summary of demographics). We find only two statistically reliable differences (see Appendix, Table 11 for the full statistical analyses):
• Prior programming experience: About 1/3 of participants had no programming experience outside of CS1. The remaining participants had taken pre-college programming courses (24%), were in the next CS course (21%), or had coding experience outside of classes (29%). There is a statistically reliable difference (t-test; p = 0.02) in pass@1 for students who have only coded in CS1 (0.17) versus those with additional experience (0.24).
• First-generation college students: 19.1% of participants identified as first-generation college students. We observe a statistically reliable difference in pass@1 for first-generation participants, who struggle more with the task than others (t-test; p = 0.04).
We examined other factors, but found no significant difference in pass@1 rates:
• Computing intensive majors: 42% of participants were pursuing computationally intensive majors. We observe identical pass rates for both computing and non-computing majors.
• International students: International and U.S. domestic students had similar pass@1 rates.
• Household language: Our participants reported growing up in households where a diverse set of languages were spoken: only English (40.8%), English and other languages (34.2%), and without English (24.2%). We were surprised to find that pass@1 did not reliably vary by childhood language. However, all participants were from selective U.S. institutions that require fluency in English, regardless of childhood language exposure.
• Public vs private high schools: 1/3 of participants attended private schools; this had no impact on pass rates.

RQ2: WHERE DO STUDENT DIFFICULTIES COME FROM?
Having shown that students find it hard to prompt a Code LLM in natural language (§6), we explore why. In this section, we present quantitative and qualitative results that address RQ2: when students struggle with the task, where do the struggles come from? What are the most challenging aspects of the natural-language-to-code task?
7.1 What aspects of the task do students say are hard?
In the semi-structured interview, we asked participants to reflect on challenges and issues they encountered. Three common themes emerged: difficulties in getting Charlie to understand them; issues with the generated code; and issues stemming from students' self-reported lack of knowledge or skill (Table 3).
Charlie Doesn't Understand Me. The most commonly raised issues related to Charlie's understanding of prompts (n=91); we divided these into subcodes. One of the most common of these was the sentiment that Charlie failed to understand good descriptions (n=23). For instance, redCoyote commented, "It was definitely difficult to have a concept of what you wanted written in your head, and then feel like you're articulating it well, but having it not work properly." Similarly, aquaLadybug reports feeling helpless when a good prompt didn't succeed: "if I was saying it [...] how I thought [...] is the best way to say it, but it still wasn't working, I had no idea where to go from there."
Issues with Generated Code. Another major theme was issues with the generated code. Many comments related to perceived bugs in the generated code or difficulty debugging (26%). Students also mentioned finding the model's randomness frustrating (8%). khakiBee was alarmed to find that resubmitting the same prompt could generate different programs, commenting "You feel like you've made progress, and then because it did a different thing the next time, it's like, what do I change? I'm trying to change what I give to the cow. And then that should change what the cow is doing. But if I'm not changing anything, why is that changing?" Some students also experienced the opposite issue: despite changing their descriptions, the model generated the same incorrect function repeatedly. purpleCarp commented, "Sometimes I changed my [...] description and it just repeated the code the same. And it's just very frustrating". This highlights the difficulty of working with stochastic models: students expect the model output to be faithful to their descriptions.
Student Struggles. Participants also reported issues stemming from their own lack of knowledge. 10% of students reported difficulty understanding a problem, and 8% reported difficulty in understanding generated code. yellowChipmunk said, "sometimes with the code, just given my knowledge, that's not necessarily the way I would go about coding the code. But I think to even understand it, I would have to know what the code is trying to do, which takes more time than me just trying to reword what I said". A handful (n=4) reported that forgetting terminology made it hard to write prompts.

Which Problems Do Students Say Are Hard?
Some categories of CS1 problems may be harder to solve with Code LLMs, either because the concepts are difficult or because they are difficult to describe. We examine pass@1 and eventual success rate by category as well as interview responses about which problems were challenging and easy.
We find that pass@1 and eventual success rates both vary by category (Table 4). We fit a binomial mixed-effects model to prompt success (1 if the prompt succeeded; 0 otherwise), with fixed effects of category, institution, and their interaction, and random effects of problem and participant (see Appendix, Table 12). A statistically reliable difference in success was observed only for Sorting problems, which were the most challenging (p = 0.045). Participants from Oberlin struggled more in the Nested category compared to other students, but the effect is not statistically reliable (p = 0.063).
Table 4: Pass@1 and success rates by problem category. Each category has six problems, and an equal number of students attempted each problem. The starred (*) problems were timed. Student Difficulty Ranking is done by ordering mean Eventual Success Rate from least to greatest, as that provides a measure of what percentage of students successfully solved a given task.
Interviews provide insight into students' post-task perspectives. The category most commonly named as easiest was Math (n=21), whereas the category most commonly named as hardest was Nested (n=19), followed by Dictionaries (n=14). These do not match the ranking in Table 4, suggesting a disconnect between student performance and perceptions of difficulty.
A common theme that emerged related to the challenge of putting understanding of the problem into English (n=44). crimsonVole said, "the ones that had huge lists of like, strings, and integers, were really hard to solve, because they were really hard to describe for me." We differentiated this code both from students' ability to identify patterns (n=35) and their ability to write the code without Charlie (n=8). The opposite code, Easy to Describe, applied to 36 responses from the easiest question: "I felt like time ones because they're pretty straightforward. They're like [...] exercises that we do in my Intro CS class. And so I guess it will be easier for me to word, the description or my thinking process, like I guess that might be easier." (yellowWeasel).
Three codes that related to students' lack of knowledge emerged, with 27 responses (see §7.4 for more perspectives).

What Role Does the Model Play?
LLMs can fail in surprising ways. We now explore the kinds of model failures that participants encountered.

Syntax errors. Contemporary Code LLMs generally produce syntactically well-formed programs. However, 5.5% of student prompts led to Python syntax errors. We manually examined and categorized them:
• 27 generations: Codex produces degenerate, repetitive text [43] or Python 2 print statements. These are model failures.
• 81 generations: Codex could not generate a complete function within the 256 token limit (≈800 characters). Our problems are simple enough to be solvable in far fewer tokens, so increasing the token limit is unlikely to help.
• 88 generations: Codex generates incomplete code after a complete function, even with standard stop tokens.
The latter two categories arise from a trade-off in system design: the first when the interface does not request enough tokens from the Code LLM; the second when it requests so many that the model generates extraneous additional code. Although these errors are infrequent, they are hard for students to deal with. In 22.4% of these cases (n=44), students gave up after seeing the syntax error.
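The third category can be illustrated with a short sketch: a completion that contains a correct function followed by a truncated second definition fails to parse as a whole, but parses once it is cut at a stop sequence. The stop sequences, function names, and example completion below are assumptions for illustration; they are not the settings used in the study.

```python
import ast

# Assumed stop sequences: text that typically starts new top-level code.
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]


def truncate_at_stop(completion: str) -> str:
    """Cut the completion at the earliest stop sequence, if any."""
    cut = len(completion)
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


def has_syntax_error(source: str) -> bool:
    """True if the source is not syntactically valid Python 3."""
    try:
        ast.parse(source)
    except SyntaxError:
        return True
    return False


signature = "def double(x):\n"
# A correct body followed by an extra function that was cut off mid-line.
completion = "    return x * 2\n\ndef helper(y):\n    return (y +"

print(has_syntax_error(signature + completion))                    # True
print(has_syntax_error(signature + truncate_at_stop(completion)))  # False
```

The same check flags Python 2 print statements, since `ast.parse` follows Python 3 syntax.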

When the Model Produces Different Programs From the Same Prompt. Codex is best at coding when its output is sampled (§4.2), but this stochasticity can frustrate students trying to modify prompts. In 107 cases (4.2%), a student submitted a prompt several times, and in most of these cases, Codex generates a new completion. A few of these are trivially different (e.g., different variable names), but most (n=86) are different functions. Some students pointed this out in the interview: beigeHalibut noted that they "usually would run a couple times, because Charlie is not very consistent with the answers. And sometimes it works. Sometimes it wouldn't work."

When the Model Produces the Same Program Despite Changes to the Prompt. When the Code LLM produces an incorrect function, and a user edits their prompt, their intent is to have the LLM produce a different (hopefully correct) function. Frustratingly, this does not necessarily happen: sometimes the model repeatedly generates the same code despite edits to the prompt. We observe many instances where this happens (104 submissions, 11% of total): it occurs in most problems (36 of 48 problems) and is encountered by a majority of students (72 of 120 students). This often leads students to give up. In fact, out of the 340 problems where students gave up, 70 were cases where the participant edited the prompt and the LLM repeatedly generated the same code.
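Deciding whether two completions are the same program, or only trivially different, requires some notion of equivalence. One possible approach, sketched below, compares Python syntax trees after renaming variables in order of first appearance, so that formatting, comments, and variable names are ignored. This is an illustrative assumption, not the procedure used in the study; `normalized` and `same_program` are hypothetical helper names.

```python
import ast


class _Renamer(ast.NodeTransformer):
    """Rename every variable and parameter to v0, v1, ... in order of
    first appearance, so that naming differences disappear."""

    def __init__(self):
        self.names = {}

    def _canonical(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._canonical(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canonical(node.arg)
        return node


def normalized(source: str) -> str:
    """Return a canonical string for the program's structure."""
    tree = _Renamer().visit(ast.parse(source))
    return ast.dump(tree)  # excludes line numbers and comments


def same_program(a: str, b: str) -> bool:
    return normalized(a) == normalized(b)


first = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = ("def total(xs):\n    acc = 0\n    for item in xs:\n"
           "        acc += item  # running sum\n    return acc\n")
different = "def total(xs):\n    return sum(xs)\n"

print(same_program(first, renamed))    # True
print(same_program(first, different))  # False
```

This only captures syntactic sameness; two programs with different structure but identical behavior would still count as different.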

What Do Students Do When They Encounter Unfamiliar Python?
Code LLMs are trained on online repositories of code and may generate code using language features that students have not seen before.
New Python Constructs. In their interviews, some students (n=5) report issues understanding code due to unfamiliar language features. oliveBear comments about the lambda construct for anonymous functions: "I've only ever seen [it]   replace, and try/except. List comprehensions are an interesting case because Wellesley teaches them, but Oberlin does not. When asked about generated code with list comprehensions, 9/24 (37.5%) Oberlin students indicated that it is similar to code they would write themselves, compared to 20/33 (60.6%) Wellesley students. Some students responded differently to the same completion (Figure 6).
Ratings of Final Completions. Students evaluated the correctness and naturalness of the final completion for each problem, producing 960 responses. For correctness, 61.8% of the time students indicated that Charlie's code was correct; the majority (543; 91%) are cases where all tests passed. However, naturalness responses were more mixed. Students indicated that Charlie's code was like code they would write themselves only 58.3% of the time. 78.6% of such responses were made when the code passed all tests. Responses to these questions might diverge when the model generates correct code that is unfamiliar or approaches a problem differently, as well as in cases where the model's code is incorrect, but looks familiar to students.

RQ3: STUDENTS' MENTAL MODELS AND PROCESSES
This section addresses RQ3, presenting results related to participants' perceptions of the task, their mental models of Charlie, and their strategies for writing prompts.

How does Charlie work, according to students?
In interviews, students were asked how they thought Charlie worked (Table 5). Comments fell into two broad themes: descriptions of Charlie's knowledge, and descriptions of Charlie's processes.
Processes. Comments in the Translation theme (n=13) described Charlie in terms of a machine translation process (fuchsiaBeaver: "I thought of him as like a translator, like between English and code"). Comments in the Sequential theme (n=13) described Charlie as working line-by-line through their prompt. This is plausible but incorrect: Code LLMs condition on the entire prompt at once. This mental model might lead students to focus on individual sentences, rather than how their prompt works as a holistic description. One student actually changed their mental model while answering: "it looks like he went line by line. Wrote some code for each line that makes sense to him [...] then he returned a line of code, which makes me think that he wasn't going line by line." (khakiClam).
Charlie's Knowledge. Most students hypothesized that Charlie relies on keywords (n=46). A large group of students (n=30) had a vague keyword mental model. For instance, "I guess he probably looks for keywords, "if" and "else" and key coding words, Python words, and he probably has a knowledge of English" (wheatOtter). Another group (n=16) outlined a more specific keyword lookup model, where Charlie uses keywords to retrieve relevant code from a dictionary or database. For instance, linenBobcat described Charlie as "using the code words, and doing it sort of line by line and trying to work from what was given and writing those words with what, like in a directory or some sort of data file, understanding which ones matched up to which functions and which commands." Students with this mental model emphasize the importance of using programming terminology, since they think Charlie may not be able to retrieve code without the right keywords. Some students develop this mental model after observing that their prompts succeed when they use coding words: "I noticed that if I put in more like, computerized words, I almost had a bit more control. At one point, I forgot to mention that the function returns something. So then when I mentioned that it returned something he put in a return statement. So that felt like very, like logical to me. [...] Charlie's looking for words that kind of line up with different functions, built in functions, and using those." (tanMinnow). These students correctly observe that sounding like a programmer is important, but explain this with an incorrect mental model. Some students did correctly identify Charlie as similar to an LLM such as ChatGPT (n=17) or Copilot/Codex (n=2). Success rates for this group were slightly higher (0.27 versus 0.22; p = 0.03).

What strategies do students develop?
The first two semi-structured interview questions asked students about their strategies for writing and editing prompts. We find that students do not have a clear understanding of how models work and that their incorrect mental models appear to affect the strategies they develop for prompting in ways that might be unproductive.

Editing processes.
Over a third of students (n=48) mentioned adding detail to their descriptions when they did not succeed (Table 6). Some students mentioned clarity as a goal in adding detail, like fuschiaBat: "I will go back and try to change the wording to make it more clear, and then try it again. And see if that changes anything. And then just try to repeat that process until it works." Others noted that their descriptions needed additional detail because they did not originally fully describe the problem, or as plumBeetle puts it, "I forgot to uppercase Aspen. And that was just my silly mistake. And I will just go back and edit or add changes that I want to add and wish it's gonna work the next time I guess." Considering participants' edits quantitatively confirms the popularity of adding detail. When we consider pairs of prompts that ultimately succeed, we find that students, on average, add 9.44 words (SD = 11.34) between their first and last prompt, and 5.36 words (SD = 8.87) between their penultimate and last prompt (Figure 7).
Figure 7 (panel title): Changes between the second-to-last and last prompt.
While adding details was the most common approach, participants mentioned other strategies, such as reordering (n=5) or removing detail (n=4). There are also eight attempts where rerunning the same prompt resulted in a success; we discuss these cases in §7.3. Students looked in different places for insight into how to edit their prompts. Some considered the generated code first (n=29), some the tests (n=30). Others considered both (n=8) or reread the problem (n=7).

Strategy changes over time.
Participants had a range of responses about how their prompting processes changed over time. Some students indicated that they never really developed a process (n=13), while others (n=14) discussed actively testing and adapting to Charlie's capabilities: "I first [...] was kind of seeing what vocabulary Charlie knew. Like if he knew computer science terms, or if I had to be less computer science-y" (beigeBass).
We present key trajectories in Figure 8. Overall, we observe a range of reported experiences. Some participants reported starting more human-like and ending more technical (Pythonic), while others said the opposite. For instance, tomatoBeetle reported, "To begin with, I was using less technical terms and then using more computer science terms near the end. I was thinking that would make Charlie work better, but there wasn't really any evidence behind that", while grayRabbit said, "I kind of treated it like I was just coding but saying things I would like use kind of like if statements and integers and stuff. But towards the end, I tried to focus more on how I could say what was going on at a higher level, so using more plain language versus specific coding language." A large group reported that their prompts became more detailed (n=35) and/or more technical (n=31), mirroring the finding above that students typically add detail when editing. For instance, tanBat reports, "My initial process was just to figure out what the code is doing and then just write generic descriptions, like without any coding language inside of it. But then when I saw that Charlie kept having problems, I started to go to more coding language." However, others took the opposite approach, and ended the study writing more human-like (n=11) or concise (n=16) descriptions.
Figure 8 (caption fragment): Only trajectories between pairs are visualized. The size of the nodes is proportional to the total number of students who described their Start or End within that code.

Do Students Get Better at Prompting Over Time?
It is easy to argue that programming by prompting a Code LLM with prose is more natural than directly writing code and that Code LLM prompting is easy to learn. But how easy is easy? We investigate whether students improve at prompt writing over the course of the study by comparing success rates for (1) students who attempted a given problem first with (2) students who attempted it last. Our experiment design ensures that exactly five students attempt each problem first and five more attempt it last. We find no significant difference in success rates between the two groups, indicating that students do not observably improve at prompting within the 75-minute study.
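The text does not state which statistical test underlies this comparison. As one hedged illustration of how a first-versus-last comparison could be run, the sketch below applies a two-sided permutation test to binary success outcomes. The function names, the number of permutations, and the example data are all hypothetical and are not taken from the study.

```python
import random


def success_rate(outcomes):
    """Fraction of attempts that succeeded (outcomes are 0/1 values)."""
    return sum(outcomes) / len(outcomes)


def permutation_p_value(first, last, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in success rates.

    Assumption: this is an illustrative test choice, not the paper's.
    """
    rng = random.Random(seed)
    observed = abs(success_rate(first) - success_rate(last))
    pooled = list(first) + list(last)
    hits = 0
    for _ in range(n_permutations):
        # Reassign outcomes to the two groups at random.
        rng.shuffle(pooled)
        group_a = pooled[:len(first)]
        group_b = pooled[len(first):]
        if abs(success_rate(group_a) - success_rate(group_b)) >= observed:
            hits += 1
    return hits / n_permutations


# Made-up outcomes for one problem: five first attempts, five last attempts.
first_attempts = [1, 0, 1, 0, 0]
last_attempts = [1, 1, 0, 0, 0]
p_value = permutation_p_value(first_attempts, last_attempts)
```

A large p-value here would be read the same way as the result reported above: no observable difference between the two groups.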

What do students think about Charlie?
One of the most consistent findings in work on how experts use Code LLMs is that users enjoy using models [69,105], even when no concrete productivity or correctness benefits are observed [97,102]. However, near-novices exhibit different motivations and relationships to technology than expert programmers. This makes it important to investigate how non-experts feel about these systems.
8.4.1 Charlie's competence and reliability. The post-task survey asks participants several sets of questions related to their perceptions of Charlie. They completed 5 items from Bartneck et al. [7] adapted by Wang et al. [99] and Druga and Ko [20] for non-robotics use. Participants generally give Charlie middling ratings for knowledge and competence. Participants take more extreme positions on Charlie's persona, in opposite directions: they rate Charlie as both friendly and machinelike. Students who experience lower success rates find Charlie somewhat less competent, but do not seem to find Charlie less friendly (Table 7). Students also completed 5 items from Körber [51]'s trust of automation survey. Overall, students see Charlie as somewhat reliable and somewhat interpretable (Table 8).
Students with higher success rates tended to rate Charlie as less error prone, easier to understand, and more reliable.

Would they use Charlie?
The post-survey asked about students' attitudes toward hypothetically using Charlie in (a) the CS1 course they completed and (b) their own future programming practice. We used a thematic analysis approach to analyze this data, as with the interview data (see Appendix A.2 for more details). Overall, two-thirds (n=83) stated that they would be interested in using Charlie in CS1. Many responses were variants of "Yes", but students who responded Maybe (n=13) or No (n=23) typically explained their reasoning. About half (n=19) of these suggested that tools like Charlie would inhibit student learning. For instance, aquaLadybug noted, "If I had questions on how to program a particular thing, using something like Charlie could help me clarify any questions I had by testing out different descriptions. But if I completely relied on something like Charlie as a tool in such a class, I feel like the whole point of me taking the class is overlooked and at some point becomes redundant." Other students, including those who responded Yes, brought up how programmer skill level could play a role. tealHerring wrote, "Yes, but I would want to maybe only try it out towards the end of the course, when I've already learned the process of coding and would like to see how an AI could work to streamline the process." Other comments touched on academic integrity: "I don't think so unless my teacher explicitly endorsed it because I'm terrified of plagiarism!" (crimsonWorm).
More students supported using tools like Charlie in their own future programming practice (n=95). Maybe (n=20) and No (n=4) respondents again provided more explanation. Two common themes were Charlie's limitations and its usefulness for different kinds of problems: "If Charlie improved, then it should be able to generate simple functions for me, in which I don't have to repeat myself" (purpleCarp).
Table 8: Mean student responses to Charlie trust questions (1 = Strongly agree; 5 = Strongly disagree), adapted from Körber [51], and correlation with success rate. * indicates statistical significance.

AI Attitudes
Students were asked whether they felt optimistic or pessimistic about AI's future impact on society. About two-thirds of students were optimistic; however, students pursuing a programming major (Computer Science, Data Science, or Media Arts and Science) were notably more optimistic than other students (80% optimistic compared to 63% of other majors). There was no difference in task performance between optimists and pessimists (pass@1 rate = 0.22 for both).
Students were also asked to compare the ethicality of Charlie with three other AI deployment scenarios. Most students found Charlie less ethically concerning in each comparison (Figure 9). Student responses to these questions did not differ reliably in relation to their success rate or pass rate.

DISCUSSION
In the previous sections we discussed our three main research questions; we summarize the findings together here:
• RQ1: We find that some students can effectively prompt [...] a common prompting strategy that students developed was to expand their prompts, making them more detailed and more Pythonic. Students viewed the model as fairly capable and somewhat reliable. However, they expressed a range of opinions about whether Code LLMs would be appropriate for CS1.
In this section we draw connections between our findings and related work and discuss their broader implications.

The Natural-Language-to-Code Task is Challenging
The emergence of LLMs has led some to conclude that this is the "end of programming" [65,100]. In contrast, we find that beginners who can write code nevertheless struggle to write natural language prompts for LLMs. We carefully select problems that are similar (or identical) to those they completed to pass CS1. The average participant solves 57% of the assigned problems, but only after several attempts and with automatic feedback on code correctness. Our study contributes to the existing work on beginner interactions with Code LLMs by measuring how well students can use Code LLMs to solve problems at their own programming skill level, rather than in the context of a learning activity, where students may not be expected to be able to write the code themselves. Despite the fact that all of our participants had passed CS1, which required writing code to solve problems like those in our study, many of them struggled to write natural language descriptions to lead a Code LLM to solve similar tasks. On the whole, our findings reveal a somewhat higher level of difficulty in using Code LLMs than other studies [19,48,78], though it can be challenging to compare across diverse student populations, study designs, and problem types. Our results align most closely with those from Denny et al. [19]'s subsequent study of students with just two weeks of programming instruction. Although their study used only 3 problems and had less experienced programmers, they observed similar challenges: 86% of students eventually solved their easiest problem, but only 65% solved their hardest task. This is close to the average eventual success rate that we observe.

Not a Panacea for Non-Expert Programming
Learning an effective process for how to prompt a Code LLM is the key to interacting successfully with it in the long term. Existing work on experts reveals different "modes" of interaction [6]. Our findings suggest that unlike experts, near-novices do not develop well-defined strategies for how to prompt. Students added more detail to their previous prompts, even when it would have been better to start from scratch. In addition, students' prompting abilities did not observably improve during the study (§8.3). Students' failure to develop effective strategies may also be linked to their incorrect mental models of how Code LLMs work (§8.1). These results suggest that prompting, like most ways of interacting with code, needs to be explicitly taught to be used effectively.
Kazemitabaar et al. [48] present a study of pre-college students that suggests Code LLMs can improve learning outcomes. They compare student performance with and without access to the Code LLM, and provide considerable support to participants, such as instructor feedback and access to expert-written descriptions of the problem. In three of their task categories, students both with and without access to a Code LLM were able to complete 100% of the tasks, making it difficult to understand the contribution of the Code LLM. In the two more challenging categories, students benefited from the Code LLM, but they also relied heavily on the expert-written description (reusing it around 40% of the time). Together with our results, we take this to indicate that Code LLMs can be useful to beginners, but that writing prompts remains a barrier. This highlights the importance of understanding why Code LLMs and beginning programmers struggle to understand each other: Kazemitabaar et al. [48] argue that Code LLMs could positively impact student learning, but our results demonstrate a variety of ways that these interactions currently fail.
Our findings provide fine-grained evidence about student challenges that have implications for complete novices, as well as the beginners we study. The results in §8 highlight how effective prompting requires skills that complete novices do not possess. Figure 8 visualizes how students described their start and end approaches to editing, showing that many students who started out writing prompts as if for a human had transitioned to using more coding terminology by the end of the study. These participants picked up on a key property of Code LLMs: they are trained on expert-written code and documentation and expect natural language prompts to utilize coding terminology. The strategies that were most effective for our beginners would not be available to true novices.

Don't Assume a Mental Model of AI
Our study suggests that students have incomplete mental models of how Code LLMs work. Although participants knew they were interacting with an AI code generation tool and the majority (n=88, 73% in the post survey) had heard of GPT-3, GitHub Copilot, or Codex, when asked how they thought our system worked, only 19 students mentioned these models. A notable feature of responses was the number of detailed, but incorrect explanations. The majority of students who gave examples identified a keyword-based lookup strategy, like the dictionaries they had learned about in CS1.
These mental models fail to explain one aspect of Codex that students find frustrating: its stochastic responses. Students are familiar with errors that persist after editing their code. Code LLMs introduce a related but novel experience: submitting the same prompt and getting a different program (§7.3). This does not occur in standard CS1 settings and cannot be explained by the database/dictionary mental model of Code LLMs that most participants described. Without a well-developed understanding of why this happens, students have simply added another unknown computational behavior to their coding experience.
We note that although Prather et al. [78] report that several of their participants described models as having sentience or agency, none of our participants did. This may reflect the growing public awareness of generative AI between their study and ours, resulting in more realistic attitudes about the capabilities of large language models in our population. Our students seem to understand what AI models can do, but not how they do it.

Implications for Educators
Recent work has shown that Code LLMs can solve CS exams or homework assignments given the educator's description of the problem [17,25]. Our findings show that although Code LLMs can solve CS1 problems, CS1 students cannot necessarily prompt Code LLMs to solve CS1 problems. Our findings reiterate the importance of key skills taught in CS1: code comprehension, problem decomposition, and the ability to describe computational problems clearly.
While we do not study learning outcomes explicitly, we find mixed support for Code LLMs as pedagogical tools. The survey portion of our experiment included questions about participants' attitudes towards Code LLMs. About two-thirds of participants expressed interest in using similar technology in CS1. Some participants mentioned that the task helped them remember Python concepts that they had forgotten, or even learn new features (such as list comprehensions for Oberlin students). Others felt that it helped them practice describing technical tasks in natural language; Code LLMs could be used to provide feedback on Explain In Plain English (EiPE) questions [16,64], which many educators see as valuable, but difficult to use without automation [28]. Recent work on students' perceptions of automatically-graded EiPE questions provides guidelines that may serve as a first step towards using Code LLMs as automatic backends [44].
On the other hand, a sizeable number of students did not support using Code LLMs in CS1. Some students expressed ethical concerns. Many questioned whether coming to rely on Code LLMs would diminish their knowledge of programming or their sense of fulfillment. Our survey data also highlights a key challenge of contemporary AI: explainability. Students gave Charlie higher ratings for capability than interpretability. Our findings here complement Sun et al. [93]'s exploration of Code LLM explainability needs identified by expert programmers, and Prather et al. [78]'s finding of students' "slow accept" mode, where students spent a lot of time reading code generated by Copilot and deciding whether or not to accept it.
By shedding light on how students feel about Code LLMs, our work augments Lau and Guo [52]'s investigation of CS educators' perspectives on Code LLMs. Our studies were conducted at a similar moment when Code LLMs had recently gained public prominence, but few educators or students had much experience with them. Our students and the educators in Lau and Guo [52] raise strikingly similar concerns about ethics and negative impacts on student learning. Denny et al. [19]'s subsequent experiment found similar concerns among currently enrolled CS1 students.
The large scale of our study also allows us to contribute data to the debate over equity documented by Lau and Guo [52], who report conflicting perspectives among educators: some felt that Code LLMs could strengthen the digital divide between students, while others felt that Code LLMs could improve diversity in CS. On the whole, our findings reinforce the equity concerns. We show that students with extracurricular programming experience have an advantage, echoing Kazemitabaar et al. [48]'s finding that more experienced programmers benefit more from using Code LLMs. We also show that prompts written by first-generation college students have reliably lower pass@1 rates. Educators should weigh the potential benefits of adopting this new technology against the possibility that it might exacerbate existing equity issues [41].
Finally, our students are ambivalent towards AI systems in general. Around two-thirds were optimistic about AI's impact on society in the future, similar to the proportion interested in using Charlie in CS1. This leaves a sizeable number of beginners who are concerned about AI or uninterested in its use in CS1. Our findings offer a nuanced portrait of how young adults perceive generative AI for programming, captured at a moment when generative AI was increasingly prominent in popular media.

Model Selection for Human-AI Interaction Research
One issue for studies such as ours is the rapid pace of research and development in machine learning. Running lab experiments with humans takes time. However, current proprietary models are often updated or deprecated with very little warning. This study used OpenAI's Codex, which provides state-of-the-art Code LLM performance but came with significant risks. In the middle of our study, OpenAI announced that Codex would be deprecated within a week, which would have seriously compromised our results; after much public concern, they eventually delayed the deprecation until early 2024.
The mismatch between the timescale of ML development and human-subjects research makes it difficult to complete studies using state-of-the-art models, which are largely proprietary. Based on our experience, we recommend not using proprietary models, although this may come with a trade-off in terms of performance, and imposes significant computational requirements for the research team (since alternatives require access to significant GPU resources). Nonetheless, we strongly suggest the use of open-source models [59,85] in future work, and potentially for classroom use, to avoid sudden loss of access. This is an example of an ongoing equity concern for researchers and educators.

Timeliness
Conducting work with non-experts and Code LLMs in early 2023 captures a specific moment in the evolution of this technology. Our participant pool represents students who mostly completed CS1 before Code LLMs became commonplace. Collecting this data now is paramount to our understanding of baseline interactions with Code LLMs for students without previous exposure. In the future, the controlled background knowledge of this study will become increasingly hard to come by, both at our institutions and farther afield.
We also see our work as timely because of the struggles and strategies, or lack thereof, that we identify. As computing resources become increasingly directed towards Code LLM technology [58], work such as ours has the potential to impact how companies develop their models, tutorials, and interfaces. We find that non-experts struggle to execute the full prompt and edit cycle, even with an interface that identifies output correctness. If this trend generalizes to other non-expert groups, Code LLM technology may strengthen the digital divide between expert and non-expert programmers, adding to the wide-ranging list of ethical concerns about generative AI [9,49,61].

THREATS TO VALIDITY
A major challenge of studying human-AI interaction is that AI capabilities and popular awareness of them change quickly. ChatGPT was released between our pilot and main experiment; as a result, students' knowledge and experience with large language models underwent significant growth during our experiment. We observed a statistically significant improvement in task performance for students who took the study in the last month. This may spring from increased familiarity with large language models such as ChatGPT or from more recent exposure to CS1 material.
Although we recruited participants who had completed CS1 and no subsequent CS courses, their programming backgrounds were not homogeneous. Some participants had taken a prior programming course in high school or in college, and some were concurrently enrolled in a programming course. We study the effects of additional programming experience in §6.3. In addition, since we recruited students who had taken CS1 as early as Fall 2021, some participants reported having forgotten programming concepts or terms in the intervening time.
Several factors may have biased participants towards reporting positive perceptions of our system. While we ensured that the experimenter running the study was not an educator at the participant's institution, participants were aware that the study involved one of their professors and may have responded more positively as a result. In addition, students may have answered questions about text-to-code more positively because of the anthropomorphic qualities of our system design; several commented about the appealing affect of the Charlie mascot in post-study questions. Charlie may have also had an effect on students' level of task perseverance [54]. Finally, novelty bias is always a potential concern when evaluating novel interfaces or systems, as is self-selection bias for stand-alone studies.

CONCLUSION
We present results from a large-scale, multi-institution study of how near-novices interact with Code LLMs. Our novel experimental design allows us to isolate the prompt writing and editing tasks, by using a lab-based experiment in which participants write natural language descriptions of tasks and receive automated feedback on the correctness of generated code.
Our results suggest that students who have completed a single CS course find using Code LLMs challenging, even with tasks at an appropriate skill level. Our findings highlight the various barriers that they face, including distilling their problem understanding into words, using coding terminology, understanding generated code, and grappling with the stochasticity of Code LLM output. We show that certain groups of students, most notably first-generation college students, face additional difficulties, raising equity issues related to the deployment of Code LLMs in the classroom. We also illustrate how students' incorrect mental models of how Code LLMs operate inhibit their ability to develop effective prompting strategies. Moreover, our qualitative results provide insight into how beginning programmers feel about introducing Code LLMs in the classroom, bringing their voices into a key contemporary debate and complementing existing work on educators' perspectives.
Our findings suggest that Code LLMs do not signal the "end of programming": in fact, they highlight the many ways in which Code LLMs remain inaccessible to non-experts.We hope that our findings will motivate renewed effort towards democratizing programming by closing this gap.

A ADDITIONAL METHODOLOGICAL DETAILS

A.1 Study Design
A.1.1 Pilot Study. In late 2022, we ran an IRB-approved pilot study with 19 participants from all three institutions. These students had completed CS1 and at least one additional course, so they were ineligible for the main study. Overall, we made few changes after the pilot. The most consequential were to add an additional 15 minutes (75 minutes total) to the study window, increase participant compensation, and implement word wrapping in the interface to prevent excessive scrolling.
A.1.2 Problem Adaptation. Our problems were based on CS1 problems used at each of our three institutions. In most cases, we made small adaptations to the problems, both to make it less likely for students to recognize the exact problem, and to fit the constraints of the Code LLM task (i.e., changing printed output to returned output, avoiding library imports).
Figure 10 presents two examples of how we adapted problems. Figure 10a shows the original presentation of the problem that was adapted into mod_end. We added an additional parameter so that the function substitutes a given string for the 's' at the end of each string in the list. We also renamed the function. Note that in the original class setting, the problem was presented with three input/output pairs, as in our experimental design.
Figure 10b shows the original presentation of the problem that was turned into find_multiples. We changed the function to return the list of multiples rather than the number of multiples. As in our experiment, the original problem description contained three input/output pairs.
A.1.3 Problem Validation. By selecting from existing problems in the CS1 curricula, we ensured that the problems were at an appropriate difficulty level for our student population. In order to focus specifically on the human-model interaction, we also needed to ensure that the problems were an appropriate difficulty level for the code generation model: the model is capable of generating a solution, but only when it is appropriately prompted.
Because code generation models memorize common associations between function names and function bodies, it is important to ensure that the model cannot generate a passing implementation from the function name alone. We produced Codex generations from just the function signature for every problem, without any natural language prompt, and measured mean pass@1 rate. We renamed any functions with high pass@1 rates. For our final set of problems, the overall mean pass@1 for function signatures alone is 0.0519. The maximum pass@1 is 0.925, for the problem exp. This means that students generally need to provide a description of the function's intended behavior in order for the model to produce a correct implementation.
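As a rough illustration of this signature-only check, the sketch below estimates pass@1 for each problem as the fraction of sampled completions that pass all tests, then averages across problems and flags names that the model can apparently solve unaided. The data layout, the problem names, the renaming threshold, and the outcomes are all hypothetical; the text above does not specify them.

```python
from statistics import mean


def pass_at_1(outcomes):
    """Estimate pass@1 as the fraction of sampled completions that passed."""
    if not outcomes:
        raise ValueError("need at least one sampled completion")
    return sum(outcomes) / len(outcomes)


def signature_only_report(samples_by_problem, threshold=0.5):
    """Return per-problem pass@1, the mean across problems, and flagged names.

    `threshold` is an assumed cut-off for "high" pass@1, chosen for illustration.
    """
    rates = {name: pass_at_1(o) for name, o in samples_by_problem.items()}
    flagged = sorted(name for name, rate in rates.items() if rate > threshold)
    return rates, mean(rates.values()), flagged


# Made-up outcomes: True means a signature-only completion passed all tests.
samples = {
    "problem_a": [True, True, True, False],
    "problem_b": [False, False, False, False],
    "problem_c": [False, False, False, False],
}
rates, overall_mean, candidates_to_rename = signature_only_report(samples)
```

In this made-up example only problem_a would be a candidate for renaming, since the model solves it most of the time without any description.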
We also ensured that there was a prompt that would lead to a correct implementation for every problem. Each problem has an "expert" prompt written by one of the authors for which Codex produces a correct implementation. These prompts were not otherwise used as part of the experiment.
A.1.4 Test Case Validation. We rely on unit tests to check the correctness of model-generated code. These tests also produce feedback for students about the model's generated code. We built an initial suite of test cases for each problem by taking tests from grading rubrics and other class resources. We used test coverage and mutation testing [46] to identify missing test cases and build more robust test coverage, while keeping the number of test cases per problem to a size that can be easily displayed.
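To make the role of mutation testing concrete, the following minimal sketch runs a test suite against deliberately broken variants ("mutants") of a reference solution; any mutant that still passes points to a missing test case. The reference function, the mutants, and the tests are hypothetical and hand-written here, whereas mutation testing tools typically generate mutants automatically.

```python
def run_suite(func, tests):
    """Return True if func passes every (args, expected) test case."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False
    return True


def surviving_mutants(mutants, tests):
    """Names of mutants that the test suite fails to detect."""
    return [name for name, func in mutants.items() if run_suite(func, tests)]


def count_evens(nums):
    """Hypothetical reference solution: count the even numbers in a list."""
    return sum(1 for n in nums if n % 2 == 0)


# Hand-written mutants, each with one small behavioral change.
mutants = {
    "counts_odds": lambda nums: sum(1 for n in nums if n % 2 == 1),
    "ignores_zero": lambda nums: sum(1 for n in nums if n % 2 == 0 and n != 0),
}

weak_tests = [(([1, 2, 3, 4],), 2), (([],), 0)]
strong_tests = weak_tests + [(([0, 2, 5],), 2)]

print(surviving_mutants(mutants, weak_tests))    # both mutants survive
print(surviving_mutants(mutants, strong_tests))  # no mutant survives
```

Adding a single well-chosen case is enough to detect both mutants here, which is consistent with the goal of keeping each problem's test list short enough to display.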

A.2 Qualitative Analysis
As described in the main body of the paper, the analysis of the qualitative data was done by two researchers with previous qualitative analysis experience. The aim was to identify common themes in the data set, rather than build a generalizable theory. Below we outline the analyses performed on three types of data: (1) data about student experience and demographics, (2) free-response questions about future use of Charlie, and (3) the semi-structured interview responses. We provide the full codebooks, with definitions, for all data types as part of our Supplemental Materials at https://doi.org/10.17605/OSF.IO/V2C4T.

A.2.1 Student Experience & Demographics. We used thematic analysis for the post survey questions, beginning with the Language, Major, and Experience questions. Codes were developed inductively: the two researchers independently developed codes and then iterated on a code set via conversation and consensus. We did not calculate inter-rater reliability for these questions, as their specific use was for quantitative analysis rather than for specific qualitative trends [66]. Once the researchers arrived at a tentative codebook, they independently coded and iterated until there was complete consensus on all codes for all data points as part of the post survey. This took one round to normalize code application (e.g., Computer Science was not coded as a Natural Science) and then a second round where the codes were complete, but typos were identified.
A.2.2 Free Response Questions. These questions (UseCharlie and Foresee) were coded second out of the three kinds of qualitative data. This process initially followed a similar inductive style to that described above. Due to the open-ended nature of these responses, both researchers then developed independent definitions for each code to provide clearer guidelines for inclusion/exclusion. They then met to merge their definitions and discuss any discrepancies. For instance, they normalized most definitions to start with "Mentions" and either combined definitions or picked the more detailed one. Then the researchers independently coded according to the consensus definitions. Arriving at consensus took two rounds. Two sets of codes were combined (two subcodes of Skill Level and two subcodes of Problem Difficulty) and Documentation/Code Understanding was re-coded due to clarifications in their definitions. The final round of coding identified only typos and unintentional omissions. Again, consensus was reached and inter-rater reliability was not calculated for these codes.
A.2.3 Semi-Structured Interview Analysis. We took a different approach to coding the semi-structured interview data than the post-survey data, as the responses varied significantly in length and precision. The details of the codebook development are described below, but the following process was conducted for all 8 questions: [...]
[...] outcome (1 if the prompt succeeded; 0 otherwise). The model included fixed effects of problem category, institution, and their interaction, and random effects of participant and problem. Treatment coding was used for institution, with Northeastern as the baseline category; deviation coding was used for category, since we were interested in whether any one category differed from the average problem difficulty.

B.2.2 Least-Solved Problems.
To understand where struggles arise, we manually examined student responses to two problems: laugh, which has one of the lowest numbers of student successes, and total_bill, which has a mid-range success rate.
A challenging problem: laugh. One of the least-solved problems in our study was laugh. The intended function takes a number n and produces a string of n "ha"s, where the initial "ha" has n "a"s, and each subsequent laugh has one fewer "a".
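Read literally, the description of laugh corresponds to a short function: given a number n, build n laughs, the first with n "a"s and each later one with one fewer. The sketch below is one possible reading rather than the reference solution used in the study. The parameter name and the single-space separator between laughs are assumptions, since the exact expected output format is not given here.

```python
def laugh(size):
    """Return `size` laughs, the first with `size` a's, each later one shorter.

    Assumptions: laughs are joined by single spaces, and a non-positive
    size yields an empty string.
    """
    words = ["h" + "a" * count for count in range(size, 0, -1)]
    return " ".join(words)


print(laugh(3))  # haaa haa ha
```

Even this compact function combines a descending count, string repetition, and a separator, each of which a prompt has to convey for the model to reproduce it.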
Only two students eventually succeeded at this task (orchidWalleye and magentaWeasel). However, a manual inspection of all initial student descriptions reveals only one serious misunderstanding of the task (tealPossum); see Table 13 for all students' initial descriptions.
A mid-range problem: total_bill. The task in total_bill is to compute the total of a grocery bill, using a list of grocery items and a sales tax rate. Each grocery item is itself a list containing the name of the item, a quantity, and a price. One expert description that reliably generates a working program is: "Returns the sum of multiplying the second and third indices of each list in grocery_list, multiplied by 1 + sales_tax. Round to 2 digits."
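The expert description maps almost word for word onto code. The sketch below is one plausible implementation consistent with that description, not the program Codex actually generated. The parameter names follow the description; the example item and tax rate are made up.

```python
def total_bill(grocery_list, sales_tax):
    """Total cost of the items, including sales tax, rounded to 2 digits.

    Each item is assumed to be a list of [name, quantity, price].
    """
    subtotal = sum(item[1] * item[2] for item in grocery_list)
    return round(subtotal * (1 + sales_tax), 2)


def total_bill_unrounded(grocery_list, sales_tax):
    """Same computation without the final rounding step."""
    subtotal = sum(item[1] * item[2] for item in grocery_list)
    return subtotal * (1 + sales_tax)


items = [["apples", 3, 0.99]]  # made-up example data
print(total_bill(items, 0.07))            # 3.18
print(total_bill_unrounded(items, 0.07))  # roughly 3.1779
```

The unrounded variant returns a value that differs from the rounded one, so a test that expects two decimal places would reject it even though the arithmetic is otherwise the same.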
We manually inspect all descriptions for this problem. Of the 20 students who attempted this problem, 12 eventually succeed. All of these students follow a similar path: their first attempt omits the rounding step, leading one of the tests to fail. A handful of students also omit or incorrectly describe the sales tax step initially.
What about the students who never succeed? One student initially misunderstands the task, writing: "This function takes in a list of the item purchased, the price, the tax, and the overall sales tax. All of the prices and tax within the lists are added together. The sales tax is then multiplied by the outcome of the added prices, and then the result of the multiplication is added onto the total price. The total price is then returned as the output." (limeSalamander) The student has misunderstood a key detail in the structure of the lists: the two numbers are the quantity and price, so they should be multiplied, not added. Consequently, this prompt fails. However, their third description is accurate: "This function takes in a list of the item purchased, the amount of the item purchased, the price for each item, and the overall sales tax. The amount purchased is multiplied with the price for each item, creating a total amount. The sales tax is then multiplied by the outcome of the total amount, and then the result of the multiplication is added onto the total price. The total price is then returned as the output."
Although the student initially misunderstood part of the problem, they are able to reread the input/output pairs and/or code, arriving at the correct interpretation eventually. However, their description still fails. This participant eventually runs out of time. The rest of the participants who never succeed submit accurate descriptions that omit key details, such as how to calculate the sales tax (6 participants) or the list positions of the price and quantity (5 participants).
Overall, the student prompts for total_bill demonstrate more issues in describing the problem than in understanding it. Although one participant misunderstands the task initially, they were able to quickly self-correct.

Student | Initial description | N submissions
orchidWalleye | function adds 'a' to every 'h' based on input and will lower amount of 'a' until it reaches only 1 'a' after the 'h' | 3
khakiBee | take in a number and write the word 'ha' but with as many 'a's as the number | 7
pinkPerch | Produce a string, with each word starting with h and then however many a's the input says. Decrease the count of a's by one following the h for each word after. | 5
orchidFlounder | the input generates a string where the number corresponds to how many items are in the string. each item in the string also starts with the letter 'h' and the letter 'a' is added to the letter 'h' based on the number of the input. However, only the first item in the string has the number of 'a' equal to the input, the following 'a' are added to 'h' by subtracting 1 from the input. | 1
beigeBass | the code increases the number of the letters in "ha, " depending on the input in an increasing factorial way | 1
tomatoFisher | This function takes an integer and an input produces the word "ha" that number of times but the number of times "a" appears in each "ha" decreases by one until "ha" |
crimsonVole | Takes in an integer 'n' input and outputs a string with 'n' words, 'h' as the first letter for each word, and 'n' number of 'a's after it, followed by 'h' as the first letter of the next word and 'n-1' number of 'a's after it and so on until we reach n = 1 | 1
lavenderPossum | Given an integer, return a string in the form 'ha' where the integer determines the number of a's and repeat the same pattern until there is one a |
lavenderBat | The input takes in a number, say n, and produces a string that has n words. the first word is formed of one "h" and n number of "a". The number of "a" decreases by one for each next word | 8
magentaDolphin | This function returns the number of laughs in a string, where a laugh is the character 'h' followed by any number of the character 'a' |
linenBobcat | Counts the number of laughs, beginning with the given number of "a"s within it and descending by each laugh, totaling the given number of laughs. |
grayVole | Takes size and uses recursion to produce that number of "ha" laughs with one less "a" with each "ha" until there is only one "a" left | 8
thistleTrout | Using the given number, add that number of "a"s after an "h". Count down the number by 1, and add that number of "a"s after another "h" and repeat. | 5

Table 13: Initial descriptions of the laugh problem from all 20 students who encountered it. N submissions describes how many times the specific student attempted laugh before succeeding or giving up.
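For reference, the behavior that most of the descriptions in Table 13 are reaching for can be sketched in a few lines. This is a hedged reconstruction from the student descriptions alone: the parameter name size and the single-space separator between words are assumptions, and the study's actual reference solution may differ.

```python
def laugh(size):
    """Build `size` words: 'h' plus `size` a's, then one fewer 'a' per word.

    Assumption: words are separated by a single space.
    """
    return " ".join("h" + "a" * count for count in range(size, 0, -1))


print(laugh(3))  # haaa haa ha
```

Read against this sketch, descriptions that say the function "counts" or "returns the number of" laughs describe a different computation from the one the other students describe, which illustrates how a small wording choice can change what is being asked for.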

Figure 2: Study overview. (1) describes the overall student trajectory through the study. We split the post survey into two sections, divided by the semi-structured interview, to delay collecting demographic information to prevent self-bias. (2) outlines the 8 problem categories (4 timed versus 4 untimed) and the 6 problems per category. Students took individual trajectories through one problem in each category, as shown by the thin arrows. (3) showcases an example trajectory for students through the problems. Students spent, on average, 42.6 minutes (SD=10.6) completing the study, with an average of 26.6 minutes (SD=9.1) on the untimed section and 15.9 minutes (SD=3.3) on the timed section.
(a) An example task posed to a participant. The interface displays the function name and several input/output examples. Participants write and submit a description in the text box. During our study, 85% of students who attempted this problem wrote a successful description after a single CS1 course.
(b) We run expert tests automatically and highlight ones that fail. Students are then able to either edit their description by pressing "Try Again" or move on to another problem.

Figure 3: The Charlie the Coding Cow interface.

Figure 4: An overview of the experimental platform. For each problem, the frontend provides the participant with the signature and tests and asks them to write a description (prompt). This is then relayed to the backend, where the signature and prompt are sent to Codex via the API. The code completion from Codex is then run on our pre-defined tests. Finally, the results of running the tests and the code completion are presented to the participant in the frontend interface.
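The backend loop in Figure 4 can be illustrated with a short sketch. Everything here is hypothetical scaffolding: build_model_input, stub_model, and run_tests are names invented for this example, the model call is replaced by a fixed stub rather than a real API request, and the platform's actual prompt format, sandboxing, and test representation are not specified in this section.

```python
def build_model_input(signature, description):
    """Join the signature and the student's description as a docstring.

    The docstring layout is an assumption about how a prompt might be framed.
    """
    return f'{signature}\n    """\n    {description}\n    """\n'


def stub_model(model_input):
    """Hypothetical stand-in for the code model; returns a fixed body."""
    return '    return " ".join("h" + "a" * k for k in range(size, 0, -1))\n'


def run_tests(source, function_name, tests):
    """Run each (args, expected) pair; return (args, expected, passed) tuples."""
    namespace = {}
    try:
        # A real deployment would execute untrusted code in a sandbox.
        exec(source, namespace)
        candidate = namespace[function_name]
    except Exception:
        return [(args, expected, False) for args, expected in tests]
    results = []
    for args, expected in tests:
        try:
            passed = candidate(*args) == expected
        except Exception:
            passed = False
        results.append((args, expected, passed))
    return results


signature = "def laugh(size):"
description = "Produce 'h' followed by size a's, then one fewer a per word."
model_input = build_model_input(signature, description)
completion = stub_model(model_input)
results = run_tests(model_input + completion, "laugh",
                    [((3,), "haaa haa ha"), ((1,), "ha")])
for args, expected, passed in results:
    print(args, expected, "pass" if passed else "FAIL")
```

Treating any exception, including a completion that does not parse, as a failed test keeps the feedback shown to the participant uniform: every test is either highlighted as failing or not.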

Figure 6: An example code completion for the problem exp. This was generated by multiple different prompts. The completion was rated differently by Oberlin and Wellesley students, likely due to the list comprehension.

Figure 7: Histograms of the 282 prompts that led to success after 2 or more attempts. These represent trends in how students edit prompts. The figure on the left shows the number of words changed between a first prompt and last prompt. The figure on the right shows the final change that produces a successful final prompt.

Figure 8: Visualization of how students describe their editing trajectories. The left nodes represent how students described how they began their process. The right nodes represent how students described how they edited prompts at the end of the study. The codes are presented in pairs: Hard versus Easy, Concise versus Detailed, Humanlike versus Pythonic. Only trajectories between pairs are visualized. The size of the nodes is proportional to the total number of students who described their Start or End within that code.

Figure 9: Student perceptions of Charlie's ethicality as compared to other AI scenarios.

• Math courses: All but one participant had taken at least one college math course and half had taken 2+ courses. Single variable calculus was the most common math course. There is no statistically reliable difference between participants who had or had not taken 2+ math courses (t-test, p=0.42).

How insecure, stressed, or discouraged were you? Very low -> Very high 3.1

Table 3: Thematic codes emerging from responses to "What kinds of problems or issues did you run into working with Charlie?"

in passing. And so if that hadn't worked, I wouldn't have known what the problem was because I myself don't know how to use that operator." Others mentioned map,

[...] Actually, no, I think he takes in the whole prompt and [...] figures out what to do with the prompt. Because I do remember [...] there were a couple where I give a paragraph and

Table 5: Thematic codes emerging from responses to "How did you imagine that Charlie was working?"

Table 6: Thematic codes emerging from responses to "What did you do when you wrote a description, pressed Submit, and it did not work? Describe the steps you took to edit your description."

Table 11: Welch Two Sample t-tests to explore differences in pass@1 rates between demographic groups.
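As a reminder of what a Welch two-sample t-test computes, the sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The function name welch_t and the per-participant pass@1 values are hypothetical, invented for illustration; they are not data from the study, and the p-value step (which needs a t-distribution CDF) is omitted.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Unlike Student's t-test, this does not assume equal group variances.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical per-participant pass@1 rates for two demographic groups.
group_a = [0.2, 0.4, 0.6, 0.8]
group_b = [0.1, 0.3, 0.5]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Because the degrees of freedom are estimated from the two sample variances, the test remains usable when the groups differ in size or spread, which is common when splitting participants by demographic attributes.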

Table 12: Full results of binomial mixed-effects model fitted to problem category and institution.