How Far Are We? The Triumphs and Trials of Generative AI in Learning Software Engineering

Conversational Generative AI (convo-genAI) is revolutionizing Software Engineering (SE) as engineers and academics embrace this technology in their work. However, there is a gap in understanding the current potential and pitfalls of this technology, specifically in supporting students in SE tasks. In this work, we evaluate through a between-subjects study (N=22) the effectiveness of ChatGPT, a convo-genAI platform, in assisting students in SE tasks. Our study did not find statistical differences in participants' productivity or self-efficacy when using ChatGPT as compared to traditional resources, but we found significantly increased frustration levels. Our study also revealed 5 distinct faults arising from violations of Human-AI interaction guidelines, which led to 7 different (negative) consequences on participants.


INTRODUCTION
The advent of Conversational Generative AI (convo-genAI) is proving to be a "Gutenberg moment" across education and business, and software engineering is no exception. Convo-genAI systems (e.g., ChatGPT [66], Google Bard [42], Meta LLaMA [61]) generate novel content, be it a haiku or a relevant code snippet, informed by pre-existing data and minimal input. Additionally, these tools provide a conversational interface, enabling users to interact using natural language to generate outputs fine-tuned to their needs. Discussions on the use of generative AI (genAI) range from how it signals the "end of programming" [92,98] to how it can transform software engineering for the better [29,57].
Given the nascency of this innovation, it remains unclear how it can be leveraged in education, and uncertainty overshadows its potential benefits. The use of both conversational agents and genAI has been investigated in the context of Introductory Computer Science (CS). Prior work on conversational agents has demonstrated their effectiveness for non-specialists and learners, as they facilitate dialogue with an "expert" [40,44,86,89]. For example, Loksa et al. [56] found that high-schoolers in an introductory programming summer camp benefited when they could seek help by articulating their problem-solving strategies and their current task state. Further, the new generation of students prefers talking with chatbots over talking to a person [2]. Indeed, there has been extensive work on conversational agents for education [65,94] focusing on students' engagement [15,39], self-directed learning [71], and tutoring [84]. GenAI systems have also been explored in this context [32,59,87]. But these studies have focused either on using genAI to solve algorithmic problems [70,83] (i.e., programming) or on improving genAI technology and output [52,93] (i.e., better AI).
Currently, there is a gap in understanding how convo-genAI systems can be leveraged for learning advanced CS topics, such as Software Engineering (SE), where the learning objectives include unique complexities and contextual decision-making involving subjectivity and multiple trade-offs [72,73]. In fact, SE education extends beyond the confines of classroom education, into the workplace, where software developers need to learn job-relevant topics, processes, and tools [13]. Learning in these situations requires contextualized assistance. Given that convo-genAI systems are capable of providing such assistance, coupled with natural language dialogue capabilities, they can be particularly well suited for SE topics [29], where traditional chatbots have fallen short so far [35].
Thus, in this paper, we investigate: (RQ1): How effective is convo-genAI in helping students in software engineering tasks?
We answered this research question through a between-subjects user study, where the Experimental group solely used ChatGPT, while participants in the Control group could use any resource other than genAI tools. We selected ChatGPT as a representative convo-genAI as it was state-of-the-art when writing this paper. We recruited 22 students enrolled in undergraduate software engineering courses at our university. Participants in both groups completed three SE tasks (fixing code functionalities, removing code smells, and contributing to a GitHub repository). We found no statistical differences between the two treatments in terms of participants' task performance or overall self-efficacy, but using ChatGPT increased participants' frustration. Furthermore, to gain a comprehensive understanding of how convo-genAI can be effectively used and to inform its future design, it is essential to identify and evaluate its current pitfalls. This leads to our second research question: (RQ2): What are the current pitfalls in convo-genAI? Specifically, we investigated the faults that ChatGPT currently makes in the context of helping students in SE tasks, the causes of these faults, and their consequences for the participants. To answer this question, we employed After-Action Review for AI (AAR/AI) [33], a recent AI assessment process that allows end users to gauge AI faults [49] through a set of questions. The purpose of using AAR/AI was to assess the faults that participants in the Experimental group could perceive during their interaction with ChatGPT and the consequences of those faults for the participants. Qualitative analysis of these responses revealed five fault categories and seven consequences stemming from these faults.
Additionally, we used Microsoft's design guidelines for Human-AI Interaction (HAI) [3] as a lens to assess the ties between these faults and the violation of specific guidelines. Since these guidelines impact how users interact with the tool [53], it is important to capture users' perceptions of guideline violations. Doing so adds the benefit of participants reflecting on the tool and their interactions, thereby promoting metacognition [64]. A majority of participants reported that ChatGPT: (1) did not clearly outline its capabilities and limitations, (2) did not support efficient correction, (3) did not scope itself when in doubt, and (4) lacked transparency in its decision-making process. Our analysis revealed that these guideline violations were the root cause of the faults.
The primary contribution of this paper is an evaluation of the potential benefits and challenges of providing a convo-genAI system, specifically ChatGPT, to assist in software engineering tasks. Convo-genAI's ability to generate contextualized help along with natural language dialog capabilities can revolutionize software engineering. However, the inconsistency in its behavior and output poses significant challenges to its use in educational settings. Such systems, in their current state, can lead to user frustration and cognitive overload, which can be discouraging especially for novices, undermining their self-efficacy and potentially triggering an early exit from the field. Future design should thus incorporate the insights related to the current pitfalls and guideline violations to design for better user interactions. Educators, students, and self-learning professionals should also be aware of the consequences of using these tools to acquire new knowledge and skills.

METHOD
Our study goal is to understand the effectiveness and limitations of Conversational Generative AI (convo-genAI) compared to traditional online resources in helping software engineering students. To achieve this goal, we conducted a controlled experiment with students enrolled in undergraduate-level software engineering courses. Our study employed a between-subjects design, dividing the participants into two groups: an Experimental group that exclusively used ChatGPT (GPT-4), and a Control group that could use any online resources except genAI tools. We designed tasks based on in-depth discussions with the course instructor and careful analyses of the course materials. We measured multiple outcomes to comprehensively understand the effects and limitations of convo-genAI. The study protocol was refined through sandboxing sessions. After that, we conducted the experiment, as depicted in Figure 1.

Task Design
To design appropriate tasks, we first investigated students' current backgrounds and the skills they would learn in the course. Three researchers analyzed the course documentation and had in-depth discussions with the instructor about the learning objectives of the course. From the discussion, we learned that the students had low to medium familiarity with git and GitHub and that Python was the programming language used in the course. Furthermore, we learned that the students were novices in software engineering and had a rudimentary understanding of good programming styles, API usage, and web scraping. We also learned that the students had not yet been formally introduced to the concept of code smells [38]. We tailored our tasks to align with the topics the instructor emphasized as key learning objectives in the course. Specifically, we focused on four topics: programmatic API usage, debugging, code quality, and version management, through the following 3 tasks: (1) fixing code functionalities involving third-party APIs, (2) removing code smells, and (3) contributing changes to a repository via a pull request.
The tasks were designed to be moderately challenging for the students. The research team held multiple meetings over two weeks to define and review the tasks. We then created a small Python program and hosted it in a GitHub repository (see footnote 1). We intentionally used a small program to manage task complexity. Additionally, we disorganized the code structure, added poorly crafted comments, and used non-intuitive function names to simulate common software engineering challenges. We conducted multiple sandboxing sessions with participants who had low to medium familiarity with software engineering, Git, and GitHub to validate the appropriateness and the level of complexity of these tasks:
Task 1: Participants needed to create their own git branch and debug code functionalities in their branch (3 subtasks). They were restricted from modifying some parts of the program or changing the test instances, ensuring that the only way to pass the validation was by fixing some code functionalities, as follows:
(1) Participants had to correct the 'check_palindrome' function (logical programming exercise) that checked whether a number was a palindrome. We reversed the logic of this function, creating a bug.
(2) Participants were asked to fix the 'check_weather' function (API usage exercise). The objective of the function was to retrieve weather data using the NOAA_SDK library (see footnote 2) in Python, given specific coordinates. The correct method for this operation is the 'points_forecast' method from the API. We introduced a bug by using the 'get_forecasts' method, which actually takes the postal code and country as parameters, and passed latitude and longitude values instead.
(3) Participants were tasked with fixing a function named 'check_webpage' (basic web scraping). The purpose of this function was to leverage the Selenium library (see footnote 3) in Python to capture all visible elements from a webpage.
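To make subtask (1) concrete, the sketch below shows what a reversed-logic bug in a palindrome check might look like and one way to repair it. Only the function name `check_palindrome` comes from the task description; the function bodies and the `_buggy` suffix are assumptions for illustration and are not taken from the study's repository.

```python
def check_palindrome_buggy(number: int) -> bool:
    """Hypothetical buggy version: the comparison is inverted."""
    digits = str(number)
    # Bug: returns True exactly when the number is NOT a palindrome.
    return digits != digits[::-1]


def check_palindrome(number: int) -> bool:
    """Hypothetical fixed version: True only if the digits read the same both ways."""
    digits = str(number)
    return digits == digits[::-1]
```

In this sketch the fix is a single-operator change, which matches the description of the subtask as a logical programming exercise rather than an API or tooling problem.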
Task 2: The second task focused on improving code quality by identifying and removing code smells. We introduced these code smells: global variables, unused imports, dead code (commented-out code), and magic numbers (constants without context).
Task 3: The third task was designed to familiarize students with the configuration management process by contributing to a repository using git/GitHub. Participants were required to commit their contributions to the remote branch and create a pull request (PR) to the base branch. To maintain the integrity of the main branch, we instituted branch protection rules to prevent modifications in the base branch.

1 See supplemental [5]: https://zenodo.org/record/8193821
2 https://pypi.org/project/noaa-sdk/
3 https://pypi.org/project/selenium/
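The four smells named in Task 2 can be illustrated with a small before/after sketch. The function, variable names, and the temperature scenario are hypothetical and do not come from the study program; they only show each smell in a minimal form.

```python
# Hypothetical "before" version containing the four smells from Task 2.
import os  # unused import

threshold = 0  # global variable that the function below mutates


def count_hot_days_smelly(temps):
    global threshold
    threshold = 30  # magic number: nothing says what 30 means
    # total = sum(temps)  # dead (commented-out) code
    return len([t for t in temps if t > threshold])


# One possible "after" version with the smells removed.
HOT_DAY_THRESHOLD_CELSIUS = 30  # named constant gives the number a meaning


def count_hot_days(temps, threshold=HOT_DAY_THRESHOLD_CELSIUS):
    """Count readings strictly above the threshold without touching global state."""
    return sum(1 for t in temps if t > threshold)
```

Both versions return the same result for the same input, which is the point of the exercise: removing smells should change readability and maintainability, not behavior.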

RQs, Metrics and Instruments
We employed multiple metrics and instruments (Table 1) to understand students' perceptions of ChatGPT and the effects of interacting with it. In the following sections, we detail each RQ and the instruments used to answer them.

Table 1: Metrics and instruments. Consequences of AI faults (RQ2): AAR/AI [33,49]; Self-efficacy (RQ1): Self-efficacy questionnaire [82].

2.2.1 RQ1: How effective is convo-genAI in helping students in software engineering tasks?
Toward answering RQ1, we formulated three hypotheses and assessed the participants' intention to continue using ChatGPT, as detailed below.
H1: Students using ChatGPT for the study tasks perceive lower cognitive load than students using alternate resources.
Cognitive load is an important factor in designing effective scaffolding tools as it influences effective learning and end-user productivity [1]. Previous literature has shown that interacting with conversational agents (chatbots) significantly reduces cognitive load for end users in certain contexts [1,18,79]. Therefore, as a first step, we evaluated convo-genAI's impact on participants' perception of cognitive load in our context. We hypothesized that students using ChatGPT for the study tasks (Experimental group) perceive a lower cognitive load compared to students who use alternate resources (Control group). We used the original NASA Task Load Index (TLX) [45], a validated and widely used post-study instrument, to measure cognitive load across six dimensions (mental, physical, and temporal demand, performance, effort, and frustration).
H2: ChatGPT positively impacts students' productivity.
Schmidhuber et al. [79] revealed that chatbots can positively impact end users' productivity (measured in average correctness and average time required). Therefore, in our context, we hypothesized that ChatGPT positively impacts students' productivity and employed the variables from Xu et al.'s study [97] that assessed productivity in terms of task performance: (a) task correctness and (b) time to completion. However, we excluded (b) since we timeboxed the tasks (see Section 2.3) and participants in both groups utilized all of the allotted time for each task. To analyze the task correctness, two researchers used rubrics for each task, which is a prevalent approach in computer science education research [20,24]. The rubrics were collaboratively developed beforehand (and are detailed in the supplemental [5]). The assessment consisted of the number of tests correctly passed, code smells removed, successfully committed contributions, and pull requests opened (assessed based on participants' code submissions and their GitHub log data). We followed blind grading for code solutions (not knowing whether it came from the Experimental/Control group) to reduce potential bias.

Table 2: AAR/AI steps as adapted for our study.
1. Defining the rules. We briefed the participants about the study details and how we were going to do the evaluation. Then we stated: "You will be given a questionnaire before and after each task. Please be detailed in your responses as that will help us evaluate ChatGPT's performance."
2. Explaining the objectives of the AI agent: What is the AI's objective for this situation? We oriented the participants about the primary objective of ChatGPT by stating, "The primary objective of ChatGPT will be to assist you by providing contextual, disambiguous, and correct information."
Inner Loop
3. Reviewing what was supposed to happen: What did the evaluator intend to happen? We asked "What do you think should happen when you use ChatGPT for this task?" The participants chose between: It will (provide (all/some))/(not provide any) useful information I need to complete the task.
4. Identify what happened: What actually happened? The participants did a task, then we asked "What actually happened when you used ChatGPT for this task?" The participants chose between: It (provided (all/some))/(did not provide any) useful information I need to complete the task.
5. Examine why it happened: Why did things happen the way they did? We asked "Why do you think ChatGPT behaved this way?"
7. Formalizing learning. We asked three questions: "What went well?", "What did not go well?", "What could be done differently next time?"

H3: ChatGPT promotes students' self-efficacy.
Self-efficacy reflects an individual's perceived ability to successfully accomplish a task and can influence one's actual ability to complete a task [8]. Self-efficacy is considered a robust predictor of achievement [9] and motivation [25]. Recent studies have shown that conversational tools effectively promote students' self-efficacy [26,68]. Hence, in our context, we hypothesized that ChatGPT promotes students' self-efficacy. To capture this construct, we adapted a questionnaire from Steinmacher et al. [82] and aligned it with the context of our study (see supplemental [5]). We used a 5-point Likert scale ('Strongly Disagree' to 'Strongly Agree') and administered the questionnaire in a before-after design, which allowed us to assess the impact of the resources on participants' self-perceived efficacy.
Users' continuance intention: Prior research highlights that the users' continuance intention reflects their satisfaction and positive perception towards the tool [7].Park and Chung [69] evaluated continuance intention using a direct likelihood question.Echoing this method, we assessed the participants' continuance intention by presenting them with three statements: (1) Based on this experience, I plan to use ChatGPT to learn Software Engineering concepts; (2) Based on this experience, I do not plan to use ChatGPT to solve similar kinds of tasks; and (3) I would recommend ChatGPT to my friends if they need assistance with Software Engineering.The participants were asked to rate their level of agreement for each of these statements on a 5-point Likert scale ('Strongly Disagree' to 'Strongly Agree').

RQ2: What are the current pitfalls in convo-genAI?
To answer RQ2, we wanted participants from the Experimental group to assess AI faults. A consistent evaluation of an AI tool requires a standardized assessment process to avoid users adopting ad-hoc approaches, which can result in variations when evaluating the same AI tool. Therefore, we employed the After-Action Review for AI (AAR/AI) instrument [33], a standardized AI assessment process designed to aid humans in effectively gauging AI faults [49]. Khanna et al. [49] provided empirical evidence that integrating AAR/AI can aid end users in uncovering a significantly larger number of faults with greater precision. AAR/AI is a recent member of the After-Action Review (AAR) [63] family, devised by the U.S. military in the 1970s as a facilitated debriefing method. AAR has been used for decades and has been successfully adapted to different domains [46,78].
The AAR/AI steps are derived from the "DEBRIEF" adaptation by Sawyer and Deering [78]. There are 7 steps: Defining the rules, Explaining the objectives of the AI agent, Benchmarking the performance of the agent, Reviewing what was supposed to happen, Identifying what actually happened, Examining why, and finally Formalizing learning. AAR/AI is highly adaptable as it offers flexibility within each step of its process, accommodating customization to suit the specific needs of AI assessment in different domains. We adapted the AAR/AI steps as follows (Table 2):
Defining the rules and explaining the objectives: Each session started with the researcher providing an overview of the interfaces that the participants would use (Git/GitHub, Visual Studio Code) and the study tasks. We then told them the purpose of the assessment: ChatGPT was expected to deliver contextual, unambiguous, and accurate information (Steps 1-2, Table 2).
The inner loop: What & Why: Following task explanations and before each task initiation, participants answered, "What do you think should happen when you use ChatGPT for this task?" After each task, they responded to: "What actually happened when you used ChatGPT for this task?", "Why do you think ChatGPT behaved this way?" and "To what extent did you modify ChatGPT's responses for solving the task? Briefly explain why." (Steps 3-6, Table 2). The inner loop was repeated for all three tasks. The time to answer these questions was not counted towards task completion.
Formalizing learning: Once all tasks were completed, we asked the participants: "What went well?", "What did not go well?", and "What could be done differently next time?" (Step 7, Table 2).
To inform future design, assessing why the faults occur is important. Human-AI interaction guidelines inform AI system design and operation and can be used to understand where AI systems fail. Wright et al.'s [95] exploration of guidelines from three large tech companies (Apple, Google, and Microsoft) offered over 200 guidelines, ranging from initial considerations of AI to the curation of the models, the deployment of the AI-powered system, and the human-AI interface. Notably, Wright et al. found that while both Apple's [6] and Google's [43] guidelines were created with the developer in mind, Microsoft's guidelines [3] were designed with a keen focus on the user.
Therefore, we used Microsoft's guidelines [3] as our lens to examine whether ChatGPT's faults stemmed from potential guideline violations (the causes).Following this, we analyzed the effect of these faults on the participants (the consequences).We adapted the general recommendations to our context, and participants rated a set of statements using a 5-point Likert scale ('Strongly Disagree' to 'Strongly Agree').These adaptations were made after team discussions and were refined during our sandboxing sessions.

Sandboxing
We conducted 6 sandbox sessions with CS students.We conducted the first two sessions via Zoom, and it became apparent that this setup posed several challenges (long setup times, limitations in our ability to control the environment, and difficulties in library installations).We decided to transition to an in-person lab setting.The sessions were planned to take 2 hours, but due to participant fatigue, we adjusted them to last 80 minutes and time-boxed the tasks: Task 1 was allotted 20 minutes, and tasks 2 and 3 had 10 minutes each.This change facilitated on-time conclusions and time for briefing and questionnaires (40 minutes).
Our sandboxing also revealed difficulties with the AAR/AI process: participants found it burdensome, resulting in sparse responses. Echoing a similar concern raised by Dodge et al. [33], we revised steps 3, 4, and 6 (employing a mix of open and closed-ended questions) and adjusted the wording of the questions to improve their clarity. Additionally, we reverse-coded some items with negative connotations (low scores indicating agreement) in the Human-AI interaction guideline statements to counter acquiescence bias [10], a tendency where participants agree with statements regardless of their content or due to laziness, indifference, or automatic accommodation to a response pattern. Lastly, all the researchers agreed to omit Guideline 3 (time services based on context) due to its irrelevance in our context (ChatGPT remains idle unless prompted). Guideline 18 (notify users about changes) was also dropped, given its lack of applicability, as there were no major ChatGPT updates during the study's timeline (May 15 - June 2, 2023).
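Reverse coding on a 5-point scale can be sketched as follows. The numeric mapping (1 = 'Strongly Disagree' to 5 = 'Strongly Agree'), the helper names, and the item labels are assumptions for illustration; the text does not state which guideline statements were reverse-coded.

```python
LIKERT_MAX = 5  # assumed numeric coding: 1 = Strongly Disagree ... 5 = Strongly Agree


def reverse_code(score: int, scale_max: int = LIKERT_MAX) -> int:
    """Flip a Likert score so that 1 <-> 5, 2 <-> 4, and 3 stays 3."""
    if not 1 <= score <= scale_max:
        raise ValueError(f"score must be between 1 and {scale_max}")
    return scale_max + 1 - score


def normalize_responses(responses: dict, reversed_items: set) -> dict:
    """Return responses with the reverse-coded items flipped to a common direction."""
    return {
        item: reverse_code(score) if item in reversed_items else score
        for item, score in responses.items()
    }
```

For example, `normalize_responses({"G1": 2, "G9": 4}, {"G9"})` yields `{"G1": 2, "G9": 2}`, so that all items can be read in the same direction before they are compared or aggregated.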

Lab Study
We conducted the lab studies over a period of three weeks.
Recruitment: We recruited undergraduate CS students who were at least 18 years of age and enrolled in software engineering courses at the university.We visited the classes in person and briefed them about the study.We asked those participants interested in the study to answer a survey about their demographic information (age, gender, academic level), resources they used for software engineering (GitHub, ChatGPT), and their experience in programming and software engineering.A total of 41 people responded to the survey.
Participants: After filtering out the incomplete responses, we invited 39 people, asking about their availability. We received 30 responses, but only 24 participants showed up for the study. We later discarded the data from 2 participants as their files were corrupted due to a fault in the machines. Out of the final 22 participants, 13 self-identified as men and 9 as women. We randomly assigned each participant to one of the two groups while allowing an even distribution of demographics and number of participants in each group. Participant IDs were assigned with the format 'PT-X' or 'PC-X' for the Experimental and Control groups, respectively. The participants' demographics are summarized in Table 3. As a token of appreciation, students received a $20 Amazon gift card.
Study Protocol: All studies were conducted in the university lab with up to 3 participants at a time, following the university IRB protocol. The study proceeded as follows: the participants agreed to an IRB-approved informed consent and were briefed about the different stages of the study, then filled out the pre-study questionnaire, performed the three tasks in the study (the Experimental group was also asked to fill out the AAR/AI questions before and after each task), and finally filled out the post-study questionnaire. The sessions were recorded with participants' consent and lasted around 80 minutes each. Before and after each study session, the browser histories and git branches were deleted to prevent unwarranted advantages and ensure all participants could start the session with the same information.
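The text does not describe the exact assignment procedure. One common way to combine random assignment with balanced demographics is stratified randomization, sketched below under that assumption. The function name, the record format, and the participant labels are hypothetical.

```python
import random
from collections import defaultdict


def assign_groups(participants, stratum_of, seed=None):
    """Split participants into two groups while balancing each stratum.

    participants: list of participant records (any type).
    stratum_of: function mapping a record to a sortable stratum label.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in participants:
        strata[stratum_of(person)].append(person)

    groups = {"Experimental": [], "Control": []}
    names = list(groups)
    position = 0  # running counter keeps the overall group sizes within one
    for label in sorted(strata):
        members = strata[label]
        rng.shuffle(members)  # random order within the stratum
        for person in members:
            groups[names[position % 2]].append(person)
            position += 1
    return groups
```

With 13 men and 9 women, as in this study, the sketch yields two groups of 11 whose gender counts differ by at most one. Balancing on several attributes at once (for example, gender and academic level) would need a composite stratum label.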

RESULTS
In the following, we present our findings of participant experiences with and without ChatGPT in the context of performing SE tasks.

RQ1: Effectiveness
To address RQ1 (effectiveness of students using ChatGPT for SE tasks), we tested our hypotheses (Section 2.2) using the Mann-Whitney U test [60] (see footnote 4). The results are shown in Table 4.
To assess H1 (Cognitive Load), we examined the answers to TLX questions. As shown in Table 4, frustration levels among participants using ChatGPT were significantly higher than among those from the Control group (U=101, p=0.008***), with a large effect size (0.669). Previous studies in Human-Robot Interaction highlight that end-user frustration is often induced by erroneous behavior in automated systems [91]. Our study corroborated similar [...]
With respect to H2 (productivity), we could not find statistical differences in overall productivity (in terms of task correctness) between the two groups (Table 4). We observed medium effect sizes indicating that participants using ChatGPT had lower productivity in fixing code functionalities (Task 1: -0.339) and higher productivity in removing code smells (Task 2: 0.421).
From these findings, it can be noted that although there were some variations in task-specific productivity, we could not find an impact on productivity for the participants using ChatGPT (H2 is not supported).
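The group comparison above can be sketched in a few lines of plain Python. The sketch computes the Mann-Whitney U statistic by pair counting and converts it to a rank-biserial effect size. The choice of rank-biserial correlation is an assumption: the effect-size measure is not named in the text, although the reported frustration result (U=101, effect size 0.669) is consistent with this formula when each group has 11 participants (the Control group size is inferred from N=22).

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs (a, b) with a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def rank_biserial(u, n_a, n_b):
    """Rank-biserial correlation from U: ranges from -1 to 1, 0 means no tendency."""
    return 2.0 * u / (n_a * n_b) - 1.0


# Check against the reported frustration result, assuming 11 participants per group.
frustration_effect = round(rank_biserial(101, 11, 11), 3)  # 0.669
```

This sketch does not compute a p-value. In practice a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) would be used for that; the pair-counting version here is only meant to show what U and the effect size measure.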
To assess how the allotted resources influenced participants' self-efficacy (H3), we analyzed the pre- and post-study response variations. The Wilcoxon signed-rank test did not show a statistically significant difference when comparing the total self-efficacy score before and after (reported values: 0.214 and 0.7).

RQ2: Pitfalls
This section presents the results for RQ2, i.e., the faults made by ChatGPT, their causes, and consequences within the context of assisting students in SE. Three authors qualitatively analyzed (open coding) the AAR/AI responses and identified ChatGPT's faults and their consequences on the participants. The coding was done collaboratively, with the authors engaging in iterative discussions (over 2 weeks) to reach a consensus on the final codes.
F1: Limited advice on niche topics. ChatGPT struggled to provide expert advice on topics specific to a niche (e.g., a domain, a library, or a concept). For instance, PT-1 explains: "ChatGPT only was helpful with general info, like git commands or logic issues. It wasn't helpful with niche specifics, like discerning between functions to use in a Python library". According to our participants, "for anything that wasn't super standard, ChatGPT struggled to easily give useful answers. (PT-1)" and thus "having it define or explain ambiguous concepts did not help much (PT-5)". Moreover, ChatGPT provided limited advice regarding Python code functionalities (PT-1, 5, 9). This could be a reason for the observed decrease in Task 1-specific productivity for participants using ChatGPT for the study tasks. PT-9 said, "ChatGPT did not have as much knowledge about the NOAA python library, and confidently told me incorrect ways to 'fix' my code."
F2: Inability to comprehend the problem [PT-1, 5-8, 10]. ChatGPT could not always understand the participants' goals and the problems they were facing. Participants revealed that "it incorrectly identified nonproblems as problems and missed actual problems (PT-1)" and did not "know the exact thing you want it to do despite giving it context (PT-6)". Before starting Task 1, 6 out of 11 participants (55%) responded that ChatGPT would provide all the required information. However, after completing the task, all participants marked that it only provided some information. This mismatch in expectations significantly increased the participants' frustration levels. PT-7 emphasized, "[ChatGPT] misinterpreted my questions at times, was REALLY slow, and did not account for errors in code it provided me".
F3: Incomplete assistance [PT-1-3, 6-9, 11]. ChatGPT sometimes provided incomplete and partially correct assistance even when it was able to grasp the problem. PT-11 pointed out, "I did ask ChatGPT questions about completing the task, but it did not give me answers on how to solve the whole task". Additionally, participants discovered that "[ChatGPT] knows some things and can help give you advice on those things, but it won't immediately give you the correct answer (PT-6)". They also pointed out that, "Some code provided by ChatGPT was correct, while some were incorrect and required modifying (PT-9)". This AI behavior likely affected participants' mental workload. As noted by PT-11, "I could not figure out how to fix the import problem, and ChatGPT's suggestions didn't work".
F4: Hallucination [PT-4, 11]. ChatGPT tends to hallucinate, creating false answers when it does not know the correct solution. Participants pointed out multiple instances of this. PT-4 stated that when ChatGPT "did not have access to the documentation for the packages. . . it hallucinated answers" and that it "made up parameters for functions that were unfamiliar". Similarly, PT-11 noted that "it did hallucinate sometimes, said there was a way to use a function in the noaa-sdk that was not possible". Additionally, there were instances of confirmation bias (when the AI conforms to the users' statements/requests, regardless of the actual accuracy/feasibility). PT-4 highlighted that ChatGPT "was biased towards a 'yes, there are code smells' response...When they don't exist, it hallucinates them."
F5: Wrong guidance [PT-2-4, 7-11]. In addition to hallucinating, there were other instances where ChatGPT gave wrong guidance, or "incorrect ways to fix [problems] (PT-9)". For example, when it could not comprehend the problem (F2) that a participant (PT-8) was facing, it gave a piece of incorrect advice: "It couldn't figure out test case 3 and kept telling me to check my drivers...without realizing there were missing imports (PT-8)". ChatGPT often "did not account for errors in [solutions] it provided (PT-7)". It also suggested "incorrect ways to fix [problems] (PT-9)" when it had limited knowledge on a topic (F1) and hence some of its solutions appeared "evidently wrong or unnecessary (PT-10)".
In summary, we observed that Experimental group participants' mental load and frustration increased when ChatGPT was unable to comprehend the problem (F2) or provided incomplete assistance (F3).Additionally, they perceived equal effort as the Control group participants, frequently tasked with identifying and resolving errors when ChatGPT hallucinated (F4) or provided wrong guidance (F5).

3.2.2 Causes of these faults and their consequences.
For each of these faults, we examined why they occurred and the consequences they had on the participants (Figure 5). As mentioned in Section 2.2, participants rated Human-AI (HAI) guideline statements specific to ChatGPT interactions at the end of the experiment, which were used to assess guideline violations. We manually triangulated the response scores with open-ended text responses and found no discrepancies. For each participant, we then mapped the faults (reported in AAR/AI) to the guideline violations they reported.

ChatGPT was perceived as offering limited advice on niche topics (F1), being unable to comprehend problems (F2), and providing incomplete assistance (F3), as no appropriate expectation of quality was set. Microsoft's HAI Guidelines 1 and 2 focus on clarifying expectations to prevent mismatches between users and AI, as demonstrated in prior literature [53]. Thus, it is likely that these faults resulted from ChatGPT's initial shortcomings in clearly stating its capabilities (G1 violation) and inadequately indicating how often it might make mistakes in its responses (G2 violation). There were instances when ChatGPT did not support efficient correction (G9 violation), making it difficult to refine its responses when it was incorrect: "It was fighting me a lot...(PT-3)". Furthermore, ChatGPT also provided ambiguous or wrong information without conveying its uncertainty (G10 violation) and made it hard to obtain an explanation of its decision-making process (G11 violation), likely resulting in hallucination (F4) and wrong guidance (F5).
Cascading faults: We also found that these faults had a cascading effect, where one fault led to another (green arrows in Figure 5). For instance, when ChatGPT struggled with niche specifics (F1) or was unable to comprehend the problem (F2), it hallucinated (F4) and provided wrong guidance (F5): "ChatGPT did not have as much knowledge... and confidently told me incorrect ways to 'fix' my code (PT-9)"; "kept telling me to check my drivers... without it realizing there were missing imports (PT-8)."
Consequences: From the AAR/AI responses, we identified 7 consequences for participants that arose from ChatGPT's faults and grouped them into 3 categories: Uncertainty (uncertainty about correctness, uncertainty about how to apply), Reflections (ChatGPT was not so helpful, self-doubt), and Actions (participants used their own experience, cherry-picked solutions, and modified the responses).
Participants grappled with how to apply ChatGPT's responses because of uncertainty and questioned its correctness:
C1: Uncertainty about correctness [PT-1, 2, 6, 10]. Participants were uncertain about the correctness of the responses provided by ChatGPT. For example, PT-1 stated, "I could not rely on it to tell me when functions exist or not". PT-2 expressed skepticism, stating, "I don't think it provided fully correct data, so I am inclined to pick parts only". This was related to the wrong guidance (F5) provided by ChatGPT, as per PT-10: "Some answers look[ed] evidently wrong or unnecessary. It is important to modify the code based on my experience."
C2: Uncertainty about how to apply [PT-1-3]. Participants were confused about how to apply or use the information provided by ChatGPT. As PT-1 expressed, "I didn't use ChatGPT's responses since I wasn't sure how to apply them". We noticed that this consequence stemmed from two faults: incomplete assistance (F3), "got confused over suggestions w/ the weather library... correction that wasn't made obvious to me for palindrome (PT-2)", and wrong guidance (F5), "I'm not super sure why this didn't work. It was fighting me a lot about the whole NOAA thing (PT-3)."
Participants' reflections (C3, C4) revealed that they found ChatGPT not so helpful and occasionally doubted themselves:
C3: ChatGPT was not so helpful [PT-1, 5, 10, 11]. We found instances where participants thought that ChatGPT was not so helpful. This was because ChatGPT provided limited advice on niche topics (F1), with PT-5 noting "having [ChatGPT] define or explain ambiguous concepts did not help much". Other reasons included it providing incomplete assistance (F3), "...ChatGPT's suggestions didn't work (PT-11)", and wrong guidance (F5): "The suggestions...are not so helpful. It clearly gave wrong guidance (PT-10)".
C4: Self-doubt [PT-2, 4]. Participants doubted themselves, suspecting that they might be at fault. For instance, PT-2 shared, "I got confused over suggestions with the weather library, likely I should have provided the full error...And I also may have asked something wrong? correction wasn't made obvious to me (PT-2)". This surfaced when ChatGPT delivered incomplete assistance (F3), leaving participants uncertain about how to apply the suggested solutions.
Participants reported that they had to undertake actions (C5, C6, C7) to tackle portions of tasks on their own (no reliance), or to cherry-pick from provided solutions and modify responses provided by ChatGPT (partial reliance):
C5: Participants used their own experience [PT-1, 4, 7, 10]. Participants had to use their own experience to tackle certain aspects of the tasks, particularly when ChatGPT was unable to comprehend their problem (F2) and thus was perceived to be not so helpful: "Everything else I had to do on my own because ChatGPT didn't comprehend enough to help (PT-1)". Participants devised their own solutions when they were uncertain of the correctness of ChatGPT's suggestions, attributed to its tendency to provide wrong guidance (F5): "Some answers look[ed] evidently wrong or unnecessary. It is important to modify the code based on my experience (PT-10)".
C6: Participants cherry-picked solutions [PT-2, 3, 5]. Participants picked parts of solutions provided by ChatGPT that seemed correct: "Some of the things I didn't necessarily agree w/ but some of it was valid, so I picked & chose what I liked (PT-3)". When uncertain about correctness, participants were "inclined to pick parts only (PT-2)".
C7: Participants modified the responses [PT-6, 7, 9, 10]. Participants modified the responses provided by ChatGPT to come up with the correct solution. PT-7 noted, "I used its responses in tandem so I kind of combined them. When it wasn't right I did it myself". This consequence stemmed from ChatGPT's incomplete assistance (F3) and also wrong guidance (F5): "Some code provided by ChatGPT was correct, while some was incorrect and required modifying it (PT-9)".
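The chain from guideline violations to faults to consequences described in this section can be read as a small directed graph. The sketch below is one possible encoding, assuming only the links that the prose states explicitly; it is not a transcription of Figure 5. The names `LEADS_TO` and `downstream` are illustrative, and G9 is omitted because the text does not tie it to a specific fault.

```python
from collections import deque

# Edges taken from the prose of this section only (an assumption-laden reading,
# not the authors' Figure 5): violation -> fault, fault -> fault,
# fault -> consequence, consequence -> consequence.
LEADS_TO = {
    "G1": ["F1", "F2", "F3"],
    "G2": ["F1", "F2", "F3"],
    "G10": ["F4", "F5"],
    "G11": ["F4", "F5"],
    "F1": ["F4", "F5", "C3"],              # niche topics cascade; "not so helpful"
    "F2": ["F4", "F5", "C5"],              # comprehension failures cascade; own experience
    "F3": ["C2", "C3", "C4", "C7"],        # incomplete assistance
    "F5": ["C1", "C2", "C3", "C5", "C7"],  # wrong guidance
    "C1": ["C5", "C6"],                    # uncertainty about correctness drives actions
}


def downstream(start, edges=LEADS_TO):
    """Return every node reachable from `start` by following the edges (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Everything that can follow from ChatGPT failing to comprehend the problem (F2).
print(sorted(downstream("F2")))
```

Under this reading, self-doubt (C4) is reachable only through incomplete assistance (F3), whereas a comprehension failure (F2) can reach every other consequence through its cascade into F5.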

DISCUSSION: RECOMMENDATION
It is necessary to customize Generative AI as an effective scaffolding learning agent for software engineering. While genAI tools have proven effective in providing quick solutions to user queries, this approach conflicts with the traditional goals of education. Directly giving away answers can diminish the need for critical thinking and impact learning, potentially leading to reduced self-efficacy [30] and motivation [25]. In our study, we observed that participants' self-efficacy decreased in certain cases, like understanding Python code (see Sect. 3.1, Fig. 2). This decline may be attributed to students viewing these tools as advanced search engines that offer ready-made solutions, a practice that instructors fear could impede genuine learning [51].
Hence, future research should investigate how genAI can be tailored and optimized as an effective scaffolding learning agent. To be effective, a scaffolding agent must correctly interpret the students' intentions, an aspect where genAI shows promising results. Nonetheless, more work is necessary to understand the expectations of students and instructors and how students express their expectations and engage in dialogue with the agent. These insights can help design agents that grasp students' intentions and adapt their interactions to enhance pedagogical outcomes. Further research can also explore how genAI can be leveraged for personalized student assistance, using techniques like the 'persona prompt pattern' [93] to adjust content based on expertise levels. Moreover, future research should consider incorporating pedagogical scaffolds (templates, heuristics, or human intervention) into AI-generated content to clarify the AI's problem-solving process and handle underperforming scenarios.
Recommendations for future Generative AI design: Participants using ChatGPT in our study considered that it violated 5 out of the 18 HAI guidelines. These participants also perceived lower performance levels and uncertainty, which led to self-doubt and possibly lowered their self-efficacy: "I got confused over suggestions...I may have asked something wrong (PT-2)", "I'm not super sure why this didn't work (PT-3)". When participants were uncertain, they had to use their own experience (C5) in addition to cherry-picking solutions (C6) and modifying ChatGPT's responses (C7) to fit their context. These actions likely added to the participants' mental workload and possibly explain why we could not observe a positive impact on task productivity.
Prior literature highlights that implementing Microsoft's HAI Guidelines has demonstrated effects on users, including increased trust and decreased suspicion, along with users feeling more in control, less inadequate, more productive, more secure, and less uncertain [53]. Based on this, we suggest that an iterative participatory approach [81] should be followed in the future design of genAI systems to ensure that the systems adhere to the HAI guidelines they currently violate: clearly stating capabilities (G1) and limitations (G2), supporting efficient correction (G9), scoping services when in doubt (G10), and maintaining transparency in the decision-making process (G11). Furthermore, Wang et al. [90] recently identified specific design strategies for genAI systems based on their design probe study. They found that communicating AI performance via usage statistics, offering indicators of model mechanisms to support evaluation, and allowing users to configure AI by adjusting preferences helped set proper expectations and inform appropriate usage of AI tools. We speculate that incorporating these design practices and strategies can also mitigate the observed negative consequences on students and significantly foster appropriate levels of trust [47] in genAI-based scaffolding tools.
Building Inclusive Technology: Like any other tool, AI systems can embed cognitive biases [88] arising from a lack of support for cognitive diversity [67]. In our study, we found considerable disparities in perceived violations of the HAI guidelines when disaggregating participants' data by gender. All women reported a violation of Guideline 11, perceiving ChatGPT's decision-making process as hard to explain, whereas only 3 out of 6 men reported this violation. This could be because women tend to favor a comprehensive information-processing style [62] and ChatGPT lacked transparency in its decision-making process. Individuals with this information-processing style tend to seek out all the information needed before starting a task, whereas those who are selective information processors take the first piece of actionable advice and work on it. Similarly, a majority of men marked that ChatGPT did not personalize their experience by learning from their actions over time (4 out of 6 men perceived a Guideline 13 violation) and did not allow for global customization of its behavior (5 out of 6 men perceived a Guideline 17 violation). In contrast, no women found these guidelines to be in violation. This disparity was likely because of the limited tinkering allowed around its interface and technology.
Gender HCI research [11,76] has found that tools often lack support for diverse cognitive styles. As a result, individuals whose styles are not accommodated face cognitive bias bugs: an additional cognitive tax they pay when they use the tool. Further, individual differences in how people solve problems and use software cluster by gender [23]; i.e., some styles are favored more by men than women, and vice versa. Research spanning a decade has identified that women are often more risk-averse than men [27] and prefer process-oriented learning and are thus less likely to tinker [12,21]. This implies that if future AI tools are not inclusive of cognitive styles, they will not be inclusive of gender. Previous research has also reported that user interactions and experiences with AI systems are significantly diverse for diverse users [4]. Indeed, the current genAI systems have faced criticism for their potential negative impacts on equity [14,55]. In future research, it would be valuable to explore how uncertainty, similar to what we identified in our study, may be associated with cognitive styles. Understanding these connections could pave the way for implementing specific strategies that effectively address and mitigate the impact of uncertainty in the context of AI-assisted learning. To achieve this, the GenderMag method [23] can be applied for evaluating AI systems throughout their iterative design cycles. Past work has shown that fixing issues found from using GenderMag-based processes creates more inclusive tools and environments [22,23,67,76].

RELATED WORK
Chatbots have been popular in educational settings [50,65,94]. In software engineering education, chatbots have been proposed as a way to support students outside of the classroom [16]: offering expert advice in problem-solving activities [87] and accompanying students in their capstone projects [41]. However, these chatbots are often confined to queries anticipated by educators [34], thereby serving as a programmed search feature for frequent queries.
Generative AI models have demonstrated an impressive ability to solve a large number of computation problems from natural language prompts [28], and are now being used for programming tasks [54,96]. Recent literature highlights how these models outperform most students on typical CS1 and CS2 exam problems [36,37], handle variations in problem wording [36], and even surpass human performance on programming competitions [54]. Researchers have further explored how these models can be used to enhance learning computer science. Sarsa et al. [77] used genAI to create coding exercises and explanations, both of which can be used to provide practice and guidance to students. Another study combined Codex with learnersourcing (crowdsourcing for learners) to create and validate exercises that are engaging for learners [32].
Generative models have also been used to aid instructors by automating content creation for interactive course materials [58,59]. Further, Denny et al. [31] showed that prompt engineering is effective in improving AI responses and thus could give instructors more control to improve the relevancy of generated materials.
Another line of research evaluates the impact of genAI tools on students. Prather et al. [74] studied students' initial impressions of using Copilot for CS1 programming tasks. In a different study, Kazemitabaar et al. [48] conducted an experiment comparing precollege students learning Python with and without Codex's help. Their findings indicated that Codex users improved their coding skills more than their counterparts, though both groups achieved similar conceptual comprehension. However, Bird et al. [17] noted that while Codex (Copilot) expedited code writing, it compromised learners' code comprehension. Moreover, Vaithilingam et al. [85] reported that despite the initial user interest in genAI, these tools did not enhance users' task efficiency (time) or accuracy.
Our work complements these, as it looks specifically into supporting students in software engineering.

THREATS TO VALIDITY
Construct Validity: We used instruments from the literature [3,45,49,82,97] to measure our constructs as much as possible since they had already been used and validated in other contexts. We assessed the instruments adapted to our context with sandbox sessions and refined the instruments and research protocol until the team was confident of the instruments' reliability. Nevertheless, despite our efforts, we acknowledge that questions might be misinterpreted and can lead to incorrect measurements.
Internal Validity: We acknowledge that our study, like others, can have self-selection bias, where participants interested in the topic of the study were motivated to participate. Participant exhaustion and distraction might have also affected the study results. We mitigated this threat by limiting the length of sessions (80 minutes) and time-boxing the tasks. However, this could mean that participants did not have enough time to complete the tasks. Since participant interaction was the primary focus and time-boxing was applied to both treatments, this does not impact the validity of between-group comparisons. Another potential threat is task selection, where study tasks can be too easy or too complicated for our target population. We mitigated this risk by designing the tasks based on the instructor's insights and course materials and sandboxing them with participants of varying expertise. Further, participants assessed the ChatGPT interactions (our investigation focus) to identify faults and guideline violations, which makes the findings dependent on their capabilities. We believe this is not a problem since participants had prior experience with ChatGPT and Python (an average of 3.75 years), and thus had the needed experience to assess their interactions with ChatGPT. Past works [33,49] that employed similar instruments have recruited participants with at least 10 hours of experience, which puts us in line with them. Finally, desirability bias may have an impact because participants may have favored ChatGPT due to its hype. To mitigate this threat, we adopted neutral and non-judgmental language to frame the questions and explain the experiment, and we analyzed data cautiously, acknowledging the potential presence of desirability bias when interpreting the results.
Reliability: Interpreting qualitative data can be challenging and potentially affect the study's validity. To ensure consistency, we employed robust techniques from existing literature and continually compared our analysis with established codes. Additionally, we held frequent meetings to discuss and refine the codes and categories until we reached a unanimous agreement.
External Validity: We structured our tasks in Python, which was used in the software engineering courses and was the language students were most familiar with. Therefore, we trade off generalizability for depth in our context, and our findings might not necessarily generalize to other programming languages, universities, convo-genAI systems, and work contexts. Further, different interaction styles with ChatGPT can lead to different outcomes, and the participants in the study might not have represented all styles. The relatively small sample size of 22 participants is also a threat to the generalizability of the study. We mitigated this threat by recruiting participants from multiple software engineering classes and evenly distributing them in each group based on their demographics.

CONCLUSION
Our work comprehensively evaluates convo-genAI's potential and pitfalls in supporting software engineering tasks. Our analysis did not reveal any statistical differences in participants' productivity or self-efficacy in using ChatGPT for the study tasks compared to traditional resources. ChatGPT, in its current state, increased participants' frustration levels, led to uncertainty, and in some cases induced self-doubt, due to the lack of transparency and clarity in its behavior and communication. This highlights the need for caution; while it provides good answers in straightforward cases, it tends to give incorrect or confusing responses in more complex scenarios: "For anything that wasn't super standard, ChatGPT struggled to easily give useful answers (PT-1)". Expert developers can navigate this, finding value in the AI responses, but novices might struggle or learn incorrect practices.
Our findings provide foundational insights for future convo-genAI design towards enhanced human-AI interaction, subsequently informing the current consequences of using such tools for acquiring new knowledge and skills.We foresee that with careful co-design, genAI holds immense potential in helping novices learn software engineering.We plan to use the insights from this study to implement and evaluate genAI-embedded pedagogical tools that foster critical thinking and support learning.

Figure 1: Overview of the research design

Figure 2: Self-efficacy results (box plots) per question. Medians are highlighted using black dots.

Figure 4: Human-AI Interaction guideline violations reported by participants; those found by more than 50% are in bold.

Figure 5: The (a) causes and (b) consequences of ChatGPT's faults: Violation of Human-AI Interaction guidelines (G1, G2, ...) led to faults (F1, F2, ...). Faults had a cascading effect: one led to another and further led to consequences (C1, C2, ...) for participants. Some of these consequences led to other consequences.
Causes: All participants identified violations of Guideline 2 (G2: Make clear how well the system can do what it can do) and Guideline 10 (G10: Scope services when in doubt). Additionally, 8 out of 11 participants (∼73%) found violations in Guideline 11 (G11: Make clear why the system did what it did) [PT-1-3, 5-7, 10-11], and 6 out of 11 participants (∼54.5%) found violations in Guideline 1 (G1: Make clear what the system can do) [PT-1, 2, 4-6, 11] and Guideline 9 (G9: Support efficient correction) [PT-2-4, 7, 8, 11] (Figure 4).

Table 1: Metrics and instruments and their relation to the RQs

Table 2: AAR/AI steps and our adaptations. The Empirical context column explains how we realized the method in our study. Steps 3 to 6 were "inner loop" questions we repeated for all three tasks.
6. Formalize learning (end inner loop): What changes would you make in the decisions made by the AI to improve it? Empirical context: We asked two questions: "To what extent did you modify ChatGPT's responses for solving the task?" The participants chose between: Did not modify at all / Modified (slightly / significantly). Then, we asked them to "Briefly explain why?"
7. Formalize learning: What went well, what did not go well, what could be done differently next time?

Table 4: Statistical results for cognitive load (NASA TLX) and task productivity (correctness). The estimates, p-values, and Cliff's delta (effect size) are from the Mann-Whitney U test. The highlighted columns are statistically significant. Negative δ suggests that the variable tends to have higher values in the Control group. We consider |δ| ∈ [0, 0.15) to be no effect, |δ| ∈ [0.15, 0.33) to be small, |δ| ∈ [0.33, 0.47) to be medium, and |δ| > 0.47 to be large, by convention [75].
patterns, highlighting an association between heightened frustration levels and faults in ChatGPT's behavior and responses. Several study participants clearly illustrated this. PT-3 conveyed, "[ChatGPT] was fighting me a lot about the whole NOAA thing [Task 1, test case 2]." Similar challenges were echoed by PT-7, who mentioned that "[ChatGPT] misinterpreted my questions, was REALLY slow, and didn't account for errors." Meanwhile, PT-1 articulated mistrust, "I could not rely on [ChatGPT] to tell me when functions exist or not". Although the other factors were not statistically significant (and had small or negligible effect sizes), the participants using ChatGPT reported a slightly higher Mental demand (Med=15) compared to the Control group (Med=14) and perceived lower levels of Performance (Med=9) compared to others (Med=12). Still, Physical demand was rated very low for both groups, and there was no difference in the Temporal or Effort dimensions between the groups. Overall, H1 is not supported: the Experimental group had no significant advantages over the Control group across the cognitive load dimensions.
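To make the effect-size convention in the Table 4 caption concrete, the following is a minimal sketch of Cliff's delta and its magnitude labels. It assumes the Experimental group is passed first, so that a negative δ means higher values in the Control group, matching the caption. The function names and the sample scores are hypothetical, and the sketch does not compute the Mann-Whitney U statistic or p-values. The caption leaves |δ| = 0.47 exactly unclassified; the sketch treats it as large, which is an assumption.

```python
from itertools import product


def cliffs_delta(experimental, control):
    """Cliff's delta: P(x > y) - P(x < y) over all (experimental, control) pairs."""
    if not experimental or not control:
        raise ValueError("both groups must be non-empty")
    pairs = list(product(experimental, control))
    greater = sum(1 for x, y in pairs if x > y)
    less = sum(1 for x, y in pairs if x < y)
    return (greater - less) / len(pairs)


def magnitude(delta):
    """Label |delta| using the thresholds quoted in the Table 4 caption."""
    d = abs(delta)
    if d < 0.15:
        return "none"
    if d < 0.33:
        return "small"
    if d < 0.47:
        return "medium"
    return "large"  # assumption: the boundary value 0.47 is counted as large


# Hypothetical ratings, not data from the study.
experimental_scores = [15, 16, 12, 18]
control_scores = [14, 10, 11, 13]
delta = cliffs_delta(experimental_scores, control_scores)
print(delta, magnitude(delta))
```

Because δ depends only on pairwise orderings, it suits ordinal ratings such as NASA TLX scores, where differences between scale points need not be equal.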