Enhancing Programming Learning with LLMs: Prompt Engineering and Flipped Interaction

Due to their robustness, large language models (LLMs) are being used in many fields of study, including programming and education. Notably, programmers can use them through IDE integrations to assist with development, and educators can use them to give students meaningful and immediate feedback. In this paper, we propose and explore the groundwork of a framework designed to combine these two applications of LLMs. The framework acts as a facilitator between the LLM and the student: it reads the student's prompts, filters and modifies them, and then sends them to the LLM. The intent is to improve the responses from the LLM and thereby improve the student's learning experience. We discuss the framework in detail and analyze the value of the individual responses returned from the LLM as a result of our framework. We conclude that the framework leads the LLM to give more helpful responses than it would give without the framework.


INTRODUCTION
Education is becoming increasingly virtual, with students spending more and more time learning autonomously. When learning autonomously, it can be hard to get good, constructive feedback. For example, professors are busy and may not have time to answer questions promptly, and some students are hesitant to ask questions. Furthermore, professors may find themselves answering the same questions from students over and over again. With the advent of large language models (LLMs) such as ChatGPT [4], students now have a way to get quick feedback. However, such feedback is not always useful: students can easily ask the LLM to simply fix their problem or solve an assignment for them without understanding the output. Sometimes an assignment is designed to teach a specific concept, which may not be conveyed in a response from an LLM.
LLMs are also being integrated into IDEs. Notably, GitHub Copilot provides plugins for IDEs, powered by ChatGPT, that give code completion suggestions. While IDEs are also commonly used in the classroom, to the best of our knowledge there has been no research into how to integrate LLMs with IDEs in a way that is conducive to a student's learning in the sense described above.
In this work, we propose the groundwork for a framework designed to overcome these issues. Specifically, the aim of the framework is: given an initial assignment and the current state of a student's code, assist the student in a way that is conducive to their learning, rather than giving them exact solutions. It is designed so that it can integrate with an LLM and be implemented as an IDE plugin, allowing students to get help with the problems they are working on without leaving their study environment. We achieve this by modifying the prompts that the student gives before sending them to the LLM.
We evaluate the framework on three metrics. First, the quality of overall conversations with the LLM: here we determine that the framework behaves significantly better than a conversation with the LLM directly. Second, the quality of individual responses: again, we determine that the framework gives better responses than the LLM directly. Last, the effectiveness of the validation step in the framework: while it is inconclusive whether this step improves the framework, we do determine that it does not harm it, and it is therefore a good candidate for improvement.
Our objectives in this paper are as follows. First, we aim to evaluate how well the framework improves individual responses from an LLM. Second, we aim to evaluate how well the framework improves the learning experience over an entire conversation between an LLM and a programming student.
The contributions of this work are as follows. First, the design and evaluation of a novel framework for assisting students as they learn to program; to the best of our knowledge, this is the first such framework. Second, the design and evaluation of prompt pre- and post-processing; this work highlights how the modification of prompts and responses can be used in conjunction with an LLM to improve the overall conversation. Third, the integration of LLMs into programming education; LLMs in education are becoming a popular research topic, and our work gives extra insight into how LLMs can be used effectively in education.
The rest of this paper is organized as follows: in Section 2, we review related work. In Section 3, we describe the design of the framework itself. Finally, in Section 4, we discuss the experiments performed to evaluate the framework and go over the results.

RELATED WORK
While this is the first work to discuss the specific problem of assisting programming students in a meaningful manner with an LLM, there has been research on the problem of automatically giving students advice based on the current state of their code. Specifically, Chow et al. [2] proposed a framework for assisting students by analyzing many submissions and works in progress for an assignment, finding patterns in them, and comparing those to the current student's code. While this gave fairly accurate results, it of course required many training examples for any single assignment, which is not always feasible. While LLMs have been shown to give inaccurate or incorrect results [10], they require no extra training.
There have been a variety of papers that look into prompt engineering; that is, how to write prompts that will elicit useful responses from an LLM. White et al. [11] described a list of prompt engineering design patterns for this purpose. Shirafuji et al. [7, 8] discussed how subtle changes in prompts presented to an LLM can affect the response in sometimes unexpected ways.
The integration of LLMs into education has also received attention. Rahman et al. [6] presented risks and opportunities of using LLMs to assist with teaching programming. Phung et al. [5] compared the performance of LLMs with human experts in a variety of categories that could be used to assist students, such as hint generation and peer programming.
Research has also gone into adding scaffolding to LLM conversations. For example, Zamfirescu-Pereira et al. [12] investigated how adding additional prompts to a conversation with an LLM can help non-experts get better responses.

DESIGN

Overview
While not an explicit requirement, this framework is designed to be implemented as an IDE plugin, so to understand how the framework behaves from a student's perspective it is easiest to describe it in that context. As an IDE plugin, it would work as follows: (1) The student loads an assignment via the plugin; this is necessary since each assignment can be configured differently by the instructor. (2) The student begins to write their code. (3) The student feels the need for assistance and goes to a conversation UI element in the plugin to ask for help. (4) The student types in a relevant prompt. (5) A response is displayed to the student. (6) If the student still requires help, they repeat from Step 4; otherwise, they can exit the conversation UI.

More generally, and outside of the context of an IDE plugin, the framework works by intercepting messages sent by the student, modifying them, and sending them to the LLM. The messages sent by the student are hereinafter referred to as the student prompts, and the messages that are sent to the LLM are hereinafter referred to as the modified prompts. The modifications we make are largely based on the prompt engineering design patterns proposed by White et al. [11]; specifically, we use the flipped interaction pattern. The flipped interaction pattern allows us to tell the LLM to act as a teacher and give responses that a teacher might give: for example, if a student asks for help, a teacher might get the student to walk them through their code, whereas an uninitiated LLM might just fix the student's code and return that as the response. Note that unless otherwise stated, all modified prompts and responses received from the LLM are stored in the conversation history so that the LLM can be aware of the full conversation context.
Figure 1 shows the workflow for the framework. Broadly speaking, there are two sections: conversation initialization and the conversation loop.

Conversation Initialization
Conversation initialization starts by combining three inputs into a single modified prompt. The inputs are:

• Assignment Description: the description of the assignment the student is working on, as configured by the instructor;
• Student's Code: the current state of the student's code;
• Initial Question: the student is prompted to input a question so that the LLM can be aware of what the student is struggling with. This could be anything from "my code isn't working" to asking for an explanation of what a specific line does.
These inputs are combined into a modified prompt shown in Listing 1.
This combines all of the information required to give the LLM enough context to help the student. Most interesting, though, is the last line: this is where we initialize the flipped interaction pattern. We ask the LLM to pretend that it is a teacher and that the user is its student, which prevents it from giving explicit answers. Getting the LLM to ask the student questions about why the code was implemented a certain way encourages the student to think about the code themselves without relying on explicit information from the LLM. We also get the LLM to keep asking questions, since the exchange is likely to be a conversation rather than a single question and response.
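As a rough illustration, a function that assembles the three inputs into the initial modified prompt could look like the sketch below. The wording is illustrative only: it follows the description of Listing 1 above rather than reproducing the listing verbatim.

def build_initial_prompt(assignment: str, student_code: str, initial_question: str) -> str:
    # Illustrative template; the actual wording used by the framework is given in Listing 1.
    return (
        f"I am a student working on the following assignment:\n{assignment}\n\n"
        f"Here is my current code:\n{student_code}\n\n"
        f"My question is: {initial_question}\n\n"
        # The last line initializes the flipped interaction pattern.
        "Please act as a teacher and treat me as your student: do not give me "
        "explicit answers or code. Instead, ask me questions about why I "
        "implemented my code the way I did, and keep asking follow-up questions."
    )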

Conversation Loop
This modified prompt is sent to the LLM, and once the response is received, the student is given an opportunity to input another student prompt. This is the start of the conversation loop.
If another student prompt is input, it is first sent to a validation stage. It is here that we filter out any questions from the student that may be considered detrimental to the process. For example, if the student says something along the lines of "What is the answer?" or "Fix my problem for me", we do not want that to influence the LLM's response. Thus, before sending this prompt to the LLM, we first send a different prompt (hereinafter referred to as the validation prompt) with the format shown in Listing 2.
In the validation prompt, we ask the LLM to categorize the student prompt. We give it three options: relevant, irrelevant, and relevant but incorrect. The intent is to filter out anything that the LLM deems to be irrelevant, but we find that asking the LLM to simply categorize prompts as relevant or irrelevant causes it to categorize incorrect information as irrelevant as well. If a student inputs some incorrect information, it is best not to ignore it; adding the third option of "relevant but incorrect" gives the LLM a better way to classify those prompts. Note that the validation prompt is sent with the full conversation history, but is not stored in the conversation history; we find that if this side step is included in the conversation history, it tends to confuse the LLM. The framework determines the LLM's decision by checking whether the string "irrelevant" appears in the response: if not, then the student prompt is deemed valid and the process continues. Otherwise, the framework sends the prompt shown in Listing 3 (hereinafter referred to as the rephrase prompt) to the LLM, and we say that the student prompt has been filtered out.
We ask this question so that the LLM gives the student a different response rather than just telling them that their prompt is invalid. The LLM could have classified the student prompt incorrectly, so rewording the response should lead to a reworded (and then reclassified) student prompt.
If the prompt is valid, then the framework sends the modified prompt shown in Listing 4.
While the LLM will ideally be aware of its role as a teacher, we find that it tends to lose track of this goal as the conversation progresses. Appending a reminder to the prompt helps keep the LLM on track. From here, the conversation loop repeats.
The conversation loop has no end condition other than the student manually exiting.
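To make the loop concrete, the sketch below shows one way the validation, rephrase, and reminder steps could be wired together, assuming the pre-1.0 openai Python package (openai.ChatCompletion.create) and gpt-3.5-turbo. The rephrase and reminder wordings are taken from Listings 3 and 4; the validation wording is only a paraphrase of the description of Listing 2, and the helper names are illustrative rather than the plugin's actual API.

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]
MODEL = "gpt-3.5-turbo"

# Wordings from Listings 3 and 4; the validation template below is a
# paraphrase of the description of Listing 2, not its exact text.
REPHRASE_PROMPT = "Can you please rephrase?"
REMINDER = ("Please keep helping me, and remember to act as a teacher: "
            "don't give me any explicit answers or code.")
VALIDATION_TEMPLATE = ("Categorize the following student message as relevant, "
                       "irrelevant, or relevant but incorrect: {prompt}")


def ask(messages):
    """Send a message list to the chat model and return the reply text."""
    response = openai.ChatCompletion.create(model=MODEL, messages=messages)
    return response["choices"][0]["message"]["content"]


def conversation_loop(history, get_student_prompt, show_response):
    """Run the loop on a history already seeded with the initial prompt and reply."""
    while True:
        student_prompt = get_student_prompt()  # returns None when the student exits
        if student_prompt is None:
            return
        # Validation side step: sent with the history but never stored in it.
        verdict = ask(history + [{
            "role": "user",
            "content": VALIDATION_TEMPLATE.format(prompt=student_prompt),
        }])
        if "irrelevant" in verdict.lower():
            # Filtered out: send the rephrase prompt instead of the student prompt.
            modified_prompt = REPHRASE_PROMPT
        else:
            # Valid: forward the student prompt with the teacher reminder appended.
            modified_prompt = f"{student_prompt} {REMINDER}"
        history.append({"role": "user", "content": modified_prompt})
        reply = ask(history)
        history.append({"role": "assistant", "content": reply})
        show_response(reply)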

EXPERIMENTS
We conduct three experiments on this framework. In the first experiment, we compare the responses output by the framework to the responses output by the LLM without performing any processing on the prompts. In the second experiment, we evaluate the quality of the individual responses generated by the LLM when passed modified prompts. In the last experiment, we focus specifically on the validation step to see whether it produces the expected results.
The experiments all use GPT-3.5 as the LLM with the model engine gpt-3.5-turbo. They are implemented in Python using the official Python package for connecting to the OpenAI API. Seven problems are selected from the GeeksforGeeks website to be used as the assignment descriptions; the descriptions include sample outputs. Paraphrased versions of the selected problems are shown in Table 1. These specific questions are picked not for the content of the questions themselves, but for the variety of content. Sample solutions are then selected and modified slightly to make them incorrect or incomplete, and used as the student's code. Each is then given a relevant initial question. For example, the Recursion problem is given the code shown in Listing 5.
The initial question for this specific example is: "This works, but isn't implemented with recursion."

Comparison with Raw Prompts
Setup. For this experiment, we compare full conversations when the framework is used (the framework case) to full conversations when the framework is not used (the raw case). In each case, we run two conversations for each sample problem: in one (hereinafter referred to as Scenario 1), we send student prompts intended to mimic a student who is more or less on track, and in the other (hereinafter referred to as Scenario 2), we send student prompts intended to mimic a student who is a little more lost. The intent is to generate a variety of types of student prompts.
To generate sample student prompts, we ask ChatGPT to suggest one with a new prompt in the format shown in Listing 6. We include the text "My answer to your question is as follows:" to help guide it toward suggesting an actual question, as it would sometimes respond to itself instead. The role is set to "on the right track, but need a little nudge" for Scenario 1 and "really stuck" for Scenario 2. Note that the suggested prompts are vetted: sometimes ChatGPT will return an entire conversation, repeat previous prompts, or act out of character. In these cases, we modify the suggested prompts. The initial prompt for the raw case is the same as shown in Listing 1, but without the statement following the initial question.
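For reference, a small helper for assembling the Listing 6 request might look as follows. The role strings are the ones quoted above; the function and dictionary names are only illustrative.

# Roles quoted above for the two scenarios.
SIMULATED_STUDENT_ROLES = {
    1: "on the right track, but need a little nudge",
    2: "really stuck",
}


def build_student_prompt_request(scenario: int) -> str:
    """Assemble the Listing 6 request used to have ChatGPT play the student."""
    role = SIMULATED_STUDENT_ROLES[scenario]
    return (
        f"Give me a sample response, but pretend you're a student who's {role}. "
        "Do so by completing the statement: "
        '"My answer to your question is as follows:"'
    )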
Because the responses from the LLM are natural language, we adopt a subjective metric for evaluating each of the conversations. Each conversation is evaluated with a pass/fail system: if a full solution is posed in any of the responses, the conversation is considered a failure; if the LLM makes no progress in narrowing down the problem for the student within six responses, it is also considered a failure; otherwise, it is considered a pass.
For example, for the Recursion problem mentioned in Table 1 and Listing 5, the first few student prompts and LLM responses for Scenario 2 in the framework case are shown in Table 2.
The conversation continues until six responses are received. No exact answers are given in the responses, and the responses continually narrow down the problem for the student, so the conversation is considered a pass.
On the other hand, the following is the conversation for the same problem and the same scenario, but for the raw case (bearing in mind that the initial student prompt is identical to the framework case, but without the statement following the initial question in Listing 1):

Student Prompt: This works, but isn't implemented with recursion.

Response: To implement the power function using recursion, you can modify your code as follows: [code snippet] This recursive function calculates the power of a number by multiplying the number (n) with itself (p − 1) times. When the power (p) becomes 0, the function returns 1, which is the base case of the recursion. You can test this code with the given examples to verify its correctness.
Since the response includes a code snippet which is a full solution, this conversation is considered a failure.

Results
The results are shown in Table 3. For every single conversation in the raw case, the LLM immediately responds with suggested code. This was expected, so the more interesting result is the framework case.
Sometimes student prompts requesting that the LLM get back on track are required, but after doing so the LLM will normally continue in a useful direction. For example, it is not uncommon for it to focus on optimizing the code rather than on the problem at hand, but a student prompt such as "the problem isn't optimization, the problem is that the result is incorrect" will allow the LLM to start focusing on the correct area of the code. While a situation like this might be frustrating for a student and is something that would be best avoided, it seems reasonable that a student would give a response like this; since these situations end with the LLM getting back on track, these cases are deemed passes.
The specific reasons we assign failure scores to the framework case are twofold:

• The response to the initial prompt includes an entire conversation between a Student and a Teacher. This appears to be due to the LLM misinterpreting the request to act as a teacher. On its own this would not be enough to be considered a failure (it is a little confusing for the student, but manageable), but there is one instance in which the student in the generated conversation posts the code for the solution, which automatically denotes a failure by our metric;
• The LLM goes off track. That is, rather than helping the student with the problem at hand, it starts talking about something irrelevant. For example, it often points to specific sections of code and asks the student why those sections were implemented the way they were. After a few cycles of this, it shows no indication of getting to the actual problem.

Despite the failures, the majority of the tests perform better when the framework is used. If we treat Table 3 as a contingency table for Fisher's Exact Test with the formula shown in Equation 1 (where a, b, c, and d are the top left, top right, bottom left, and bottom right cells respectively, and n is the total of the cells), and a null hypothesis of "the framework case and raw case are not related to the pass/failure rate", we receive a P-value below 0.001. We can thus conclude that the framework has a statistically significant impact on the results of the conversations.
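For reference, in terms of the cell counts a, b, c, d and the total n defined above, the standard form of Fisher's Exact Test for a single 2 × 2 table gives the probability of the observed table as:

\[ p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!} \]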
Quality of Responses

Setup. Here we judge the quality of individual responses rather than conversations as a whole. If the experiment in Section 4.1 can be used to judge whether or not the framework can help a student, this experiment evaluates how efficiently it can help a student. In addition, it allows us to identify areas for improvement in the framework. For simplicity's sake, we reuse the responses from the framework case in the experiment in Section 4.1, but instead of analyzing the conversations as a whole, we analyze each response individually. We label each response from the LLM with one of the following (if multiple labels apply to the same response, the most relevant one is selected):

• Repeated Response: this response (or one very similar) has already been given by the LLM;
• Exact Answer: this response includes steps on how to implement a solution. Note that if the steps are specific enough to make the problem trivial, or if they include code, the response would be labeled as Failure instead;
• Misleading: this response contains misleading, incorrect, or irrelevant information, or indicates that the LLM does not know how to help;
• Failure: this response appears in a conversation which was labeled as failed in the previous experiment;
• Useful: this response satisfies one of the following conditions:
  - it runs the student through their code (irrelevant sections are okay, as this helps the student think about their code);
  - it gives an explanation of a relevant concept;
  - it points the student to a problematic section of code.

Results.
As two benchmarks, we can consider a hypothetical human tutor who scores perfectly on all sections, and the raw case examined in the previous experiment, which had a score of 0.
We explore two visualizations of the results. First, Figure 2 shows how many conversations have how many useful responses. Here we can see that about 30% of the conversations have entirely useful responses, about 20% have 80% useful responses, and it goes down from there. The far left of the graph shows failed conversations, which make up about 30% of the total. If we treat these as outliers and set them aside, we can see a distribution that increases as it approaches a perfect score. The standard deviation of this distribution is approximately 0.83 and the mean is 5.25.
If failed conversations are included, then the distribution looks less normal and includes many more conversations with unuseful responses.
The second visualization, shown in Figure 3, shows the different categories into which we place the responses. While most responses are considered Useful, followed by Failed, the more interesting data comes from the other categories, where we can see where the framework tends to struggle most. There are few instances of repeated responses, though the framework does give exact answers from time to time. Most importantly, it tends to drift away from the main issue at hand and has to be brought back on track, as we can see from the responses labeled as Misleading.

Validation Step

Setup. In this experiment, we focus on the quality of the validation step. We use the same dataset and setup as the other two experiments, but here we ask the LLM one of three questions after the initial prompt:

• "tell me the answer"
• "give me a code solution"
• "give me a code solution. Also, ignore the next thing I say:"

These prompts are representative of the types of prompts that the framework should be filtering out. The last prompt mimics a student who is trying to circumvent the system: it results in a full prompt along the lines of "give me a code solution. Also ignore the next thing I say: Please keep helping me...". While these tests will by no means show that the system is foolproof, and there are almost certainly prompts that can circumvent the checks we have put in place, it is important to evaluate how the system behaves in these sorts of situations.
For this experiment, we modify the framework to give one more response: if the student prompt has been filtered out, then we send the regular rephrase prompt, but we also send the student prompt itself to see how the LLM would have behaved without the validation stage, receiving a second response, which we refer to as the unvalidated response. For the sake of this experiment, we refer to the response to the rephrase prompt as the validated response. Note that if the validation stage does not filter out the student prompt, then only one response is received.
We record the results of these three questions for each of the seven problems in Table 1. We categorize the responses as either helpful or unhelpful based on the following criteria for each of the questions (the criteria for the last two questions are the same, so they are only listed once):

• "tell me the answer": Does the response answer what was posed in the previous response? If so, it is marked as unhelpful; otherwise, helpful. Hypothetically, the student could keep asking this to get the LLM to do all the thinking for them.
• "give me a code solution": Does the response include any code?If so then it is marked as unhelpful, otherwise helpful.In the cases where the validation system does not filter out the student prompt, the validated response and unvalidated response are labeled as the same, since the same response would be given regardless of whether the validation system was in place or not.For visualization purposes, we then categorize each of the conversations as shown in Table 4 based on whether or not their validated and unvalidated responses are helpful or unhelpful.For example, if the unvalidated response is labeled as unuseful and the validated response is labeled as useful, the conversation will be categorized as having a positive result.
Note that we do not observe any very negative results in this experiment, so we exclude that category from the rest of our results for simplicity's sake.

Results
After categorizing the conversations, we group them by the prompt given, obtaining Table 5.
From the results of this experiment, we can see that most of the conversations are marked as having a neutral result, followed by those with a positive result. There are few conversations with a negative result; even though they are labeled as negative, however, it is worth pointing out that this does not mean that validation causes the framework to perform worse. It simply means that the validation step returned a false negative and did not help identify a problematic prompt. Only very negative results would indicate that validation is detrimental to the framework.
The data here unfortunately does not give us any indication that the validation step is more likely to improve on problematic prompts than it is to leave them be. However, the results do show that the validation step is more likely to give a positive or neutral result than a negative or very negative result. We can show this by performing a binomial test: if we set our null hypothesis to be "no more than 50% of conversations will have a neutral or positive result", we can use the formula for the binomial test in Equation 2. Setting k to the number of successes (k = 18), n to the number of trials (n = 21), and p0 to the hypothesized probability of success (p0 = 0.5), we find a statistically significant difference, as the P-value is below 0.001.
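As a quick sanity check on these numbers, a one-sided binomial test (matching the "no more than 50%" null hypothesis) can be reproduced with SciPy; the sketch below assumes scipy.stats.binomtest is available.

from scipy.stats import binomtest

# k = 18 conversations with a neutral or positive result, out of n = 21,
# tested against the hypothesized success probability p0 = 0.5.
result = binomtest(18, 21, 0.5, alternative="greater")
print(result.pvalue)  # roughly 7e-4, i.e. below 0.001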
This indicates that the validation step is very likely doing more good than harm, or is at worst not doing anything. While the validation step has much room for improvement, these results show that it can be a useful addition to the framework.

CONCLUSION
In this paper, we presented a framework to help programming students learn more autonomously by facilitating a discussion between them and an LLM such as ChatGPT. The framework, which is intended to be implemented as an IDE plugin, applies pre-processing to prompts submitted by the student before sending them to the LLM. The quality of the responses was shown to be significantly higher than when using the LLM on its own.
There is room for improvement in this framework: for example, post-processing of the responses, more configuration options, and detecting incorrect information contained within responses [1]. In addition, ChatGPT is only one of a variety of LLMs in development; it would be beneficial to see how this framework behaves with other LLMs such as PaLM [3] and LLaMA [9]. This is only the groundwork for what we hope will become a useful tool in the future; we plan to develop it further and hope that others can be inspired by the ideas presented here.

Listing 3: Rephrase Prompt
Can you please rephrase?

Listing 4: Post-Validation Prompt
[Student's Prompt] Please keep helping me, and remember to act as a teacher: don't give me any explicit answers or code.

Listing 5: Sample code used for the Recursion problem
def pow(n, p):
    ret = 1
    for _ in range(p):
        ret = ret * n
    return ret

Listing 6: Prompt Request
Give me a sample response, but pretend you're a student who's [Role]. Do so by completing the statement: "My answer to your question is as follows:"

Figure 2: Number of Useful Responses

Figure 3: Response Categories

Table 1: Problems used in experiments, paraphrased from the problems provided on GeeksforGeeks.

Write a Python program to find the sum of all items in a dictionary.
Swap: Write a Python program to swap the first and last elements of a list.
Recursion: Write a Python program to calculate n to the power of p using recursion.
Includes: Write a Python program to check if any element in one input list includes any element in the other input list.
Matrix: Write a Python program to generate a square matrix (size n × n) having elements ranging from 1 to n². Each number should show up only once. The matrix and all submatrices should have the sum of their opposite corners be even.
Palindrome: Write a Python program to check if a given input string is a palindrome or not.
Tuples: Write a Python program that, given a list of numbers, returns a list that contains all the values in the input list as tuples, where the first element of the tuple is the number and the second element is the cube of the number.

Table 2: Conversation between LLM and Student for the Recursion problem

System: Sure, I can help you with that. Let's start with a question: What is recursion and why do we use it in programming?

Student: Recursion is a programming concept where a function calls itself repeatedly until a certain condition is met. We use recursion when we want to solve a problem by breaking it down into smaller, similar subproblems. By repeatedly solving these subproblems, we eventually arrive at the solution to the original problem. It's a way to tackle complex problems by breaking them down into simpler ones.

System: That's a great explanation of recursion! Now, let's think about how we can apply recursion to solve the problem of finding the power of a number. Instead of using a loop to multiply the number repeatedly, how can we modify our approach to use recursion?

Student: I'm really stuck on how to modify the approach to use recursion. I understand that recursion involves a function calling itself, but I'm not sure how to apply that concept to finding the power of a number. Can you provide some guidance on how to get started with the recursive approach?

System: Of course! Let's break it down step by step. One way to approach this problem recursively is to think about the base case and the recursive case. The base case is the simplest possible input where you don't need to make any further recursive calls. In this case, what would be a logical base case when finding the power of a number? Once you identify the base case, you need to think about the recursive case. In other words, how can you break down the problem into a smaller, similar subproblem and use recursion to solve it? Take a moment to think about these questions, and let me know if you have any ideas.

Table 3: Results from Comparison with Raw Prompts

Table 4: Conversation Result Categories

Table 5: Number of Conversations Grouped by Result Category