A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education

There is a constant need for educators to develop and maintain effective up-to-date assessments. While there is a growing body of research in computing education on utilizing large language models (LLMs) in generation and engagement with coding exercises, the use of LLMs for generating programming MCQs has not been extensively explored. We analyzed the capability of GPT-4 to produce multiple-choice questions (MCQs) aligned with specific learning objectives (LOs) from Python programming classes in higher education. Specifically, we developed an LLM-powered (GPT-4) system for generation of MCQs from high-level course context and module-level LOs. We evaluated 651 LLM-generated and 449 human-crafted MCQs aligned to 246 LOs from 6 Python courses. We found that GPT-4 was capable of producing MCQs with clear language, a single correct choice, and high-quality distractors. We also observed that the generated MCQs appeared to be well-aligned with the LOs. Our findings can be leveraged by educators wishing to take advantage of the state-of-the-art generative models to support MCQ authoring efforts.


INTRODUCTION
Multiple-choice question (MCQ) tests are one of the most popular types of assessment in education [3,52].However, crafting high-quality MCQs that accurately target the intended learning objectives (LOs) requires valuable expertise, is time-consuming and, hence, expensive.This is especially true in technical domains such as computing education where developing effective MCQs poses distinct challenges, e.g., those related to inclusion of pieces of computer code.With changes in technology, growing interest in programming education, and low barriers for students to share past assessments, the demand on instructors to author novel highquality MCQs has never been higher.Recent developments in large language models (LLMs), such as generative pre-trained transformers (GPT), show tremendous potential for addressing this challenge.Leveraging the capabilities of LLMs, educators could potentially (semi-)automate the generation of MCQ assessments.
We developed a novel LLM-based (GPT-4) pipeline to automate generation of MCQs for higher-education Python programming courses.The novelty of our approach lies in making use of a highlevel course context and detailed module-level LOs for producing well-formed high-quality MCQs that use clear language, have plausible distractors, and are well-aligned with the LOs.Since understanding the quality and effectiveness of automatically generated MCQs is of utmost importance we performed a rigorous evaluation of 651 automatically generated and 449 human-crafted questions.If high-quality automated MCQ generation proves feasible it could significantly reduce the time and effort educators currently spend on developing assessments.
To investigate if and how GPT-4 could generate high quality MCQ assessments for higher education programming courses, we analyzed the following research questions: RQ1: To what degree do the generated MCQs meet typical quality requirements?Specifically, do they: (i) provide sufficient information in clear language; (ii) have a single correct answer with (iii) high-quality distractors; and (iv) contain syntactically and logically correct code?RQ2: How well are the generated MCQs aligned with the specified module-level LOs?
By carrying out this work, we provide the following contributions to the computing education research community.To our best knowledge, this is: C1: One of the first studies employing and evaluating LLMs in automatic generation of MCQs for programming classes.C2: One of the first studies generating MCQs not from short pieces of course materials, but from LOs. C3: One of the most extensive (1,100 MCQs) and detailed evaluations of generated MCQs including alignment with LOs.

RELATED WORK
As manually constructing MCQs requires significant effort, researchers have focused heavily on the task of automated question generation.Most work in this field focuses on a specific MCQ element, i.e., the stem (question), the key (correct answer), and the distractors (incorrect options).The most widely researched task is question answering (QA) [2,49]; this is equivalent to key generation, although most work in question answering do not place themselves within this context.The automatic generation of stems is related to generating freeform questions (QG).Initially, these concentrated on reading comprehension tasks, relying on both traditional NLP methods [14] and neural networks [15] to yield meaningful questions [12].More recent systems rely on sequence models [57] or attention-based methods like LLMs [5,7,11,27] (GPT-2 [38], BERT [10]).Additionally, there have been attempts to tackle QA and QG tasks jointly [51].Kurdi et al. provide a comprehensive overview of automated free response and MCQ generation for educational purposes [21].
In contrast to the work on generating stems, current work on automated distractor generation (DG) focuses on cloze tests [16,37,39] and reading comprehension [13,60], as well as MCQs for domains outside of computing such as biology [48].These systems use the stem and/or the key to generate plausible (i.e., similar) distractors.
Often, the systems generate several more distractors than necessary and select the best through a ranking system [25].Given that DG in technical domains poses distinct challenges, research recognizes the potential importance of additional resources such as domainspecific ontologies [20].Recently, LLMs such as BERT [17] or GPT-2 [32] have been utilized in DG.
There is existing work on complete end-to-end MCQ generation.Rodriguez-Torrealba et al. propose an end-to-end pipeline that generates all three elements of an MCQ, with QG, QA, and DG [40], using Google's T5 transformer model.They note that evaluation is difficult, only evaluating 10 questions.Leaf is another T5-powered end-to-end MCQ generation pipeline [55], but has no evaluation.Unlike these systems, which focus on each task separately, our work generates the stem, distractors, and key in a single generation step through GPT-4.Additionally, while existing systems rely on textual context to generate reading-comprehension-like questions, [4,30] we focus on generating a question well-aligned with the provided LO, allowing for easier large-scale evaluation.Cheung et al. adopt a single-pass generation approach using ChatGPT2 to generate MCQs.They rely on medical textbooks to provide context [6], noting that around 40% of the questions are usable as-is without prompt engineering.Finally, Nasution uses the same approach to generate higher education biology MCQs without relying on a reference text, although their analysis focuses on student usage rather than instructor usability, which we focus on [31].Our work is distinguished from [31] by rigorous prompt engineering focused on question quality and alignment, alongside a rigorous evaluation (Section 4).
While the above methods focus on other areas of education, MCQs also play an essential role in computing education.Hence, there have been attempts on efficient generation of MCQs in this area as well.We could not find papers that extend Traynor and Gibson's early attempt [54] to automatically generate MCQs for CS1, as it appears the focus rather shifted towards approaches based on learner-sourcing [8].That is with the exception of the very recent work of Tran et al. who used GPT-3 and GPT-4 to generate isomorphic variations of pre-existing MCQs [53].Our work is most likely the first attempt to generate novel MCQ assessments for higher education programming classes using LLMs.However, there is pre-existing work on answering MCQs with GPT-3 and GPT-4 in computing education [43][44][45].
This has been inspired by many other instances where LLMs have already demonstrated remarkable effectiveness in generating novel language artifacts.These include course readings [22], code explanations [23,28,29,41], model solutions [9,35,42], feedback [34], or responses to help requests [18,26,46,47].The above examples are a testament to the capabilities of the state-of-the-art LLMs in generating grammatically correct, professionally sounding language which makes the artifacts seemingly indistinguishable from the materials generated by human experts.However, such automatically generated learning resources may often lack some of the deep qualities that are necessary for them to be effective [36].Hence, in our work we perform a rigorous evaluation of the generated MCQs.

DATA SET
For the experiments in this paper (Section 5), we assembled a dataset of 246 module-level LOs coming from four Python programming 3  and two introductory data science 4 courses.Three of the courses also contained MCQs (529 in total) that we collected in order to compare them to the automatically generated ones (Section 5).We assumed that these MCQs were created manually.Two of the authors associated each MCQ with the best aligned LO from the corresponding course module.We excluded 51 of the MCQs from our experiments because we could not reasonably assign them to a single corresponding LO.An additional 29 MCQs were excluded from our study because they required more than one correct choice whereas we only focus on MCQs with single-choice keys.Hence, the resulting dataset included 449 human-crafted MCQs.Using the fine-tuned BERT classifier [24] (Section 4 has details), each LO was categorized into one of the six levels of the revised Bloom's taxonomy [19].Table 1 shows the distribution of the LOs across courses and Bloom's taxonomy levels as well as the number of MCQs collected from each course.Note that some of the LOs extracted from the courses were not well-defined (e.g., no action verb) and, hence, it was not possible to categorize them in terms of Bloom's taxonomy.

MCQ GENERATION
Figure 1 shows the overall architecture of the MCQ generation pipeline.To generate an MCQ, we supply only high-level information about the course, the course unit (module), and the targeted LO.We then combine internal MCQ design resources with the user's input into the prompts, which are submitted to GPT-4.The below sections elaborate on each of the steps involved in the MCQ generation pipeline in greater detail.

User Input
The notable feature of the proposed MCQ generation pipeline is that the user is expected to provide only high-level information about the course and the module.This is in stark contrast to the large majority of other MCQ generation systems described in Section 2. While those systems use a piece of text (e.g., a paragraph from a textbook) to generate an MCQ, our system utilizes a specific module-level LO to generate an MCQ that is well-aligned with that LO.This allows us to more carefully align MCQ generation with assessment of student achievement of the intended LO.Besides the specific module-level LO and the module it comes from, the user is expected to provide a course title (see Table 1 for titles of the courses used in this study), a short course description with the list of course-level LOs, and the list of course modules.Using this particular context as well as following the best practices for prompting an LLM [6,31] to generate an MCQ are the prominent features of the presented pipeline.

Design Resources
We curated a set of static resources to support various stages of effective MCQ generation.We focused on the following elements: • MCQ principles -A set of research-validated and generally accepted principles that guide authoring of high-quality MCQs.For example, this includes that distractors should be plausible and limited in number (often just two).We only provided a concise description here but the MCQ Principles section of Figure 2 shows an extensive excerpt.• Bloom's taxonomy -Definitions of the six levels of the revised Bloom's taxonomy [19], i.e., remember, understand, apply, analyze, evaluate, and create.This taxonomy helps educators articulate LOs that focus on concrete actions and behaviors, and target distinct levels of cognitive processes.For LOs to guide the selection of assessments, they must be measurable, i.e., it should be possible to evaluate whether learners attained the intended LO.We also include information on the aims and uses of Bloom's taxonomy.The Bloom's taxonomy section of Figure 2 shows an extensive excerpt.• Question type system -From an informal analysis of the collected dataset of human-crafted MCQs (Section 3), we define five types of programming MCQs: recall, fill-in the blank, identify correct  Providing an LLM with this type of information to create effective MCQs is one of the core contributions of this work.Unlike Leaf [55], our methodology allows for training-free generation of questions and distractors, ensuring effective generation without necessarily worrying about out-of-domain generalization.It is reasonable to assume that this holds for mainstream domains, such as introduction to programming, where plenty of related materials were included in the dataset used for pre-training of the LLM.However, this approach may not be directly applicable in highly specialized domains with less abundant content.

LO Bloom's Taxonomy Level Classifier
We fine-tuned a binary BERT classifier [24] for each Bloom's Taxonomy category on 21,380 LOs from 5,558 university courses.BERT (bidirectional encoder representation from transformers) [10,56] is a popular LLM notable for its fine-tuning capabilities on a downstream task.We used the models to predict the Bloom's Taxonomy level (i.e., remember, understand, apply, analyze, evaluate, or create) of the generated LOs.The prediction is then utilized in the LO Mapping to Question Types step of the pipeline (see below), and it is also embedded in the user message part of the prompt as shown in Figure 3.

LO Mapping to Question Types
We use the automatically predicted Bloom's taxonomy level of the provided LO (see above) to determine the appropriate MCQ types for the LO (see Table 2).We map each taxonomy level to one or more question types and a question is generated for each type: • Recall -Given minimal to no code, students are asked about basic programming concepts or technical details.• Fill in the Blank -Given a code snippet with some sections removed, students are asked to select an option that successfully replaces the blank to create syntactically and semantically correct code.• Scenario Based -Given a scenario or situation, students are asked to identify appropriate tools, methods, or packages best suited to accomplish the task prescribed.• Correct Output -Given a code snippet, students are asked to trace program execution to determine either intermediate or final outputs.• Code Analysis -Given a code snippet, students are asked to spot errors or to build on or use the code in a new manner.
These question types allow us to target specific levels of cognitive processes, e.g., preventing the generation of MCQs involving complex code tracing for LOs asking students to remember a concept.Finally, high-quality MCQ examples are selected based on the question type in order to ensure that the proper cognitive processes are targeted in accordance with the provided LO.The corresponding examples of the high-quality MCQs of that type are retrieved from the design resources.

Prompt
GPT-4 utilizes the so-called system part of the prompt to provide the overall context of the conversation.From the design resources, we retrieve a set of MCQ principles and best practices.We provide the definition of the predicted Bloom's taxonomy level, examples of the selected question type, the course description, and the expected output format of the generated MCQ as well.Figure 2 shows an extensive example of the system part of the prompt.
The user message part of the prompt is the piece of text to which the model is expected to respond (within the context established You are a learning engineer support bot focused on creating top quality multiple-choice question assessments.
A multiple-choice question is a collection of three components aimed at testing a student's understanding of a certain topic, given a particular context of what the student is expected to know.The topic, as well as the context of the topic, will be provided in order to generate effective multiple-choice questions.The three components of a multiple-choice question are as follows: a Stem, a Correct Answer, and two Distractors.There must always be only one correct answer and only two distractors.
The stem refers to the question the student will attempt to answer, as well as the relevant context necessary in order to answer the question.It may be in the form of a question, an incomplete statement, or a scenario.The stem should focus on assessing the specific knowledge or concept the question aims to evaluate.
The Correct Answer refers to the correct, undisputable answer to the question in the stem.
A Distractor is an incorrect answer to the question in the stem and adheres to the following properties.
(1) A distractor should not be obviously wrong.In other words, it must still bear relations to the stem and correct answer.
(2) A distractor should be phrased positively and be a true statement that does not correctly answer the stem, all while giving no clues towards the correct answer.
(3) Although a distractor is incorrect, it must be plausible [...] (4) 4. A distractor must be incorrect.It cannot be correct, or interpreted as correct by someone who strongly grasps the topic.
[...] Use "None of the Above" or "All of the Above" style answer choices sparingly.These answer choices have been shown to, in general, be less effective at measuring or assessing student understanding.
Multiple-choice questions should be clear, concise, and grammatically correct statements.Make sure the questions are worded in a way that is easy to understand and does not introduce unnecessary complexity or ambiguity.Students should be able to understand the questions without confusion.The question should not be too long, and allow most students to finish in less than five minutes.This means adhering to the following properties.
(2) Avoid code that is longer than 20 lines for questions, and longer than 10 lines for the correct answer and distractors.
(3) If you refer to the same item or activity multiple times, use the same phrase each time.(4) Ensure that each multiple-choice question provides full context.In other words, if a phrase or action is not part of the provided topic or topic context that a student is expected to know, then be sure to explain it briefly or consider not including it.(5) Ensure that none of the distractors overlap.In other words, attempt to make each distractor reflect a different misconception on the topic, rather than a single one, if possible.( 6) Avoid too many clues.Do not include too many clues or hints in the answer options, which may make it too obvious for students to determine the correct answer.These options should require students to use their knowledge and reasoning to make an informed choice.[...] Blooms' Taxonomy and Action Verbs: Multiple-choice questions must be well aligned to the learning objectives they are intended to assess students' knowledge on.This implies that they must assess skills at the right cognitive level corresponding to the Bloom's taxonomy categorization of the learning objective.Bloom's Taxonomy offers a framework for categorizing the depth of learning, and it provides guidance on selecting appropriate action verbs when writing learning objectives.Here are the six levels of Bloom's taxonomy and their definitions: • Remember -This level involves retrieving, recognizing, and recalling relevant knowledge from long-term memory.
• Understand -At this level, learners construct meaning from oral, written, and graphic messages through interpreting, exemplifying, classifying, summarizing, inferring, comparing, and explaining.
• Apply -This level requires learners to carry out or use a procedure through executing or implementing it.
• Analyze -At this level, learners break material into constituent parts, determine how the parts relate to one another and to an overall structure or purpose through differentiating, organizing,[...] • Evaluate -This level involves making judgments based on criteria and standards through checking and critiquing.
• Create -At this level, learners put elements together to form a coherent or functional whole, or they reorganize elements into a new pattern or structure through generating, [...]

Course Context
Below is a brief description of the Practical Programming with Python course.
Description: Students learn the concepts, techniques, skills, and tools needed for developing programs in Python.Core topics include types, variables, functions, iteration, conditionals, data structures, classes, objects, modules, and I/O operations.Students get an introductory experience with several development environments, including Jupyter Notebook, as well as selected software development practices, such as test-driven development, debugging, and style.Course projects include real-life applications on enterprise data and document manipulation, web scraping, and data analysis.

Question type: Recall
A recall multiple-choice question often contains minimal code, if at all, in its stem.They may assess a student's understanding of basic programming concepts or include some technical details.It should be conceptual while containing specific knowledge of the course content and learning objectives.
In the context of an Introductory Programming with Python course, these questions typically ask about Python syntax and principles, built-in functions, or standard libraries.They may also evaluate students' understanding of fundamental programming concepts such as coding conventions and object-oriented programming (OOP) principles.

Output Format
Output your multiple-choice question in an easy-to-parse json dictionary format, where the stem is the key, and the correct answer and distractor choices are values.Be sure to clearly distinguish which choice is the correct answer and which are distractors.The question generated should have exactly 2 distractors and 1 correct answer (3 choices in total).If there is code in the stem, please set "code_in_stem" to True.If there is no code in the stem, set "code_in_stem" to False.
by the system part of the prompt).We include the user-provided course name and the course module, the specific module-level LO, the predicted Bloom's taxonomy level, and subsequently mapped question type into the user prompt.Figure 3 shows an extensive example of the user message part of the prompt.

MCQ Generation Step
We use GPT-4 (gpt-4-0613), which is one of the most advanced LLMs released by OpenAI as of the writing of this paper [33].As for the model parameters, we set temperature to its default value of 1.0.We keep top_p, frequency_penalty, and presence_penalty at their default values of 1.0, 0.0, and 0.0 respectively.We impose a max_token limit of 2,000 tokens (i.e., the maximum length of a generated MCQ).
As the response to the prompt, consisting of the system part and the user message part, the GPT-4 model generates an output in the expected JSON format.The response contains the stem, the key, and the distractors, i.e., all the needed constituents of an MCQ.One of the choices is marked as correct.

EXPERIMENTAL DESIGN
To evaluate the quality of the automatically generated MCQs, we generate several MCQs for each of the 246 LOs in our dataset.We generate a single question per question type mapped to the LO's Bloom's taxonomy level as described in Table 2.When we could not assign a Bloom's taxonomy level we simply generate MCQs of all types.Following this process we obtain a sizeable dataset of 651 automatically generated MCQs.Table 2 shows the distribution of the MCQs per both their question type and Bloom's taxonomy level.
We developed a rubric consisting of six criteria shown in Figure 4.The first five criteria of the rubric target RQ1, while the last criterion targets RQ2.Seven students and six CS instructor annotators (all authors of this paper) applied the rubric to the generated (651) as well as human-crafted (449) MCQs, i.e., 1,100 MCQs in total.and Fleiss  scores for each of the six rubric items (Figure 4).

MCQ-Type RMB UND APP ANL EVL CRT N/A Total
Each annotator was asked to complete 250 annotations 5 .We required annotators to attempt answering the question before applying the rubric items.The overall inter-rater agreement in terms of Fleiss  was 0.22 which corresponds to a fair agreement.The  statistic is known to severely underestimate the agreement in situations when one of the labels is clearly dominant [59].This is the case of our data where for each rubric item there is always a clearly dominant answer.To address this issue we also compute Gwet's AC1 which is robust in such situations [58].Table 3 reports the AC1 and  scores for each of the six rubric items.The measured Gwet's AC1 range from 0.62 to 0.96.Each MCQ was annotated by at least one student and one instructor.Overall, 3,076 annotations for the 1,100 MCQs were produced, i.e., a little less than 3 annotations per MCQ on average.To resolve disagreements we used the following rules in the presented order: (1) Majority Vote; (2) Instructors' annotations had precedence over students'; (3) An annotator who correctly answered the MCQ had precedence over the one who answered incorrectly; (4) The least favorable evaluation.
Then, the automatically generated MCQs were compared to the human-crafted ones in terms of the six categories from the developed rubric.To test for statistically significant differences we employ Fisher's exact test (extended to multiple categories through approximation).The reported ratios are based on the counts of the labels after the disagreement resolution described in Section 5.

MCQ Quality (RQ1
Figure 4 shows the results of the experiments described in Section 5.The first five criteria relate to RQ1.From those, we observe that the generated MCQs appear to be of comparable quality to the human-crafted ones.A notable issue is that 4.9% of the automatically generated MCQs had multiple correct answer choices (compared to only 1.1% of the human-crafted MCQs).The quality of the distractors appears to be another rubric item where the MCQ generation pipeline struggles.GPT MCQs were more likely to be annotated as having distractors which gave away the correct answer (4.0% vs 0.9% human).We performed the Fisher exact test to test for statistical significance.The presence of the correct answer ( = 0.002) and the presence of obviously wrong options ( = 0.002) criteria had a statistically significant difference between automatic and human generated questions.This corresponds to the above described differences related to the presence of multiple correct choices in the automatically generated MCQs and distractors which gave away the correct answer.Hence, we conclude that the generated MCQs provide sufficient information in clear language (RQ1.i) and contain syntactically and logically correct code (RQ1.iv).That is they do not appear to differ from the human-crafted MCQs in terms of the mentioned quality criteria.On the other hand, the generated MCQs are somewhat lacking when compared to the human-crafted ones when it comes to having a single correct answer (RQ1.ii) and high-quality distractors (RQ1.iii).

MCQ-LO Alignment RQ2
The results of evaluating RQ2 are also shown in Figure 4-the last rubric item.We observed that the automatically generated MCQs appear to be noticeably better aligned with the LOs as compared to the human-crafted ones.This is confirmed by the Fisher exact test as well ( < 10 −9 ).It appears that human generated MCQs were quite often related to the LOs but did not target the appropriate cognitive level (Bloom's taxonomy) or simply were too different to be considered a viable assessment for the LO (20.5% vs 12.0% auto-generated).Often, the human-crafted MCQs did not relate to the LOs at all (12.0%vs 4.8% auto-generated).This finding is discussed in great detail below.

DISCUSSION
Generated MCQs were more likely to have multiple correct answers.This was mostly observed in MCQs generated for the LOs at the Apply and Create cognitive levels of Bloom's Taxonomy.Consider the example shown in Figure 5.While the answer choices comprise three different methods of converting a string to a list and reversing it, all three of them accomplish the same task with no side-effects.Hence, all of the options are correct.Note that only the first one was marked as correct by GPT-4.While not too prevalent Given the string s = "Hello, World!" you need to create a piece of code that will return a list of the words in s but in reversed order while preserving the original string.Which one of the following code snippets achieves this goal? A. reversed_s = s.split('') reversed_s.reverse()B. reversed_s = ".join(reversed(s))reversed_s = reversed_s.split('') C. reversed_s = s.split('')[::-1] Figure 5: An example generated MCQ where all the choices are correct answers.Only the first choice (A.) was marked as correct by GPT-4.
In Python, the _____ loop is used when we want to iterate over a sequence (like a list, tuple, set, or string) or other iterable objects.Iterating over a sequence is called traversal.for item in iterable: # execute some statements A. for B. while C. do  Compared to the human-crafted ones, the generated MCQs were more likely to have obviously-wrong choices.This was the most pronounced in the Fill-in-the-Blank and Scenario Based MCQ type.Of the MCQs that were annotated as having options that "give away the correct answer", the majority gave the answer away in the question stem.This issue is somewhat less pressing than the one described earlier as it does not render the MCQ completely invalid.While ideally an instructor would recognize the issue and edit the MCQ this cannot be relied on.Figure 6 shows an example MCQ where the code snippet included in the stem gives away the correct answer.
GPT-4's ability to produce effective MCQs was close to human performance.In MCQs where GPT failed to meet a quality requirement this was usually the only issue with the question.Based on our analysis, the most serious issue with the generated MCQs appear to be implausible or plainly incorrect distractors along with revealing the correct answer in the question stem.Future work should focus on prompt engineering techniques to mitigate these issues.Importantly, 81.7% of all the generated MCQs passed all of the evaluation criteria.This suggests that less than 1 in 5 generated questions would require instructors' edits.
The proposed pipeline is designed to generate a single MCQ for a single LO, and in the vast majority of cases, the MCQ was well-aligned with the LO.We observed that the alignment of the generated MCQs was vastly superior compared to the alignment Learning Objective Discuss the importance of writing comments and how to write correct comments.

Generated MCQ
In Python programming, why is it important to write comments in your code? A. Comments are functional elements of the code that can affect program execution.B. Comments help to improve the readability and maintainability of the code by explaining the function and intention of parts of the code.C. Comments are used to debug if there is an error in the program execution.

Human-crafted MCQ
Clean code is easy to maintain.A. True B. False Figure 7: An example where the generated MCQ is better aligned with the LO than the human-crafted one which only appears to target the module's topic and not the LO itself.
of the human-crafted ones (e.g., Figure 7).This is likely due to the fact that educators may often focus more on the alignment of an MCQ with a module's topic and less on the alignment with an LO.Achieving the alignment between LOs and assessments is often challenging for educators.This provides an excellent motivation for a support system such as the one presented in this paper.
During our analysis, we considered Bloom's Taxonomy, MCQtype, LO, and course title as potential factors for MCQ-LO alignment.We observed that relative to the other courses, MCQs generated from the LOs in the Python Essentials 1 course scored slightly worse on alignment.We found that the failure cases for GPT MCQ alignment occurred independently of these variables.While our results are promising, there are some caveats.As we scraped MCQs from the internet, we needed to manually associate MCQs with LOs.This likely explains some part of the observed difference in MCQ-LO alignment between the generated and human-crafted MCQs.
When evaluating the MCQs the human raters were asked to answer the questions.While such a setup does not replace the evaluation of the questions in real classroom settings it may provide tentative insights into the difficulty of the generated MCQs compared to the human-crafted ones.Overall, the human-crafted MCQs appear to be more challenging than the generated ones.The student raters answered 71.5% of the generated MCQs correctly (62.6% human-crafted) while the instructor raters handled 80.1% of the generated questions successfully (76.3% human-crafted).It is important to emphasize that the raters were not instructed to put effort into answering the questions correctly.Hence, the reported rates need to be treated with caution.

IMPLICATIONS FOR TEACHING PRACTICE
Our results suggest that LLM-powered tools can generate MCQs that have comparable quality to MCQs generated by humans while achieving better alignment with LOs.Programming instructors can use LLM-powered tools to reduce their workload, enabling them to focus more on student engagement and curriculum enhancement.Deploying such tools could make updating and revising assessments more efficient.
Of particular note is the finding that automatically generated MCQs seem to be better aligned with the LOs than those generated by humans.This could mean that use of automatically generated MCQs might provide more accurate representations of students' mastering of the intended LOs, potentially leading to more accurate evaluations.Moreover, it indicates that LLM-powered tools could be used not only for assessment generation but also as an invaluable tool for curriculum design and planning [50].
Given the potential of automated MCQ generation, teaching practice could benefit enormously from incorporating the LLM-based tools into the assessment design process.As with the introduction of any novel tool, care should be taken to correctly deploy and use such resource, and to handle its output responsibly (e.g., editing generated MCQs that have issues).

LIMITATIONS AND THREATS TO VALIDITY
While we attempted to make the assessment of the quality of automatically generated MCQs as objective as possible via a welldefined rubric, we acknowledge potential bias from the limited pool of human raters.Additionally, the human generated questions were created before the study and, hence, the instructors authoring the MCQs were not aware of the rubric that we used to evaluate the questions in this study.It is plausible that if the rubric was available to them they could have authored the MCQs to better satisfy the requirements.Finally, as we manually paired humancrafted MCQs to LOs, it is possible we ourselves caused part of the observed MCQ-LO misalignment.
Our study has not evaluated the utility of auto-generated MCQs in live classrooms, and, hence, their pedagogical impact is not known.Although, the LLM-powered system was able to generate questions that seem comparable to those written by humans, it remains to be seen how effective these auto-generated MCQs are at assessing student learning.Additionally, we did not compare the difficulty of the generated MCQs to that of the human-crafted ones.We focused on the alignment between an MCQ and a corresponding LO.However, we did not consider the alignment between an MCQ and the learning content.These important investigations are left for future work.
Our research focused only on Python programming courses at the higher education level, which may limit the generalizability of the results to other programming languages or education levels.While Python is widely used in introductory programming courses, it is possible that LLMs may perform differently when generating MCQs for other languages or more specialized domains.Since the evaluated approach relies on the LLM's "knowledge" of the domain it is quite likely that it would not generalize well to highly specialized domains.

CONCLUSIONS AND FUTURE WORK
This paper provides promising evidence suggesting that systems powered by LLMs can automatically generate high-quality MCQs for Python programming courses.The generated MCQs are comparable in quality to those created by human educators (RQ1) and exhibit strong alignment with LOs (RQ2).Overall, this study demonstrates the feasibility of high-quality automated MCQ generation that has the potential to significantly reduce the time and effort educators currently spend on developing assessments Future work needs to explore the effectiveness of LLM-generated MCQs within real-world classroom contexts.This will allow a better understanding of the uptake, utility and discriminative power of generated MCQs.Additionally, future work needs to be done to better understand the novelty and diversity of MCQs generated with LLMs.Currently, it is not easy to understand how capable this system is at generating entire quizzes and entire quiz pools for particular LOs.Furthermore, we do not assess the refinement capabilities of GPT-4 for this task.Given that GPT-4 is a chat completion model, providing further feedback should lead to enhanced generation, and we do not study these capabilities.There is also the need to understand the effects of different parameter settings, such as temperature [1].Lastly, extending this work to other programming languages and determining the effects on MCQ quality and alignment with LOs would be an important avenue to explore.

Figure 1 :
Figure 1: MCQ Generation Pipeline.The diagram describes the automatic MCQ generation process flowing from left to right.The input provided by the user is processed-the Bloom's taxonomy level of the provided LO is predicted which leads to determination about the appropriate type of the questions and corresponding MCQ examples.These are combined with the user input and resources from the design resources to form the system and user prompts that are submitted to the GPT-4 model outputting the generated MCQ (stem, key, distractors).

Figure 2 shows
additional details (Output Format section).
[...] If any of the multiple-choice items contain code, please format the code snippet as shown below: ```python def test(): return "Correct Format" ``F igure 2: The System Part of the Prompt.The figure shows extensive excerpts from the system part of the prompt showing the main constituents: MCQ Principles, Bloom's Taxonomy, Course Description, Question Type Examples, and Output Format.The colored stripes on the left and the colored badges match the colors of the pipeline constituents from Figure 1.The [...] tokens mark places where the text has been abridget to fit on the page.The purple text is dynamic (data dependent).
Generate a top quality Recall multiple-choice question for the course Practical Programming in Python on the unit: Python Basics and Introduction to Functions.Your generated question should target the following learning objective: Explain what Python is and how to use it to run single-line expressions as well as small multi-line programs.Your generated question should also be at the understand level in Bloom's taxonomy.

Figure 3 :
Figure 3: The User Message Part of the Prompt.The figure shows an example user message part of the prompt showing the main constituents: Question Type, Course Name, Module (unit) Name, Learning Objective, and Bloom's Taxonomy Level.The colored stripes on the left and the colored badges match the colors of the pipeline constituents from Figure 1.The purple text is dynamic (data dependent).

Figure 4 :
Figure 4: MCQ Evaluation Results.The MCQ evaluation rubric is shown on the left.On the right, the automatically generated MCQs (GPT) are compared with the human-crafted ones.The top part of the figure shows the results for the five rubric items focused on the quality of the MCQs (RQ1).The bottom part shows the rubric item addressing the LO-MCQ alignment (RQ2).The reported ratios are based on the counts of the labels after the disagreement resolution described in Section 5.

Figure 6 :
Figure 6: An example generated MCQ where the code snippet included in the stem gives away the correct answer (A).

( 4 .
9% of generated MCQs), this issue is serious and requires human intervention to be fixed.Future work should focus on mitigating it.

Table 1 :
Dataset of manually created MCQs.The first three columns list the names of the courses, the number of modules per course, and the number of MCQs.The remaining columns report the number of LOs from different levels of Bloom's taxonomy (RMB -remember, UND -understand, APP -apply, ANL -analyze, EVL -evaluate, CRT -create, N/A -unassigned).

Table 2 :
Automatically Generated MCQs.The table shows the distribution of the automatically generated MCQs per question types (rows) and the levels of Bloom's taxonomy (columns).