From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams

A final exam in machine learning at a top institution such as MIT, Harvard, or Cornell typically takes faculty days to write, and students hours to solve. We demonstrate that large language models pass machine learning finals at a human level on finals available online and automatically generate new human-quality final exam questions in seconds. Previous work has developed program synthesis and few-shot learning methods to solve university-level problem set questions in mathematics and STEM courses. In this work, we develop and compare methods that solve final exams, which differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We curate a dataset and benchmark of questions from machine learning final exams available online and code for answering these questions and generating new questions. We show how to generate new questions from other questions and course notes. For reproducibility and future research on this final exam benchmark, we use automatic checkers for multiple-choice, numeric, and expression answers. A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written questions shows that across multiple aspects, machine-generated questions are indistinguishable from human-generated questions and are suitable for final exams. We perform ablation studies comparing zero-shot learning with few-shot learning and chain-of-thought prompting using GPT-3, OPT, Codex, and ChatGPT across machine learning topics and find that few-shot learning methods perform best. We highlight the transformative potential of language models to streamline the writing and solution of large-scale assessments, significantly reducing the workload from human days to mere machine seconds.
Our results suggest that rather than banning large language models such as ChatGPT in class, instructors should teach students to harness them by asking students meta-questions about correctness, completeness, and originality of the responses generated, encouraging critical thinking in academic studies.


INTRODUCTION
Can a machine learn machine learning and help teach it? We automatically solve, explain, and generate new questions in machine learning courses, reducing the effort of instructors and TAs from days to seconds. We evaluate students on meta-questions that validate these models' correctness, completeness, and originality.
This work presents a dataset of machine learning final exams with 646 question parts and a benchmark of baselines using a variety of language models with different prompting schemes. The best of these performs at a human level. In university-level STEM courses, students complete assignments, including problem sets and labs, and exams throughout the course. Recent work has developed accurate methods to solve STEM problem sets [4] using language models and few-shot learning. However, final exams remain challenging: final exams test the cumulative understanding of material learned over a semester and evaluate the students' depth and breadth of expertise. This work is the first to present a structured dataset of machine learning finals and a benchmark of baseline methods for answering them.
Final exams differ from problem sets in several ways, and the experience of solving each varies. First, finals are long, containing around nine questions with around seven parts each. Final exam questions are also multifaceted and multi-stepped: different parts of a single question require applying different concepts and problem-solving skills, and parts may build upon each other. While weekly problem sets focus on a single topic, finals span topics from the entire semester. Further, final questions are often story-based problems that may require mathematical modeling. Due to the time constraint of these exams, finals are also designed to test core understanding and application of course material over rote calculations or hand computations, which are more often found on problem sets that help students build up a theoretical intuition for that week's material. Thus, asking a machine to answer questions from finals allows for testing whether the model is able to learn a breadth and depth of topics beyond problem sets.
One contribution of this work is the dataset of final questions derived from three courses: Introduction to Machine Learning at MIT and Cornell and Machine Learning at Harvard. These are undergraduate courses with hundreds of students each semester, making them the largest undergraduate courses offered. At MIT, Introduction to Machine Learning is a core class in the computer science program. The prerequisites for the course are Linear Algebra and Fundamentals of Programming or Introduction to Algorithms. The class typically consists of weekly exercises, labs, quizzes, homework, a midterm, and a final exam. The final exam questions have many parts, each posing a new problem, and each question in the dataset corresponds to one part. The questions test a variety of topics and require different solution types. The questions are primarily open-ended, with some true/false and multiple-choice questions on theory, math, and code implementations. Due to the diversity of the final questions, our dataset uniquely assesses advanced problem-solving and reasoning skills in machine learning, math, and natural language processing.
We propose several baseline methods that solve these problems, by zero-shot and few-shot learning using GPT-3, Codex, OPT, and ChatGPT, and adding chain-of-thought prompting. We find that few-shot learning methods perform best. As shown in Table 1, the best-performing methods pass the final exams, and their grade is comparable with human grades of students on the same machine learning finals evaluated by the same human graders. We verify that our results are not due to overfitting by testing the method on three finals that became available online after the model was trained. We generate new final exam questions indistinguishable from human-written questions from other questions and course notes.
In summary, the key contributions of our work are: (1) a new dataset of machine learning final exams; (2) a benchmark and baseline methods for answering final exam questions; (3) a comparison of different methods and their solve rates; (4) automatic grading of checkable question types, forming a usable benchmark; (5) generation of new questions from other questions; (6) generation of new questions from course notes; (7) development of meta-questions about correctness, completeness, and originality; and (8) a survey showing generated questions are indistinguishable from human-written questions.

Related Work
It is often thought that humans are generalists, whereas machines are specialists. However, large language models such as GPT-3 [1], Gopher [11], PaLM [3], BLOOM [12], and ChatGPT [9], also called foundation models, are generalist learners. Specifically, in our setting, while humans care about the number of topics in an exam and therefore find finals more difficult than problem sets, foundation models effortlessly scale to many topics without re-training. Language models may be pre-trained on text and fine-tuned on specific datasets; for example, OpenAI's Codex [2] and ChatGPT [9] allow for generating programs and answering questions from text at a human level. There are several ways to improve the mathematical reasoning ability of language models: (1) using chain-of-thought (CoT) prompting [6, 17], (2) using the top-k ranking solutions [7] and merging them by voting [16] or least-to-most prompting [18], and (3) using program synthesis and few-shot learning to generate code that answers questions [4]. Much of the prior work focuses on high school or middle school level material [10]. The first work to tackle university-level machine learning course problem-set questions [15] used a transformer and GNN architecture and relied heavily on data augmentation; this resulted in overfitting and did not scale up to other types of questions or courses. Probability and statistics course problem-set questions have been answered [14] by probabilistic program synthesis at human performance. Problem-set questions from core university math courses [4] have been automatically solved at a human level using few-shot learning and program synthesis. Other work considers university-level course questions across a variety of domains [5] and identifying theorems [13]. Prior work on question generation includes question-answer pair generation based on a text passage [10] and question text generation based on other questions [4].

DATASET
We present a new dataset of 646 question parts from a dozen final exams of MIT's and Cornell's Introduction to Machine Learning courses and Harvard's Machine Learning course. The dataset covers MIT finals from Fall 2017, Spring 2018, Fall 2018, Spring 2019, Fall 2019, Spring 2021, Fall 2021, and Spring 2022, Harvard finals from Spring 2015 and Spring 2021, and Cornell finals beginning with Spring 2017. The breakdown of questions, parts, points, and non-image points by each semester and topic is shown in Tables 2 and 3. Each question in a final exam consists of multiple parts. Questions are written by providing set-up and context information first, followed by the question parts (which may come with additional information). Set-up and context information may contain (1) story elements (e.g., character names and motivations), (2) relevant definitions and equations, and (3) data points. We format questions in the dataset by concatenating the question context, any context or solutions from prior parts of the question required for answering the part, and the part's context and question. We split the questions into their corresponding parts. Questions consist of text, mathematical notation, and images. Mathematical notation is represented in the dataset by LaTeX and images by screenshots from PDF documents. The types of question answers are diverse. A few are multiple-choice or true/false questions. Most are open-ended, for which the evaluation requires modeling the problem, mathematical manipulation, or writing code. Many questions require explaining the answer.
We use twelve final exams from different semesters for data curation. The PDF files of the exams are publicly available online. We use a tool [8] for an initial transcription; curators then evaluate and manually correct the input questions and verify the correctness of each input question.
We extract questions and solutions for all parts of all types of questions, including those that rely on images. We curate nine exams from publicly available PDF documents. The three MIT exams from 2021 and 2022 postdate the models' training, so the models cannot have overfit their solutions. The aggregate average grades are available to the students and do not contain any personally identifiable information. There are three duplicate questions between exams, which we keep.
GPT-3 zero-shot uses the question as-is, whereas GPT-3 zero-shot with CoT appends the suffix "Let's think step by step." after the question to encourage multi-step output. Codex zero-shot uses the prefix "Write a program that answers" before the question within Python comments denoted by triple quotes """ to encourage Codex to write code. GPT-3 few-shot finds the closest questions in the embedding space, measured by cosine similarity, and uses them and their corresponding answers before the new question as examples in the prompt. Codex few-shot finds the closest questions in the embedding space, as measured by cosine similarity, and uses these questions and their corresponding code as examples.
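As a concrete illustration, the few-shot retrieval step can be sketched as follows. This is our own minimal sketch, not the paper's code: the data layout and function names are illustrative, and the embedding vectors would in practice come from a language-model embedding endpoint rather than be hand-constructed.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_few_shot_prompt(new_q, bank, k=3):
    """Retrieve the k nearest question-answer pairs from `bank` (a list of
    dicts with 'question', 'answer', 'embedding') by cosine similarity to
    `new_q` and assemble a few-shot prompt ending at the new question."""
    ranked = sorted(
        bank,
        key=lambda ex: cosine_similarity(ex["embedding"], new_q["embedding"]),
        reverse=True,
    )
    parts = [f"Question: {ex['question']}\nAnswer: {ex['answer']}\n"
             for ex in ranked[:k]]
    # The prompt ends with an unanswered question for the model to complete.
    parts.append(f"Question: {new_q['question']}\nAnswer:")
    return "\n".join(parts)
```

For Codex few-shot the same retrieval applies, with each example's answer replaced by its corresponding code.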
For students, a good study technique is to use previous final exams to review and practice for their upcoming final. We model this method by few-shot learning using the question-answer pairs (for GPT-3) or question-code pairs (for Codex) with the closest question embeddings from previous finals. We implement this by considering all the exam questions, marking each question by its semester and year, and using only previous semesters' questions for few-shot learning. The MIT Fall 2017 and Spring 2022 exams contain three duplicate questions; we handle these the same way humans would, allowing few-shot learning in MIT Spring 2022 based on successful Fall 2017 zero-shot answers. It is reasonable that if a student studies all previous exams, there may be 8.5% of repeated question points. Since MIT Fall 2017, Harvard Spring 2015, and Cornell Spring 2017 are the first final exams at the corresponding universities, we do not perform few-shot learning on these.
3.1.1 Comparison with Open Language Models. We also evaluate our dataset on an open-source language model, Meta's OPT-175B, a model consisting of 175 billion parameters. Tables 4 and 5 compare the results of GPT-3, ChatGPT, Codex, and Meta's OPT. We evaluated OPT on only 163 question parts since OPT was limited to handling questions under 256 characters in length. We implement inference for the OPT-175B model using Alpa, a framework designed for training and serving large models. For the hardware, we use an 8x A100 PCIE cluster. The model requires about 560 GB of VRAM in our run, and each example takes nine minutes for inference.

Grading
3.2.1 Human Grading. The questions are of different types: multiple-choice, numerical, expression, and open-text. We grade answers and aim to keep all factors equal when grading human and machine answers. Human and machine answers are graded based on the number of points allocated to each question part, giving complete, partial, or no credit for each answer. We approximate partial credit by assigning half credit. The course staff, consisting of graduate TAs and instructors, graded the student final exams. The same graduate TAs and instructor that graded the student answers also graded the machine answers. Grading instructions are the same for student answers as for machine answers.

3.2.2 Automatic Grading. We label each question's answer type with one or two categories out of four options: multiple-choice (MC), numerical, expression, or open. We consider answers multiple-choice if the test-taker is presented with an enumerated list of choices, numerical if the answer is a number, expression if the answer includes variables or other notation, and open if the answer calls for free-response text. We categorize questions that have additional questions nested within them by the multiple relevant categories. This is often the case when a question with one of MC, numerical, or expression answers is followed by a follow-up question asking the student to explain their previous answer. The breakdown of the questions is as follows: 98 are multiple-choice, 84 are numerical, 81 are expressions, and 204 are open. The 'Non-Open Points' column of Tables 6 and 7 shows the answer type breakdown by the number of points. Table 6 shows the number of question parts that do not rely on images, the number of points that do not rely on images, and the number of non-open question points in Introduction to Machine Learning finals for each semester. Table 7 shows the breakdown by topic. Our automatic grading uses string matching and regular expressions. In the case of multiple-choice results, we check that the output of the code is equal to the solution. In the case of numerical answers, we look for a matching integer or real number.
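A hedged sketch of such a checker for the two simplest answer types follows; this is our own illustrative implementation of the string-matching and regular-expression approach, not the paper's exact code.

```python
import re

def check_answer(predicted, solution, answer_type):
    """Automatically grade a checkable answer.
    predicted, solution: strings; answer_type: 'mc' or 'numerical'.
    Expression and open answers require separate handling."""
    pred = predicted.strip().lower()
    sol = solution.strip().lower()
    if answer_type == "mc":
        # Multiple choice: exact (case-insensitive) match on the chosen option.
        return pred == sol
    if answer_type == "numerical":
        # Numerical: extract integers or real numbers from the model output
        # and accept if any matches the solution within a small tolerance.
        numbers = re.findall(r"-?\d+\.?\d*", pred)
        target = float(sol)
        return any(abs(float(n) - target) < 1e-6 for n in numbers)
    raise ValueError("only 'mc' and 'numerical' answers are auto-gradable here")
```

Expression answers would additionally need symbolic equivalence checking, which is why they are listed as a separate category.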
From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams KDD '23, August 6-10, 2023, Long Beach, CA, USA

Performance
Table 4 shows the machine grades by semester, and Table 5 shows the machine grades by topic, excluding question parts that rely on images. We compare the average grade (out of 100) of GPT-3 with zero-shot (ZS), GPT-3 with ZS and chain-of-thought (CoT) prompting, GPT-3 with few-shot (FS) learning, GPT-3 with FS and CoT prompting, Codex with ZS, Codex with FS, ChatGPT, and OPT with ZS. Fall 2017 is the first semester, so few-shot learning results based on previous semesters are unavailable (NA). Spring 2020 and Fall 2020 did not have final exams due to COVID-19. Three final exams became available online after GPT-3, ChatGPT, and Codex were trained, ensuring that the models are not overfitting content seen during training. The results consistently demonstrate that for zero-shot, ChatGPT is best, and that few-shot learning methods perform best across semesters and topics, as marked in bold.

Limitations
Our dataset consists of all question parts and their solutions, including images. However, our baseline methods do not handle questions that rely on an image containing the information required to solve the question, since GPT-3, ChatGPT, and Codex do not handle images. Tables 6 and 7 show the breakdown of the number of question parts and points of questions that do not rely on image information for answering the question. On average, 27.55% of the question parts, which make up 30.32% of the points in final exams, are questions that rely on image information. The points attributed to the non-image parts are tallied, recorded, and used to calculate non-image percentage grades.

Generating New Questions
Creating new, high-quality questions is crucial for assessing student learning, but it is often a demanding, time-consuming process for course instructors and TAs. These new questions must vary from past questions while testing the same core concepts. We evaluate the potential of using language models with few-shot learning to write exam content accurately and efficiently.

Questions to Questions.
We use the dataset of exam questions to generate new questions automatically. We use questions from our dataset as prompts to create new high-quality questions not present in our dataset. We create a list of various questions from our curated dataset and use the resulting list to prompt the model to create a new question.
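The list-prompting step can be sketched as follows. The exact prompt wording is an assumption on our part (the paper does not give it verbatim); the sketch shows the general pattern of listing numbered dataset questions and leaving the next number for the model to continue.

```python
def question_generation_prompt(example_questions, topic):
    """Assemble a prompt that lists curated dataset questions on a topic and
    ends with the next list number, inviting the model to write a new one.
    The surrounding wording is illustrative, not the paper's exact prompt."""
    lines = [f"Here are some machine learning final exam questions on {topic}:"]
    for i, question in enumerate(example_questions, 1):
        lines.append(f"{i}. {question}")
    # Ending with the next number prompts the model to continue the list
    # with a new, previously unseen question.
    lines.append(f"{len(example_questions) + 1}.")
    return "\n".join(lines)
```

The completion returned by the model for the final list entry becomes the candidate new question.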

Notes to Questions.
We use the course notes to generate new programming questions automatically. We first pre-process the course notes to extract key concepts and information from the text, and then append this information to the prompt to create new high-quality programming questions. This automatic, scalable approach ensures that the generated questions closely align with the course material, which is particularly important in educational settings where the quality and relevance of questions are critical for a challenging and engaging learning experience. One significant concern is the correctness of the generated programming questions and answers. To verify their accuracy, we tested our approach on a textbook and generated 150 questions automatically. We found that all of the generated questions and answers were correct; they were subsequently used by educators to assess students' understanding of programming concepts.
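A minimal illustrative sketch of the pre-processing step is shown below, assuming a simple frequency-based notion of "key concepts"; the paper does not specify its extraction method, so both the heuristic and the prompt wording are our own assumptions.

```python
import re
from collections import Counter

def notes_to_question_prompt(notes, n_keywords=8):
    """Extract frequent content words from a passage of course notes and
    append them to a prompt asking for a programming question.
    The stopword list and prompt phrasing are illustrative assumptions."""
    stopwords = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on",
                 "that", "we", "with", "by", "be", "are", "as", "it"}
    words = [w.lower() for w in re.findall(r"[a-zA-Z]+", notes)]
    counts = Counter(w for w in words if w not in stopwords and len(w) > 3)
    keywords = [w for w, _ in counts.most_common(n_keywords)]
    return ("Write a medium-level programming question and answer based on: "
            + notes.strip()
            + "\nKey concepts: " + ", ".join(keywords))
```

In practice, the assembled prompt is sent to the language model, whose completion is the generated programming question and answer.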
Figures 1 and 2 demonstrate the results of this process. The Appendix contains new questions generated from other questions, along with the closest question from our dataset as measured by the cosine similarity of the question embeddings, as well as new questions generated from course notes. These new questions are diverse and qualitatively similar to questions on previous MIT final exams. This provides an efficient way for course TAs and instructors to generate new final exam questions.

Meta-Questions
There have been concerns about using language models in the classroom and about the models' correctness, completeness, and originality. Instead of banning their use in class, we embrace them by forming new meta-questions regarding correctness and completeness. Specifically, in addition to standard questions, we suggest a new class of final questions that consists of questions and their answers by the language model. For each question, the students are asked to identify whether the answer is correct. If the answer is correct, the students are asked to explain why. If the answer is incorrect, the students are asked to write the correct answer and provide a complete explanation. Teaching students how to use language models in an educational setting helps them develop the skills they need to navigate and critically evaluate the material. Learning to prompt language models also helps students complete tasks more quickly and accurately, increasing their productivity and efficiency.

Student Survey
To evaluate the machine-generated questions, we conducted an anonymous online student survey comparing them with the human-written questions in terms of quality, appropriateness relative to the course, and question difficulty. We surveyed 15 students who have taken the Introduction to Machine Learning course or its equivalent. The survey was optional and included informed consent, with the following description: "We are conducting a survey to assess the quality and difficulty of automatically generated questions for an introductory machine learning course final. You will be presented with a series of questions, either human-written (taken from an actual course final exam) or machine-generated, but you will not be told the source of a given question. For each question, you will be asked (a) whether you think the question is human-written or machine-generated, (b) whether the question is appropriate for the given course final, and finally (c) how you would rate the difficulty of the question. Please carefully read each question and answer to the best of your ability."
We randomly sampled one generated question and its closest (measured by cosine similarity) original, human-written question for each of the twelve machine learning topics. Students were asked to read these 24 questions in the survey, mixed and presented randomly, and then answer three questions for each: (1) "Is the question human-written or machine-generated?", (2) "Is the question appropriate or not appropriate for the specific course final?", and (3) "What is the question's difficulty level on a scale between 1 (easiest) and 5 (hardest)?". We asked the students to provide ratings and not to solve the questions. The results of our survey are as follows: of the human-written questions, students identified 56.11% correctly as human-written and 43.89% incorrectly as machine-generated. Of the machine-generated questions, students identified 45% correctly as machine-generated and 55% incorrectly as human-written. The difficulty ratings were between 1 (easiest) and 5 (hardest). Students rated machine-generated questions at a difficulty of 2.55 with a 1.11 standard deviation and human-written questions at a difficulty of 2.85 with a 1.12 standard deviation. Students rated machine-generated questions as appropriate 82.6% of the time and human-written questions as appropriate 85.0% of the time.

Table 4: We benchmark different baselines for each semester, excluding question parts that rely on images. We compare the average grade of GPT-3 (text-davinci-002 and 003) with zero-shot (ZS), ChatGPT, GPT-3 with few-shot (FS) learning, GPT-3 with ZS and chain-of-thought (CoT) prompting, GPT-3 with FS and CoT prompting, Codex with ZS, Codex with FS, and OPT with ZS. MIT Fall 2017, Cornell Spring 2017, and Harvard Spring 2015 were the first semester for each university, so few-shot learning results based on previous semesters are unavailable (NA). The result of the best-performing method for each semester is marked in bold.
The conclusions we draw from the survey are that (1) survey participants considered human-written questions as likely to be human-written as machine-generated and, similarly, considered machine-generated questions equally likely to be machine-generated as human-written, and (2) survey participants considered the machine-generated questions slightly easier than human-written questions.

Generated Question: "We have a data set of images, each of which is either a picture of a dog or a picture of a cat. We want to train a binary classifier to distinguish between the two classes. Which of the following is a better choice of loss function for this task: (a) the 0-1 loss, which gives a loss of 1 if the classifier predicts the wrong class and a loss of 0 otherwise; (b) the cross-entropy loss, which is given by -log(p) if the correct class is predicted and -log(1-p) otherwise, where p is the predicted probability of the correct class; (c) the hinge loss, which is given by max(0, 1-s) if the correct class is predicted and max(0, 1+s) otherwise, where s is the predicted score for the correct class."

Closest Question: "Consider the following 2D dataset in (x, y) format: ((1,-1), +1), ((1,1), +1), ((1,2.5), +1), ((2,-2), -1), ((2,1), +1), ((2,3), +1), ((5,-1), -1), ((5,-2), -1). Break ties in distance by choosing the point with the smaller first coordinate, and if still tied, by the smaller second coordinate. Compute the leave-one-out cross-validation accuracy of the 3-nearest-neighbor learning algorithm on this dataset."

Table 6: The number of question parts that do not rely on images, the number of points that do not rely on images, and the number of non-open question points, in finals for each semester.

Prompt: Write a medium-level programming question and answer based on: "Bootstrap aggregation is a technique for reducing the estimation error of a non-linear predictor, or one that is adaptive to the data.
• Construct $B$ new data sets of size $n$ by sampling with replacement from $D$.
• Train a predictor $\hat{f}_b$ on each one.
• Regression case: the bagged predictor is $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x)$.
• Classification case (majority bagged predictor): let $\hat{y}_b(x)$ be a "one-hot" vector with a single 1 and $K-1$ zeros, whose nonzero entry is at $\arg\max_k \hat{f}_{b,k}(x)$. Then $\hat{p}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{y}_b(x)$, which is a vector containing the proportion of classifiers that predicted each class $k$ for input $x$, and the predicted output is $\hat{f}_{\mathrm{bag}}(x) = \arg\max_k \hat{p}_{\mathrm{bag},k}(x)$."

Generated Question and Answer: "Implement a function in Python using the bootstrap aggregation algorithm for regression and classification. The function should take the following parameters as inputs:
• dataset: a dataset D of size n.
• n_models: an integer representing the number of models to be trained.
• train_func: a function that trains a single model given the training data.
• predict_func: a function that takes a trained model and makes predictions on a test dataset.
The function should return the weighted average of all models' predictions for regression and the majority vote of all models' predictions for classification."
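One way to answer the generated programming question is sketched below. This is our own illustrative implementation, not the paper's reference answer; it uses a simple (unweighted) average for the regression case, as in standard bagging.

```python
import random
from collections import Counter

def bagged_predictor(dataset, n_models, train_func, predict_func,
                     task="regression"):
    """Bootstrap aggregation over n_models base learners.
    dataset: list of training examples; train_func(data) -> model;
    predict_func(model, x) -> prediction. Returns a function x -> prediction."""
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(dataset) examples with replacement.
        sample = [random.choice(dataset) for _ in dataset]
        models.append(train_func(sample))

    def predict(x):
        predictions = [predict_func(m, x) for m in models]
        if task == "regression":
            # Bagged regression predictor: average of the model predictions.
            return sum(predictions) / len(predictions)
        # Bagged classification predictor: majority vote over predicted classes.
        return Counter(predictions).most_common(1)[0][0]

    return predict
```

The trivial usage below trains each base model as the mean of its bootstrap sample, so the bagged regression prediction is exactly that mean.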

Figure 1: Example of a question generated from other questions. The closest question among the example questions is very different from the newly generated question.

Figure 2: Example of a programming question generated from course notes.

Table 1: Human and machine grading of human- and machine-solved final exams. Mean human and machine grades (out of 100) on MIT Introduction to Machine Learning final exams by semester. Non-image grades consider question parts that do not contain images required for solving the question.

Table 2: The number of questions and parts in the final for each semester of the machine learning courses.

Table 3: The number of questions and parts in all finals for each topic in the machine learning courses. Topics can have half-questions attributed to them if a question has some parts under one topic and the other parts under another topic.

Table 5: We benchmark different baselines for each course topic, excluding question parts that rely on images. We compare the grade of GPT-3 (text-davinci-002 and 003) with zero-shot (ZS), ChatGPT with zero-shot (ZS), GPT-3 with few-shot (FS) learning, GPT-3 with zero-shot and chain-of-thought (CoT) prompting, GPT-3 with FS and CoT, Codex with zero-shot, Codex with few-shot learning, and OPT with ZS. The question parts on loss functions rely on image information and are therefore unavailable (marked NA). The result of the best-performing method for each topic is marked in bold.