Solving Proof Block Problems Using Large Language Models

Large language models (LLMs) have recently taken many fields, including computer science, by storm. Most recent work on LLMs in computing education has shown that they are capable of solving most introductory programming (CS1) exercises, exam questions, Parsons problems, and several other types of exercises and questions. Some work has investigated the ability of LLMs to solve CS2 problems as well. However, it remains unclear how well LLMs fare against more advanced upper-division coursework, such as proofs in algorithms courses. After all, while known to be proficient in many programming tasks, LLMs have been shown to have more difficulties in forming mathematical proofs. In this paper, we investigate the ability of LLMs to solve mathematical proofs by using Proof Blocks, a tool previously shown to efficaciously teach proofs to students. Our results show that GPT-3.5 is almost completely unable to provide correct solutions (11.4%), while GPT-4 shows a significant increase in correctness (64.8%). However, even given this improvement, current models still struggle to correctly order lines in a proof. It remains an open question whether this is a temporary situation or if LLMs will continue to struggle to solve these types of exercises in the future.


INTRODUCTION
Large language models have quickly revolutionized computing education [10,36]. Tools such as GPT-4, Codex, and GitHub Copilot can write code from English language prompts [9], solve first and second semester programming assignments [15,16], generate assignments [41], interpret programming error messages [24], and more. Students are already using them to write code and help solve their programming homework assignments in both helpful and unhelpful ways [37]. Researchers have been quick to point out the risks in student utilization of these tools, such as over-reliance and not recognizing inherent biases [2,10].
Although it is clear that LLMs can easily solve much of the programming curriculum, one key area not yet explored is that of proofs. As a core part of a discrete mathematics course, proof writing remains a critical piece of the computer science curriculum [45], forming a foundation for many upper-level algorithms courses. When GPT-4 was introduced, one of the key results researchers at OpenAI touted was its performance on math problems compared to its predecessor. This continued a trend of each new model becoming more capable of solving math equations, similar to the increase seen from GPT-3 to GPT-3.5 [21]. How this performance increase carries over to a computer science curriculum remains to be seen. One recent (static) attempt to provide automated feedback for proof writing is Proof Blocks, implemented by Poulsen et al. [35]. Proof Blocks, similar to Parsons problems [14], is a drag-and-drop interface for arranging lines in a mathematical proof [35]. Recent work has shown that LLMs can be used to solve Parsons problems [38]. In this paper, we extend previous work on both Proof Blocks and solving Parsons problems via LLMs by benchmarking GPT-3.5 and GPT-4 against 128 Proof Blocks questions.
In our work, we are guided by the following research questions:
RQ1 How do GPT-3.5 and GPT-4 perform in solving Proof Blocks problems?
RQ2 Are there Proof Blocks problems that are challenging to solve for GPT-3.5 and GPT-4?
This article is organized as follows. In Section 2, we discuss related work. Section 3 presents our methodology, including the data used in the study and how solutions generated by LLMs were evaluated. We present the results of the study in Section 4, which are then discussed in Section 5. Section 5 also outlines the limitations of our work. Section 6 concludes the work, presenting answers to our research questions and outlining avenues for future work.

RELATED WORK
In this section, we discuss related work on the use of generative AI coding tools, Proof Blocks in general, and the use of AI for mathematical proofs.

Use of Generative AI Coding Tools
Generative AI tools such as ChatGPT and Copilot are widely expected to drastically change the landscape of programming [1,4,8,10] and software engineering [7] education. Such tools have proven proficient in producing code for CS1 [15], can work with Parsons problems [38], and can also deal with more advanced CS2 problems [16] and object-oriented concepts [6]. Despite the fact that these tools are not perfect, their progress in the last two years has been rapid and this improvement shows no signs of abating. A new book aimed at university-level introductory programming even advocates learning AI-assisted programming from day one [31]. These tools are also capable of more advanced tasks such as code competition problems [25], are being used by professional developers [3], and are expected to add trillions of dollars to global GDP [11].
These tools raise several challenges and opportunities for computing education beyond simply allowing students to generate code and the obvious academic integrity concerns [2,10]. In particular, they can help instructors with many tasks [26]. For instance, they can generate programming questions including test cases and solutions [41], provide feedback to students [22], be used for grading [27] and for answering help requests [19], and classify and answer student questions on discussion boards [48].
As students begin to use these tools, several of the threats identified by researchers have come into clearer focus. For instance, Prather et al. observed students using GitHub Copilot, which is a generative AI coding tool that can produce highly accurate code suggestions [37]. They found that students often do not understand the code automatically generated by the tool simply because they did not write it. They also found that students will quickly accept incorrect code suggestions and tinker with that code before discovering they don't need it and deleting it, only to start the process over again. Their final finding was that some students used the AI tool to help them toward their goal, discovering new ways to achieve it, and doing so more quickly. Other user studies have found similar results [20,46]. Although Proof Blocks are a different kind of exercise than open-ended code writing, there are similar concerns. Any kind of LLM-based tool for Proof Blocks would need to ensure that students do not become over-reliant on it and that it does not overwhelm them with feedback they don't understand (this is especially possible with proofs). The present work focuses on benchmarking LLMs against Proof Blocks problems, but future work that utilizes LLMs to provide feedback, hint generation, or even exploratory features must keep these concerns in mind.

Proof Blocks
Understanding and constructing mathematical proofs is an essential aspect of the discrete mathematics curriculum, yet it is a very difficult topic for many students [18,28]. Although there are many individual aspects that can be challenging for learners [43], even when all prerequisite knowledge is known, students are often unable to put the different elements together to correctly construct a proof without appropriate scaffolding [47]. To address this, Poulsen et al. introduced the idea of 'Proof Blocks' in 2021, a novel software tool leveraging a drag-and-drop mechanism to enable learners to assemble proofs from pre-written lines [34]. Inspired by the idea of Parsons problems [12,14], the goal of Proof Blocks was to provide scaffolding to students learning to write mathematical proofs, in much the same way as Parsons problems provide scaffolding to students learning to write code. However, unlike a Parsons problem in which code lines typically must be arranged into a unique order, Proof Blocks problems are more flexible: a line only needs to appear after the lines it depends on, so several orderings can be correct.
Research exploring Proof Blocks has shown their significant potential for improving learners' proof comprehension and for fostering efficient learning [32,35]. Students in the early phases of learning about proof by induction learned just as much from reading lecture notes and using Proof Blocks as they did by reading lecture notes and writing proofs from scratch, and did so while saving significant time [32]. Not only do students believe that Proof Blocks accurately represent their ability to write proofs, but when used as test questions they provide approximately the same amount of information about student knowledge as do written proofs [35].
To facilitate efficient grading of Proof Blocks, Poulsen et al. describe an autograder that uses a dependency graph to capture the relationships between subsets of blocks, and which grades as correct any topological sort of the graph [35]. The autograder allows for swift feedback to students and provides extensive opportunities for problem generation. Work on autograding Proof Blocks problems has been expanded to include partial credit grading, an aspect that poses computational challenges due to the vast solution space and the expense of calculating the difference between an incorrect solution and a model solution. One novel algorithm for computing such an edit distance exhibited enormous performance improvements, up to two orders of magnitude, when compared to a naïve approach. This algorithm can also be used to provide feedback to students when solving Proof Blocks problems, and could be applied to other problems such as Parsons problems [33]. Figure 1 shows an example of a Proof Blocks problem taken from https://www.proofblocks.org/. The figure also gives an idea of what the Proof Blocks user interface looks like for students when they are solving Proof Blocks problems.
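The dependency-graph grading described above can be illustrated with a minimal Python sketch. It is not the authors' autograder: the function name `is_valid_proof_order`, the block labels, and the example graph are hypothetical, and the sketch only checks the property stated in the text, namely that a submission is accepted when it is a topological sort of the dependency graph.

```python
from typing import Dict, List, Set


def is_valid_proof_order(order: List[str], depends_on: Dict[str, Set[str]]) -> bool:
    """Return True if `order` uses every proof block exactly once and each
    block appears only after all blocks it depends on (a topological sort)."""
    # Reject missing blocks, repeated blocks, and blocks that are not part of
    # the proof (for example, distractors).
    if sorted(order) != sorted(depends_on):
        return False
    seen: Set[str] = set()
    for block in order:
        # Every prerequisite of this block must already have been placed.
        if not depends_on[block] <= seen:
            return False
        seen.add(block)
    return True


# Hypothetical four-line proof: B and C both rely on A, and D relies on B and C.
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

print(is_valid_proof_order(["A", "B", "C", "D"], deps))  # accepted
print(is_valid_proof_order(["A", "C", "B", "D"], deps))  # also accepted
print(is_valid_proof_order(["A", "D", "B", "C"], deps))  # rejected: D placed too early
```

The second call shows the flexibility relative to Parsons problems: B and C do not depend on each other, so either relative order is graded as correct.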

AI for Mathematical Proofs
Mathematicians and computer scientists have chased the promise of automating the construction of mathematical proofs for years. The earliest history starts with heuristic search algorithms over a search space limited by a formal theorem proving language [39]. More recently, these search algorithms have been supplanted by reinforcement learning approaches, but even the best attempts have less than a 50% success rate in proving the lemmas and theorems in benchmark data sets [17,40]. Researchers have sought improvements to these heuristic search algorithms using LLMs [49].
In parallel, researchers have worked on creating language models to solve math problems formulated in the informal mathematical language typically used by mathematicians rather than in formal theorem proving languages, an approach that has proved extremely difficult for many years [30,44]. Most recently, GPT-4 has posted surprisingly good metrics in this area with huge improvements from GPT-3.5 to GPT-4 in all its math related benchmarks, which included AP Calculus exams, the quantitative portion of the GRE examination, and problems from the AMC series of high school mathematics competitions [29]. None of the above present benchmarks on questions like the mathematical proofs used for introductory discrete mathematics courses that are taught by the math and/or computer science departments of most universities. Thus, it remains an open question what the success rate of any AI-powered system would be on such questions.

METHODOLOGY

Data
3.1.1 Proof Blocks problems. As our data set, we obtained from the authors all Proof Blocks problems that have been used in the data sets for prior publications [32,33,34]. Questions were originally written for Discrete Mathematics courses in computer science departments at two large research universities in the midwestern United States, one public and one private, as well as for a research study designed to measure learning gains of students using Proof Blocks. The dataset had in total 128 Proof Blocks questions, of which 91 did not have distractors (additional lines).
3.1.2 Proof Blocks problem solutions. We explored a variety of prompts to identify ones that could be used to produce Proof Blocks solutions with GPT-3.5 and GPT-4. As the final prompt, we used the format outlined in Listing 1, where the prompt starts with the theorem, followed by an instruction to reorder the lines to form a proof of the theorem, an additional instruction not to alter the text in the lines, and the scrambled lines. Because the input to LLMs is ASCII text, notice that the example prompt forgoes mathematical symbols and uses textual descriptions such as \frac{a+b}{2} to encode the fraction (a+b)/2. For each of the 128 Proof Blocks questions in the dataset, we generated five sets of scrambled lines and prompted both GPT-3.5 and GPT-4 to generate solutions for them (see footnote 1). The prompting was done with temperatures 0.0 and 0.7, where the temperature controls the degree of randomness in the outputs. In total, this yielded 2560 Proof Blocks problem solutions (128 questions × 5 scramblings × 2 models × 2 temperatures).

Scrambled lines of the example prompt (continuation of Listing 1):
$\frac{a+b}{2} \geq \sqrt{ab}$.
Since $a, b \geq 0$, $\sqrt{a}$ and $\sqrt{b}$ are real.
$(\sqrt{a} - \sqrt{b})^2 \geq 0$, since the square of any real number is non-negative.
$a+b - 2\sqrt{ab} \geq 0$.
$a+b \geq 2\sqrt{ab}$.
Let $a, b$ be arbitrary non-negative real numbers.
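The prompt construction described above could be sketched in Python as follows. This is an illustration under assumptions rather than the script used in the study: the wording of the template follows Listing 1, but the line breaks, the helper name `build_prompts`, the fixed random seed, and the order in which the proof lines are listed are all hypothetical. No API call is made here.

```python
import random
from typing import List

# Wording follows Listing 1; the exact line breaks are an assumption.
PROMPT_TEMPLATE = (
    "Consider the following theorem: {theorem}\n"
    "Reorder the following lines to form a proof of the theorem. "
    "Do not alter the text in the lines.\n"
    "{lines}"
)


def build_prompts(theorem: str, proof_lines: List[str],
                  n_scrambles: int = 5, seed: int = 0) -> List[str]:
    """Return n_scrambles prompts, each listing the proof lines in a random order."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    prompts = []
    for _ in range(n_scrambles):
        scrambled = list(proof_lines)
        rng.shuffle(scrambled)
        prompts.append(
            PROMPT_TEMPLATE.format(theorem=theorem, lines="\n".join(scrambled))
        )
    return prompts


theorem = (r"For any non-negative real numbers $a, b$ (i.e., $a, b \geq 0$), "
           r"$\frac{a+b}{2} \geq \sqrt{ab}$.")
proof_lines = [
    r"Let $a, b$ be arbitrary non-negative real numbers.",
    r"Since $a, b \geq 0$, $\sqrt{a}$ and $\sqrt{b}$ are real.",
    r"$(\sqrt{a} - \sqrt{b})^2 \geq 0$, since the square of any real number is non-negative.",
    r"$a+b - 2\sqrt{ab} \geq 0$.",
    r"$a+b \geq 2\sqrt{ab}$.",
    r"$\frac{a+b}{2} \geq \sqrt{ab}$.",
]
prompts = build_prompts(theorem, proof_lines)

# Size of the full experiment: questions x scramblings x models x temperatures.
total_solutions = 128 * 5 * 2 * 2
print(len(prompts), total_solutions)
```

Each of the five prompts would then be sent to each model at each temperature, which is where the total of 2560 solutions comes from.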

Evaluating Solutions
The Proof Blocks problem solutions were programmatically extracted from the LLM output. The resulting data were analyzed using the automated Proof Blocks grading algorithm outlined in [33], which relies on determining the edit distance from a given solution to a correct solution. For the present evaluation of Proof Blocks problem solutions, we focused on absolute correctness. That is, if the distance from a given solution to the model solution was greater than zero, it was considered incorrect.
When outlining the results, we first provide an overall model performance in Section 4.1. The subsequent results, outlined in Sections 4.2 and 4.3, focus on the 640 solutions (128 questions × 5 scramblings) created with the best-performing model and temperature, which was in our case GPT-4 with the temperature 0.0.
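To make the strict correctness criterion concrete, the following Python sketch computes a brute-force edit distance from a submitted ordering to the nearest correct ordering and marks the submission correct only when that distance is zero. It is a naive illustration, not the efficient algorithm of [33]: it assumes a Levenshtein distance over block sequences, enumerates every permutation (feasible only for short proofs), and uses hypothetical block labels and a hypothetical dependency graph.

```python
from itertools import permutations
from typing import Dict, Iterator, List, Sequence, Set


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def valid_orders(depends_on: Dict[str, Set[str]]) -> Iterator[List[str]]:
    """Yield every ordering in which each block follows all of its prerequisites."""
    for perm in permutations(depends_on):
        seen: Set[str] = set()
        for block in perm:
            if not depends_on[block] <= seen:
                break
            seen.add(block)
        else:
            yield list(perm)


def min_edit_distance(submission: Sequence[str], depends_on: Dict[str, Set[str]]) -> int:
    """Distance from the submission to the closest correct ordering."""
    return min(levenshtein(submission, order) for order in valid_orders(depends_on))


def is_correct(submission: Sequence[str], depends_on: Dict[str, Set[str]]) -> bool:
    """Absolute correctness: any nonzero distance counts as incorrect."""
    return min_edit_distance(submission, depends_on) == 0


# Hypothetical proof: B and C rely on A, D relies on B and C; "X" is a distractor.
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(is_correct(["A", "C", "B", "D"], deps))
print(min_edit_distance(["A", "D", "B", "C"], deps))
print(min_edit_distance(["A", "B", "X", "C", "D"], deps))
```

In this sketch a submission that includes a distractor is always at distance at least one from every correct ordering, so it is counted as incorrect under the strict criterion.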

RESULTS
Footnote 1: The data were generated using OpenAI APIs in August 2023. GPT-4 corresponds to the gpt-4 model, while for GPT-3.5, we used the most recent gpt-3.5-turbo.

Listing 2: GPT-3.5 and GPT-4 solutions to a combinatorics Proof Blocks problem.

SCRAMBLED BLOCKS:
We are trying to match each student to the sum of eight numeric digits.
Since each digit ranges from 0 to 9, the possible sums range from 0 to $8 \times 9 = 72$.
Therefore, there are 73 possible sums.
Since there are 39 students enrolled, each student can have a unique sum.
To guarantee that there are two students who have the same sum, we need strictly more students than possible sums.
Since there are 73 possible sums, we require 74 students.

GPT-3.5 SOLUTION: (score 33%)
We are trying to match each student to the sum of eight numeric digits.
Since each digit ranges from 0 to 9, the possible sums range from 0 to $8 \times 9 = 72$.
Since there are 73 possible sums, we require 74 students.
To guarantee that there are two students who have the same sum, we need strictly more students than possible sums.
Therefore, there are 73 possible sums.
Since there are 39 students enrolled, each student can have a unique sum.

GPT-4 SOLUTION: (score 100%)
We are trying to match each student to the sum of eight numeric digits.
Since each digit ranges from 0 to 9, the possible sums range from 0 to $8 \times 9 = 72$.
Therefore, there are 73 possible sums.
Since there are 39 students enrolled, each student can have a unique sum.
To guarantee that there are two students who have the same sum, we need strictly more students than possible sums.
Since there are 73 possible sums, we require 74 students.

Overall Model Performance
The model performance for GPT-4 and GPT-3.5 is outlined in Table 1 for temperatures 0.0 and 0.7. The overall success rate for GPT-3.5 is 11.4% and 11.7% for temperatures 0.0 and 0.7 respectively, while for GPT-4 the overall success rate was 64.8% and 61.9% for the same temperatures. This suggests that the lower temperature works slightly better for GPT-4 when solving Proof Blocks problems, but the effect is not large (about 3 percentage points), and for GPT-3.5 the difference between temperatures is negligible. More interestingly, GPT-4 vastly outperforms GPT-3.5, correctly solving Proof Blocks problems over half of the time compared to roughly 11% for GPT-3.5. An example of how GPT-3.5 and GPT-4 solve a combinatorics Proof Blocks problem is shown in Listing 2. In this case, GPT-4 got the problem fully correct, while GPT-3.5 was partially correct, scoring 33%.

Proof Blocks without Distractors
We continued the analysis by focusing on the performance of GPT-4 in solving a variety of Proof Blocks problems without distractors. The overview of the results, including the topics of the problems and the correctness of the GPT-4 produced solutions, is outlined in Table 2.
From the table, we can see that overall, GPT-4 can solve Proof Blocks problems quite accurately when there are no distractors, being able to solve them on average 73% of the time. There are differences between topics in how well GPT-4 can solve Proof Blocks, ranging from 54% for questions related to 'combinatorics' (see Listing 2) to 100% for questions on 'algorithm analysis' and 'sets, functions'.
Model     Temperature   Overall Success Rate
GPT-3.5   0.0           11.4%
GPT-3.5   0.7           11.7%
GPT-4     0.0           64.8%
GPT-4     0.7           61.9%

Table 1: Comparison of performance of models. This is consistent with prior results that lower temperature is better for more technical and less creative tasks, and that GPT-4 vastly outperforms prior models on mathematical tasks.

Proof Blocks with Distractors
Finally, we studied the performance of GPT-4 in solving a variety of Proof Blocks problems with distractors. The overview of the questions and topics, as well as the correctness of the GPT-4 produced solutions, is outlined in Table 3. Altogether, having distractors seems to hurt GPT-4's performance in solving Proof Blocks problems, as the overall performance for questions without distractors was 73%, but only 44% for questions that had distractors. From the table, we can see that there are again differences between topics. Similar to the problems without distractors, 'combinatorics' problems are the hardest for GPT-4 to solve, with performance at 20%. When distractors are used, the easiest topic for GPT-4 is 'Pigeonhole Principle' with a success rate of 77%. Interestingly, GPT-4's performance on problems on this topic was actually better compared to performance without distractors (70%). Also interestingly, out of the topics that had questions with distractors, 'Cardinality' had the best performance without distractors (93%), but this was not reflected in the problems with distractors.
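As a rough consistency check on the figures above, the two per-group success rates can be combined into an overall rate, weighting each group by its number of questions. The short Python sketch below does this arithmetic; it assumes that every question contributes equally (each has five solutions) and uses the rounded rates reported in the text, so the result is only approximate. The function name `combined_rate` is hypothetical.

```python
def combined_rate(n_without: int, rate_without: float,
                  n_with: int, rate_with: float) -> float:
    """Question-weighted average of the two per-group success rates."""
    return (n_without * rate_without + n_with * rate_with) / (n_without + n_with)


n_total = 128                  # all Proof Blocks questions
n_without = 91                 # questions without distractors
n_with = n_total - n_without   # questions with distractors

overall = combined_rate(n_without, 0.73, n_with, 0.44)
print(f"{n_with} questions with distractors, combined rate {overall:.1%}")
```

The combined value comes out at roughly 64.6%, which is in line with the 64.8% overall success rate reported for GPT-4 at temperature 0.0; the small gap is what one would expect from using rounded inputs.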

DISCUSSION

LLMs and Solving Proof Blocks
Our results highlight that state-of-the-art LLMs such as GPT-4 are rather capable of solving Proof Blocks problems. Distractors make the problems harder to solve, as Proof Blocks problems with distractors were solved only 44% of the time, while those without distractors were solved 73% of the time. This is in line with the performance of LLMs in solving Parsons problems, where problems with distractors were in general harder to solve than problems without distractors [38]. There were considerable differences in the performance between topics. For problems without distractors, the worst performance was for combinatorics (54% correctness), while two topics were solved perfectly (algorithm analysis, and sets and functions). For problems with distractors, the correctness ranged from 20% for the combinatorics problems to 77% for the pigeonhole principle. Similar variation in LLM performance by problem has been observed when using LLMs to address programming-related help requests [19], where the ability to address the help requests depended heavily on the problem and the manner in which the help request was phrased.
One reason for this difference in performance between topics could be the training data for the model. It could be that GPT-4 was trained on more data related to algorithms and to sets and functions than on data related to combinatorics. Another explanation could be that GPT-4, as a next-token predictor, is not actually searching the state space for these problems and therefore naturally lends itself better to certain types of Proof Blocks problems.
One implication of the results is that, for the topics where performance is good, LLMs could most likely be used to support students who are solving Proof Blocks problems. If the LLM can solve the problem, it might be able to generate a hint for students on what block to move next and where the block should be moved.
Although GPT-4 provided the best results, it is possible that others may not observe similar results even with the same prompts. Recent research has shown that the performance of GPT-3.5 and GPT-4 has changed over time, and the change has not always been an improvement [5]. This highlights a worrisome issue in relying on closed LLMs for research and practice; in effect, these observations call for further developments and research into open LLMs.

Comparison to Student Performance
In general, Proof Blocks problems are harder for students than typical multiple choice questions, but not as difficult as written proof questions [34]. The exam data from the evaluation of Proof Blocks questions as test questions from Poulsen et al. [34] contained 22 questions across 6 different topics. On these problems, students got the problem correct 61% of the time on their first attempt, and had gotten the problem correct by their third attempt 85% of the time [34]. Student data is only publicly available for some of the questions, and which questions students were given with distractors is confounded with the topics. Thus, while we cannot draw any particular comparisons between GPT-4 and students on particular topics, we see that the performance of GPT-4 is roughly similar to the performance of students who have been taught the material: slightly better on problems without distractors, and worse on problems with distractors.
The finding that the performance of GPT-4 is similar to that of students in solving Proof Blocks problems suggests that these problems are more difficult for GPT-4 than, for example, code writing tasks and creating code explanations. Prior work has found that even Codex, which is an earlier, less capable model compared to GPT-4, performed better than the average student in code writing tasks [15]. Similarly, GPT-3 seems to outperform students in its ability to explain code in natural language, as the explanations generated by GPT-3 were rated as being easier to understand and being better summaries of the code compared to explanations created by students in prior work [23].

Evolution of LLMs
Our results also highlight the impact of the evolution of LLMs. In our case, GPT-3.5 had an average success rate of 11.4%, while GPT-4 had an average success rate of 64.8% (both for temperature 0.0). Such an improvement is considerable, and suggests that future improvements to LLMs might also yield improvements in their capability of solving Proof Blocks problems. In the broader CER literature, the improvement and evolution of LLMs has been highlighted in multiple areas. As an example, there is a considerable difference in the performance of Codex and GPT-3.5 in solving students' programming-related help requests [19]. Similarly, the performance of GPT-4 in passing various programming assessments is better compared to earlier models [42]. Using data from three Python courses with varying assessments including multiple-choice questions, programming exercises, and large projects, Savelka et al. found that GPT-3's performance was such that it would have failed the courses, while GPT-4 would have passed the courses easily.
While for some tasks such as Proof Blocks the performance of GPT-4 is vastly better compared to earlier models, in code generation GPT-4 has shown comparatively moderate improvements over Codex. This might be due to earlier models such as Codex already having quite impressive performance in code generation [15,16], so there is less room for GPT-4 to improve.

Limitations
Our study comes with a number of limitations, which we address here. First, although we explored a number of prompts (i.e., did prompt engineering), we cannot state that the prompts that we used are the best possible ones for solving Proof Blocks. This is an inherent problem of large language models where, due to the vast parameter space, finding an optimal prompt is practically impossible; the problem is exacerbated by semantically similar prompts potentially leading to different outcomes [13]. Second, because the performance of GPT-3.5 and GPT-4 on the same tasks can change over time due to updates on the side of the model developers [5], it is possible that our results could not be replicated in the future even with the same dataset and the same models. There is a need for open LLMs that can be versioned and studied in more detail. Third, we used so-called 'zero-shot' prompting (i.e., we did not provide any examples of solving Proof Blocks to the model) and did not engage in dialogue with the model for producing the answers. Thus, it is possible that students who were to use ChatGPT or a similar system that keeps track of the conversation history could observe better performance in the tasks.

CONCLUSION
In this study, we explored the potential of LLMs for solving Proof Blocks problems. Proof Blocks problems are problems where students are given a theorem and a set of scrambled lines that need to be ordered to form the proof for the theorem. To summarize, our research questions and their answers are as follows.
Question: How do GPT-3.5 and GPT-4 perform in solving Proof Blocks problems? Answer: Both GPT-3.5 and GPT-4 can solve Proof Blocks problems, although they have vast differences in performance. In our study, GPT-3.5 was able to solve approximately 11% of the given Proof Blocks, while GPT-4 achieved an almost 65% success rate in solving the problems.
Question: Are there Proof Blocks problems that are challenging to solve for GPT-3.5 and GPT-4? Answer: In short, yes. Focusing on GPT-4, as its performance is considerably better than the performance of GPT-3.5, we observed that problems with distractors were considerably more difficult than problems without distractors. For problems without distractors, GPT-4 was able to solve approximately 73% of the given problems, while for problems with distractors, the corresponding number was 44%. Furthermore, we also observed that there are considerable differences in performance depending on the topic of the Proof Blocks problem.
Our results highlight the possibility of using LLMs such as GPT-4 for solving Proof Blocks problems. The practical implications of this include the possibility of using GPT-4 and future LLMs as an additional tutor for solving Proof Blocks problems, which can help students improve their proof comprehension [32,35].
This work opens up multiple research directions. As GPT-4 was quite successful in solving Proof Blocks problems, a natural next step would be analyzing its performance in solving free-form proofs. These could be manually graded using rubrics currently in use for student work. If GPT-4 were able to solve free-form proofs, then it could be used by instructors to create example solutions for free-form proof problems. Another avenue for future research is studying in more detail the types of mistakes that GPT-4 makes, and whether prompt engineering could help it solve problems where it failed using our current prompts.

Figure 1: An example Proof Blocks problem and the user interface of the Proof Blocks tool.

Listing 1: Example prompt
Consider the following theorem: For any non-negative real numbers $a, b$ (i.e., $a, b \geq 0$), $\frac{a+b}{2} \geq \sqrt{ab}$.
Reorder the following lines to form a proof of the theorem. Do not alter the text in the lines.

Table 2: Performance on Proof Blocks problems without distractors, grouped by topic.

Table 3: Performance on Proof Blocks problems with distractors, grouped by topic.