Generating Multi-Part Autogradable Faded Parsons Problems From Code-Writing Exercises

Parsons Problems and Faded Parsons Problems have been shown to be effective in helping students in programming courses transition from passive learning, such as lectures or textbooks, to active learning in the form of writing code. We present FPPgen, an authoring system that largely automates the conversion of existing open-ended code-writing exercises to Faded Parsons Problems (FPPs). FPP solutions can be machine-checked using either spec-based autograding, in which student solutions are evaluated against instructor-provided test cases, or mutation-based autograding, in which students produce one or more unit tests that are evaluated using mutation testing. Our system allows creating exercises that rely on complex libraries and helper functions, such as student code intended to be run as part of a complex framework-based application. Our system also gracefully supports cumulative multi-part problems, in which later parts build on earlier parts. Python and Ruby exercises are currently supported, but FPPgen is language-agnostic and adding autograders for other languages is straightforward. In our experience so far, instructors can draft simple spec-based questions as open-ended coding exercises in less than an hour and mutation-based questions in about two hours, and student helpers can use our tools to convert these to FPPs in less than an hour. FPPgen is in active use in both beginning and advanced large-enrollment programming courses in a CS undergraduate program at a large US university.


INTRODUCTION, MOTIVATION, CONTRIBUTIONS
Transitioning from passive learning modes such as lectures or textbooks to actively solving open-ended coding questions can be challenging for students. One way to smooth this transition is the use of scaffolding, which entails supportive activities to guide the student during problem-solving. Two such forms of scaffolding are Parsons problems [10], a type of code-manipulation exercise in which students rearrange lines of code to reconstruct a correct solution, and Faded Parsons Problems [13], in which some of those lines may include blanks the student must fill in. Both regular and faded Parsons problems take less time for students to solve than code-writing and code-comprehension problems while providing the same or better learning gains [5, 13]. Without loss of generality, we focus on Faded Parsons Problems (FPPs) in this work, since regular Parsons problems are a special case of FPPs.
Since such problems can be machine-graded, they provide instant feedback to the student and scale well to large-enrollment courses. As with typical code-writing exercises, one type of FPP is graded by running student code against instructor-provided test cases; we refer to this as spec-based autograding, from the term "specification" as used in the software-testing literature to denote a single test case. However, an important special case of teaching advanced coding skills is teaching test-writing [9], which calls for a different autograding strategy. Mutation-based autograding, in which student-authored tests are run against mutations of the system under test (versions into which bugs have been deliberately inserted), checks that some student test case catches an introduced bug. This approach to grading student-authored test cases is inspired by the software engineering literature on mutation testing [1].
The user study conducted by Weinman et al. [13] suggests that FPPs are an efficient "stepping stone" bridging code comprehension and code writing. Our goal is therefore to make it easy for instructors to automatically convert traditional code-writing exercises to scaffolded FPPs that use either spec-based or mutation-based autograding. To that end, the main contribution of this paper is FPPgen, a set of instructor-facing and student-facing tools that automate most of the conversion of open-ended "write this code" and "write these tests" exercises into auto-gradable FPPs.
Unlike other FPP frameworks, FPPgen supports sequenced multi-part FPPs, in which later parts build on earlier parts. FPPgen also supports problems in which the student's code is not standalone but relies on substantial helper functions, or may even be designed to run as part of a larger framework, such as a full-stack Web app. FPPgen's autograders support both spec-based and mutation-based autograding, and indeed can be used for open-ended coding problems as well. Python and Ruby are currently supported, but the system is language-agnostic and readily extensible to support other languages.
The generated FPPs are designed to be administered to students via the open-source and increasingly widely used PrairieLearn [14] assessment authoring system, similar to an LMS. We chose PrairieLearn because its design for extensibility [15] enables the development of many novel assessments; in this paper we describe only the specific aspects of PrairieLearn relevant to the student experience and the instructor (authoring) experience. The interfaces between FPPgen and PrairieLearn are narrow, and porting FPPgen to other LMSs should be straightforward.
In summary, this paper presents a set of tools and a small experience report. The set consists of a generator that translates open-ended, potentially multi-part problems into FPPs; an extension to the PrairieLearn LMS platform that renders FPPs; and an autograder that uses mutation testing as a means to evaluate student-written test suites. We contextualize this set in terms of related work. We describe the student's experience in solving FPPs generated by FPPgen, the instructor's experience in authoring them, and our experience using FPPgen as instructors and teaching assistants. We provide specific guidance for others to adopt FPPgen and note its limitations. We conclude with future work. While these tools are in active use in beginning and advanced courses at our institution, this paper does not attempt to evaluate their pedagogical efficacy in the classroom, nor go into technical detail about the design or implementation of FPPgen.

RELATED WORK
Our contributions extend the state of the art by reducing the friction required to incorporate scaffolding (in the form of Faded Parsons Problems) to address the challenges of teaching both code-writing [13] and test-writing [9].
PrairieLearn includes an existing pl-order-blocks element, which is designed for questions in which students must arrange a list of steps according to some partial ordering. This element has also been used for very simple Parsons problems and for mathematical proofs [12]. However, its autograder is based on comparing the student submission to a directed acyclic graph, greatly limiting the flexibility and complexity of Parsons problems that can be expressed and making FPPs awkward at best. Our true spec-based and mutation-based autograding not only addresses that limitation but allows for extension to multiple languages, multi-part problems, and the inclusion of arbitrary libraries at grading time.
Clegg et al. [3] have used mutation testing to teach test-writing, but the examples described in their paper are limited to writing tests for pure functions with no side effects or helper functions. In contrast, FPPgen supports authoring realistic problems in which the student code, whether tests or otherwise, may rely on extensive helper functions and external frameworks. For example, spec-based code-writing questions can rely on entire Web frameworks as helper libraries, and mutation-based test-writing questions can use rich test-library features such as doubles and mock objects.
Regular Parsons Problems have also been implemented in Runestone [6], a platform for hosting open-source e-textbooks. We chose PrairieLearn because of its ease of extensibility with open Web technologies and third-party libraries [15], but our simple front-end markup (Figure 3) could be consumed by other Parsons Problems implementations as well.
Variations of Parsons Problems offer ways to tune problem difficulty by changing both dynamic and static scaffolding. FPPs let the author vary problem difficulty by deciding what code to fade. Runestone supports Adaptive Parsons Problems, which dynamically adjust difficulty by adding or removing unnecessary or incorrect lines of code ("distractors"). Ericson [4] demonstrates that Adaptive Parsons Problems yield the same types of gains over traditional code writing as Faded Parsons Problems. Ihantola et al. [8] implemented a variant of Parsons problems optimized for touchscreen mobile devices: some tokens in a line of code can be replaced with popup menus allowing the student to select how to complete the line. In contrast to these works [4, 8], our main focus is to streamline authoring rich FPPs as assessment items while maintaining a seamless student experience.
The tool most similar to FPPgen is UPP, created by Fremont et al. [7]: a standalone FPP generator and delivery system that functions at scale. UPP was designed with a focus on the student problem-solving experience, and serves as the platform on which its authors assess the efficacy of certain FPP generation strategies. This contrasts with FPPgen, which focuses on the authoring and grading experience (without diminishing the student experience) for arbitrarily complex FPPs.

STUDENT EXPERIENCE
The student interacts with our exercises using PrairieLearn [14], an open-source and highly extensible LMS developed at the University of Illinois, one of whose key features is that it can be extended with new types of exercises. Each such extension, or element, can present a completely custom question type using essentially arbitrary HTML and JavaScript for student interaction. PrairieLearn's extensibility also allows arbitrarily complex custom autograders to give rich and nearly instant feedback on specific exercises.

Single-Part Questions
In the simplest type of FPP, a student reconstructs a single piece of code, as in the top panel of Figure 1. (This panel is actually the first part of a multi-part question, which we describe next.) In original Parsons Problems, students drag lines of code from a source tray to a target (solution) tray. When authoring using FPPgen, the instructor specifies the layout of these trays. For short problems, the trays can be stacked vertically. Slightly longer problems can have trays side by side if the lines of code are short. However, if the problem is both long and features long lines of code, the instructor can choose to display only a single tray in which the lines can be edited and rearranged in place. The instructor can also specify that some of the lines are provided as "starter code" initially pinned to the correct location in the target tray. (If a single tray is used, pinning lines is still possible, but there is no overt visual cue to the student that those lines are already in their correct positions.) The target tray can optionally be bracketed with instructor-provided static code. This mechanism is particularly useful for multi-part questions (described below), but can also be used in situations in which the open-ended code exercise would show "starter code" that precedes and follows the code to be written by the student.

Multi-Part Questions
Figure 1 shows a multi-part question. We have found that the single-tray display option is visually less cluttered for this type of exercise. As the student progresses to each new part of the exercise, a correct solution to the previous parts appears as part of the static code bracketing the single tray containing the FPP lines. Therefore, after the student solves part 1 (top of figure), the exercise continues with part 2, in which the correct solution to part 1 now appears as static code (lines 1-7). As line 4 of the bottom of the figure shows, the student code calls the library helper functions expect() and raise_error(), which in this case are provided by the unit-testing library RSpec, on which this exercise focuses.
The student experience therefore approximates that of cumulatively adding to the submitted code in each subsequent part. In some cases, the student's constructed code from the previous parts, while correct, might not be an ideal match for proceeding to subsequent parts of a multi-part question. Using the same static code for all students provides a consistent way to grade each part of the question and makes it independent of the other parts, since the same static code will be combined with the student-submitted solution when that part is graded. This approach is consistent with FPPs' goal of pattern exposure through reconstructing expert solutions [13], but should be used in conjunction with enforcing the rule that the student must get full or nearly full credit on each part before being allowed to proceed to the next. (The threshold for proceeding can be set lower than 100% to prevent students from getting stuck.) For this reason, we use multi-part questions only in formative assessments.

Student Feedback From Autograder
For spec-based autograding, the autograder provides PrairieLearn with feedback corresponding to each instructor-provided test case, showing the name or docstring of the test case (e.g. "GiftCard creation with negative balance should raise an exception"); the instructor's choices in authoring the overall assessment (see Section 4) determine when and whether the student is shown the correct reference solution. For mutation-based autograding, as Section 4 explains, a student gets credit when one of their test cases "fails as expected" under a particular mutation. Figure 2 shows an example. The student got credit for correctly checking that an exception is raised when a GiftCard is incorrectly instantiated, which is why the feedback says "failed as intended". But the student's test did not correctly catch the case of attempting to overdraw the account: the student's test passed when it should have failed in the presence of a mutation.

INSTRUCTOR'S AUTHORING EXPERIENCE
In general, to author an FPP, an instructor provides a set of "source files" that can be versioned separately, and our toolchain generates a collection of files in the format expected by PrairieLearn to both render an FPP as part of an assessment and invoke grading software in the context of single-part or multi-part questions. We start with simple standalone single-part questions and then address how external libraries and multi-part questions are handled.

Single-Part Questions
To author a single-part question that uses spec-based autograding, the instructor provides the following files:
• A question prompt in Markdown or HTML, which will be seen by the student
• A reference solution, which the instructor can choose whether to reveal to students after a given number of attempts
• The autograder test cases, the implementation of which is hidden from students
• Any additional helper functions, which may be directly implemented by the instructor (and hidden from the student), specified as external libraries, or a combination of both
• Information about which autograder should be used to grade the question (Python or Ruby, and spec- or mutation-based)
From these files, FPPgen creates the various files needed by PrairieLearn to render and grade a Parsons problem. The way that libraries and helper functions are referenced differs between languages, due to PrairieLearn's built-in Python autograder; FPPgen hides these differences. Optional fading can be done by indicating in the reference solution which tokens should be replaced with blanks, delimiting them with question marks; for example, hello('?dave?') renders as the code line hello('    ') with a blank that the student must fill in before submitting their solution. Note that depending on how fading is done, multiple correct values for the blank might be possible: for example, if one line of code blanks out the name of a variable assignment and a subsequent line uses that variable, the student just needs to use the same variable name in both places rather than one specific name. Figure 3 shows the authoring syntax.
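For example, under the Figure 3 markup, a faded reference solution for a hypothetical Ruby exercise (the function and the choice of faded tokens here are our own illustration) might be written as:

    # Reference solution with fading markup: the def line is pinned into
    # the target tray at indent level 0 (#0given), and each ?...? span
    # becomes a blank the student must fill in.
    def greet(name) #0given
      message = ?"Hello, #{name}!"?
      puts ?message?
    end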

Questions Relying On External Libraries
While such simple questions may suffice for introductory programming courses, advanced courses may require more complex questions. For example, when building full-stack Web applications using a rich framework such as Django or Rails, the framework provides hundreds of library calls for the app to use; a realistic exercise involving such an app would necessarily rely on some of them. Depending on the instructor's needs, either the autograder can load these libraries alongside the student code at autograding time, or instructor-provided helper functions can stub the library calls.
Figure 4 shows an FPP in which the student adds a handler to a Web app using the Sinatra framework. Line 1's post is a special method provided by that framework. Since we don't actually care about the (substantial) side effects of the post action in this exercise, additional support code is provided to the autograder that stubs post, only verifying that the passed arguments are correct. We can simulate the behavior of Sinatra's special data structure params[] in a similar way. Ruby syntax allows params[...] to be syntactic sugar for (params())[...]; that is, params is actually a method call that returns a hash, which in turn can be dereferenced with a key. To equip the autograder with a mock version of params, we simply stub a function that returns a hash with sufficient keys for this exercise and then provide that function to the autograder as support code.
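For concreteness, support code along the lines just described might look like the following minimal sketch (the route-recording mechanism and the params keys are our own illustration, not FPPgen's actual support code):

    # Stub of Sinatra's post: instead of registering a route, record the
    # arguments so the autograder can verify them after the student code runs.
    RECORDED_ROUTES = {}
    def post(path, &handler)
      RECORDED_ROUTES[path] = handler
    end

    # Stub of Sinatra's params: an ordinary method returning a hash with
    # just enough (hypothetical) keys for this exercise.
    def params
      { 'title' => 'Example movie', 'rating' => '5' }
    end

Because params is just a method returning a hash, student code written as params['title'] behaves identically against the stub.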
When an exercise relies heavily on helper functions, question writers may instead specify dependencies on existing external libraries. For example, the question in Figure 1 relies on the RSpec unit-testing library in order to run the student code, which depends on its helper functions (e.g. expect and raise_error). In such cases, libraries can be specified using the standard Ruby mechanism of Gemfiles and packaged for offline use with Bundler (bundler.io, the Ruby library manager); Python libraries must be packaged as detailed in the PrairieLearn documentation (prairielearn.readthedocs.io/en/latest/python-grader/#course-specific-libraries).
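For example, a Ruby question whose autograder must load RSpec might declare it in an ordinary Gemfile (the version constraint is illustrative):

    # Gemfile: declares the libraries the autograder loads alongside student code
    source 'https://rubygems.org'
    gem 'rspec', '~> 3.0'

Bundler can then package the declared gems so the autograder container can run without network access.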

Multi-Part Questions
In PrairieLearn's terminology, a question is one screenful of information in which the student must provide one or more answers to prompts, corresponding to a single-part FPP as described above. A PrairieLearn assessment is a collection of one or more PrairieLearn questions experienced together. The assessment-level configuration determines whether the questions are presented in fixed or random order, how many attempts are permitted, how many points (if any) should be deducted for each incorrect attempt, whether there is a time limit, whether the student should ever be shown the correct reference solution, whether the student must answer each question correctly before moving on to the next one, and so on.
Therefore, when authoring a multi-part FPP such as the one in Figure 1, FPPgen generates individual PrairieLearn question directories for each part. The instructor must sequence the parts into an assessment, determine the number of attempts and points per part, and optionally set a threshold score the student must achieve with repeated attempts on each part before being allowed to go on to the next. FPPgen allows the instructor to specify what student-visible static code should be included in each part (right panel of Figure 1), so that each part can be submitted to the autograder as a standalone question combining the student's submission for that part with the instructor-specified static code.
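As a rough sketch, sequencing two parts in a PrairieLearn assessment configuration (infoAssessment.json) might look like the following; the question IDs and point values are hypothetical, and we assume PrairieLearn's advanceScorePerc option as the mechanism for the per-part threshold:

    {
      "uuid": "...",
      "type": "Homework",
      "title": "GiftCard RSpec tests (multi-part FPP)",
      "advanceScorePerc": 80,
      "zones": [
        {
          "questions": [
            { "id": "giftCardFpp/part1", "points": 3 },
            { "id": "giftCardFpp/part2", "points": 3 }
          ]
        }
      ]
    }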

Questions Using Mutation-Based Autograding
For mutation-based grading, the student work product consists of one or more student-authored test cases, ostensibly designed to test specific behaviors in the given system under test (SUT), and the reference solution consists of one or more correct test cases, possibly with some fading (blanks). In addition to the files needed for a spec-based autograder question, the mutation-based question generator also requires a list of mutations to be applied to the SUT. For example, consider the simple SUT in Figure 5 (left), which implements a withdraw method for a Ruby class that models a retail gift card. The student is asked to write tests that confirm that when the withdrawal exceeds the balance, (a) the balance does not change and (b) an error is recorded. Figure 5 (right) shows one possible correct solution, but there are others: trivially, for example, the tests could be in a different order. Each student test is checked by specifying a mutation to the SUT that would cause a properly-written test to fail. For example, changing >= to <= in line 5 would cause both correct tests to fail, so we say that change is a paired mutation for both tests. Similarly, changing the text "Insufficient balance" in line 7 would cause the "records error" test to fail, so it is a paired mutation for that test only. The student receives credit for each of their tests that fails in the same way as the instructor's tests under their paired mutations.
The instructor specifies the desired mutations in a YAML file using the diff(1) output format (the same format used for distributing software patches to be applied using patch(1)). Each mutation is labeled according to which test case(s) it is paired with, so that the autograder can localize errors in student feedback.
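As a hypothetical sketch (this paper does not show FPPgen's actual schema, so the field names below are our own), a mutation file for the Figure 5 SUT might pair ed-style diff hunks with the tests they target:

    # Field names illustrative only; patches use diff(1) output format.
    mutations:
      - label: flip-balance-check      # paired with both tests
        paired_tests:
          - "does not change balance"
          - "records error"
        patch: |
          5c5
          <     if @balance >= amount
          ---
          >     if @balance <= amount
      - label: change-error-string     # paired with one test only
        paired_tests:
          - "records error"
        patch: |
          7c7
          <       @error = "Insufficient balance"
          ---
          >       @error = "some other string"

At grading time, each hunk can be applied to the SUT with patch(1); a student test earns credit if it fails under its paired mutation in the same way the reference test does.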
The need to author and specify these mutations manually makes mutation-based grading questions more time-consuming to construct; Section 6 describes ongoing work to automate this effort.

ADOPTION EXPERIENCE
PrairieLearn has a well-defined set of interfaces [15] for developing a new element (a combination of client-side HTML and JavaScript plus server-side rendering and autograding that supports authoring a specific type of exercise), such that instructors authoring that type of exercise need only write HTML-like markup rather than code when creating new exercises. By convention, element names in PrairieLearn begin with pl- and describe the type of question supported by the element, such as pl-multiple-choice. PrairieLearn also allows the use of external autograders that run in a Docker container and use another well-defined set of interfaces for communicating with the main PrairieLearn software.
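As an illustration only (the attribute names are hypothetical, since this paper does not document the element's interface), the HTML-like markup for a question built on such an element might take this general shape:

    <!-- Hypothetical usage; attribute names are illustrative. -->
    <pl-faded-parsons language="rb" layout="single-tray">
    </pl-faded-parsons>

In practice, FPPgen's generator emits the needed question markup, so instructors rarely hand-author it.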
FPPgen therefore consists of three components, available publicly in the GitHub repo ace-lab/pl-faded-parsons: (1) a new element for FPPs called pl-faded-parsons; (2) two Docker images that handle spec-based and mutation-based autograding in Ruby, available publicly on Docker Hub as pl-fpp-ruby-autograder (FPPs in Python can use PrairieLearn's existing Python autograder container); and (3) the generator that converts instructors' open-ended code-writing exercises into the files needed to deploy an FPP using the element and the proper autograder (see Section 4).
To adopt FPPgen, an instructor using PrairieLearn downloads the scripts from our public repo and begins authoring questions by providing the question source materials and running our scripts to generate the PrairieLearn question files. If the PrairieLearn build available to the instructor does not include our pl-faded-parsons element, it suffices to copy a single subdirectory from our repo into the instructor's course directory to make the element available within that course. Indeed, the mutation-based and spec-based autograders can also be used to grade open-ended code-writing problems, in which the student is presented with a text panel for entering code rather than solving a Parsons problem.
As instructors working with student teaching assistants (TAs), our own experience creating several spec-based autograding problems, both beginning and advanced, has been that the most challenging part is the initial authoring of the reference solution and the tests, as the example in Section 4.2 suggests. Given a working set of tests and a reference solution, it takes a TA less than an hour to make initial decisions about blanks and prepare/QA a single-part or multi-part PrairieLearn question.
Mutation-based autograder questions take significantly longer to author because the authors must provide paired mutations for evaluating each student test case. Once these are provided, final question generation is straightforward, but debugging paired mutations can be subtle, so finalizing such questions is often an iterative process of modifying the source files and re-running the generators.

FUTURE WORK
Since we expect FPPs to be common in formative assessments, we would like to log each drag-and-drop or fill-in-the-blank interaction that occurs between submissions, in addition to seeing each submission attempt. Adding such logging would allow detection of whether students are "driving in circles," attempting brute-force solutions, or getting stuck in more subtle ruts.
Currently, all test cases associated with a single question (or single part of a multi-part question) are worth the same number of points.This is not usually an issue for formative assessments, but may need to be addressed for wider use on summative assessments.
We are presently exploring automating the decisions of what to fade. For example, in a set of exercises focusing on (say) Python list comprehensions, fading out the syntactic elements directly related to that feature is probably more valuable than fading elements incidental to it. A recent work [13] gives general recommendations such as "do not fade required function arguments or control-flow statements" based on initial user studies, and another [7] investigates a small set of syntactic fading strategies on simple code snippets. We also plan to incorporate automatic mutant generation [11] to streamline the process of creating mutation-based FPPs.
Further tightening of the authoring loop could shrink the gap between the mutation-based and standard problem-authoring experiences. Adding features like content hot-swapping and automatic mutation generation would make rapid question prototyping easier.
Finally, we are adding support for the JavaScript language, since FPPs in Python, Ruby, and JavaScript would cover a large part of the Web development ecosystem.

CONCLUSIONS
We introduced FPPgen, which mostly automates the process of converting open-ended code-writing and test-writing exercises in Ruby or Python into Faded Parsons Problems that students can experience using the PrairieLearn LMS. Unlike other Parsons Problems tools, FPPgen allows sequenced multi-part questions in which each part's code builds cumulatively on previous parts; supports complex questions for which the student code and autograder must rely on external libraries, such as code intended to be integrated into a Web app or other framework-based app; and supports mutation-based autograding, which allows creating test-writing exercises as well as code-writing exercises. Our experience as instructors and teaching assistants using it for authoring and deploying exercises in both introductory and advanced programming courses has been positive: instructors can draft spec-based questions in under an hour and mutation-based questions in about two hours, and student helpers can convert these into Faded Parsons Problems using FPPgen in under an hour. The resulting FPPs are suitable for both formative and summative assessments. FPPgen is freely available for download on GitHub at ace-lab/pl-faded-parsons.

Figure 1: The first two parts of a single-tray multi-part FPP. This problem requires students to create RSpec tests in each part, making previous tests uneditable as they advance.

Figure 2: Partial-credit feedback for a half-correct test suite. The mutation-based autograder flags wrongly passing tests.

Figure 3: Faded Parsons Problem syntax: how markup in the reference solution is interpreted when displaying the exercise as a Faded Parsons Problem.

    Markup                 While Students Solve an FPP
    ?expr?                 Displays a blank rather than expr.
    #blank default text    Sets "default text" as the blank's initial text.
    #ngiven                Pins this line as part of the solution starter code at language indent level n.
    Any other comment      Not displayed.

Figure 4: Example of a question in which the student code relies on helper functions normally provided by the Sinatra web app framework. In this case, the instructor must decide whether to provide stub implementations for the special constructs post() (line 1) and params[] (line 2) normally provided by the framework, or require the framework itself as a set of libraries.

Figure 5: Left: an excerpt from a Ruby class showing the withdraw method for a retail gift card. Right: two test cases checking the behavior of withdrawal when funds are insufficient.
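For reference, the two listings of Figure 5, lightly reformatted (we assume GiftCard exposes balance and error attribute readers, which the excerpt elides):

    # Figure 5, left: the system under test. Comments mark where the
    # instructor's paired mutations are applied.
    class GiftCard
      # other code elided for clarity
      def withdraw(amount)
        # mutation: <= for >=
        if @balance >= amount
          @balance -= amount
        else
          # mutation: "other string"
          @error = "Insufficient balance"
        end
      end
    end

    # Figure 5, right: two test cases checking the behavior of withdrawal
    # when funds are insufficient.
    describe 'withdraw when insufficient funds' do
      before(:each) { @card = GiftCard.new(10.00) }

      it 'does not change balance' do
        @card.withdraw(20.00)
        expect(@card.balance).to eq(10.00)
      end

      it 'records error' do
        @card.withdraw(20.00)
        expect(@card.error).to match(/Insufficient balance/)
      end
    end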