An Investigation of Automatically Generated Feedback on Student Behavior and Learning

Decades of research have focused on the feedback delivered to students after they answer questions: when to deliver feedback and what kind of feedback is most beneficial for learning. While there is a well-established body of research on feedback, new advances in technology have led to new methods for developing feedback, and large-scale usage provides new data for understanding how feedback impacts learners. This paper focuses on feedback that was developed using artificial intelligence for an automatic question generation system. The automatically generated questions were placed alongside text as a formative learning tool in an e-reader platform. Three types of feedback were randomized across the questions: outcome feedback, context feedback, and common answer feedback. In this study, we investigate the effect of different feedback types on student behavior. This analysis contributes to the expanding body of research on automatic question generation, as little research has been reported on automatically generated feedback specifically, and it demonstrates the additional insights that microlevel data can reveal about the relationship between feedback and student learning behaviors.


INTRODUCTION
Providing students with feedback during formative practice is important to the effectiveness of a learning by doing method. Formative practice (no-stakes practice questions with multiple attempts to complete) placed alongside core text content at frequent intervals generates the doer effect, the learning science principle that doing practice is on average six times more effective for learning than reading alone [9] [10] [23]. A critical aspect of formative practice is the feedback students receive after they answer. Cognitive science gives a theoretical foundation for this component of the learning process through working memory and cognitive load research [18] [19], and VanLehn [25] argues for the need to support students in persisting after incorrect responses as an important aspect of the learning process, of which feedback is a key component.
There is a well-established body of research that provides best practices for when to deliver feedback and what types of feedback are best. Dunlosky et al. [3] found that the studies on practice testing that did not find any benefit of practice testing over restudying used no feedback; practice testing with feedback outperforms practice testing alone. In an early intelligent tutoring system (the LISP tutor) at Carnegie Mellon University, feedback was found to minimize the time it took students to learn content [1]. In an analysis of Statistics courseware developed at Carnegie Mellon, immediate targeted feedback was also shown to reduce the time it took students to reach a desired outcome [12]. Regarding the immediacy of feedback, Anderson et al. [1] argued that immediate feedback reduced the amount of time students spent attempting to correct mistakes and increased comprehension of the correct answer. Immediate feedback was also found to increase student satisfaction [16].
The type of feedback used for formative questions has also been shown to matter. Feedback is commonly categorized into three distinct types: knowledge of results (KR), i.e., whether the answer is correct or incorrect; knowledge of correct response (KCR); and elaborative feedback (EF) [16] [17] [6]. Providing only KR feedback is the least effective method, as the learner receives no information on how to improve, though research shows KCR is only slightly more effective [16] [24]. EF typically includes KR or KCR in addition to explanations, hints, strategies, etc., and therefore becomes an additional form of instruction that is more effective than KR or KCR alone [16]. Anderson et al. [1] found that explanatory feedback reduced repeat errors (37%) compared to no feedback (60%). Within a web-based multimedia module, students who received corrective feedback had higher achievement and also perceived less cognitive load [6]. The learning by doing methodology that incorporates formative practice with immediate elaborative feedback was foundational to the courseware developed at Carnegie Mellon's Open Learning Initiative, and this method was found to increase learning gains in less time than traditional course materials [12].
However, for all the guidance this research on feedback provides, a recent systematic review of feedback in higher education (HE) by Morris et al. [15] concludes that while "the evidence from our review provides support for the use of formative assessment and feedback for promoting attainment in HE ... to provide a stronger evidence base, our review points to the need for a much more systematic and scaled approach to examining these vital areas of teaching and learning within HE" (p. 20). Morris et al. note that the need for stronger empirical evidence for feedback is impacted by the small scale of most included studies (126 of 188), a lack of theoretical foundations, and laboratory studies that have "low ecological validity" for faculty in HE.
The present study offers an investigation of feedback that addresses several of the needs identified by Morris et al. This analysis includes a large-scale dataset gathered from a digital learning environment in which students answer formative practice questions while reading their e-textbook, resulting in data from 6,647 students across 959 textbooks of varying subject domains. These data are gathered from students doing practice in a variety of natural learning contexts, showing how students behave in their "real world" environments rather than in a laboratory setting. An advantage of digital learning environments is the large volume of microlevel data generated, which can then be used for analysis [4], such as this feedback study.
A key component that allowed for the scale of this feedback study is that both the formative questions and the feedback were generated using artificial intelligence. In order to scale the learning by doing method known to be highly effective for student learning [9] [10] [23], an automatic question generation (AQG) system was developed to create questions from textbook content. Using AQG is necessary to achieve large scale, as question creation is a labor-intensive process that requires both subject matter and item writing expertise, and the volume of formative practice items needed rapidly becomes prohibitive to develop in both time and cost. Research has found automatically generated (AG) questions perform as well as human-authored questions on key performance metrics such as engagement, difficulty, persistence, and discrimination [22] [8]. More than two million AG questions have been added to more than eight thousand online textbooks in the VitalSource Bookshelf e-reader platform as a free study feature called CoachMe, available to millions of students [20]. This practice feature contains several types of AG questions, including fill-in-the-blank (FITB), matching, multiple choice, and free response; the AG FITB questions are the focus of the present study. As shown in Figure 1, the questions open in a panel next to the textbook content, allowing students to refer back to the content if needed while they answer.
Feedback is also generated by this AQG system and appears immediately after the student answers the question. If a student answers a question correctly, they receive simple correct outcome feedback. For FITB questions, if a student answers a question incorrectly, there are three possible feedback options that they could receive, as shown in Figure 2. Outcome feedback is the minimum available option, informing students of their incorrect response. Context feedback is an extended selection of the textbook passage the question stem came from, providing students with more contextual information to support their next attempt. Common answer feedback is a sentence selected from nearby textbook content with the same key term missing, in order to give students another example and help scaffold their next attempt. As formative practice, students can answer questions as many times as they like. After submitting their first response, students have several options for their next action. Selecting the "Retry" button clears the question back to its original state and allows the student to submit a new answer. Selecting the "Reveal Answer" button shows the student the correct response, after which they can select "Retry" to input the correct answer themselves, if they so choose. Once an initial response has been submitted (whether correct or incorrect), students can also submit feedback about the question itself with a rating prompt. Each of these possible interactions is tracked by the platform and can reveal interesting insights into student interaction patterns [21]. This same type of interaction data is used in this investigation to see what students do after receiving feedback.
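To make the structure of this microlevel interaction data concrete, the sketch below shows what one tracked event in a student-question session might look like. The field names and types are hypothetical illustrations, not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuestionEvent:
    """One tracked interaction in a student-question session (hypothetical schema)."""
    student_id: str
    question_id: str
    action: str               # e.g., "answer", "retry", "reveal_answer", "rate"
    correct: Optional[bool]   # populated only for "answer" actions
    timestamp: float          # used to order a session's actions chronologically
```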
In a recent systematic review of AQG systems, Kurdi et al. [11] identified 92 papers evaluating AG questions, with a range of generation methods and intended uses. However, few systems were developed for formative practice, and only one study mentioned feedback. No studies were identified that discussed the performance of AG feedback specifically. While research on automatic question generation has been increasing in recent years, there is a need to study AG feedback as part of these AQG systems.
One goal of this study is to provide one of the first robust evaluations of AG feedback known to date. A second, broader goal is to expand the scale of empirical research on feedback and its effect on student learning behaviors in higher education settings, as called for by Morris et al. [15].

METHODS

Automatic Question Generation
Consistent with the evaluation study guidelines proposed by Kurdi et al. [11], we provide a concise overview of the essential features of the AQG methodology. The questions in this study are FITB cloze questions created from important sentences in the textbook content. The purpose of AQG is to create questions that are used as formative practice as students read the textbook. Although the AQG approach is versatile and applicable to a broad range of subject domains, it is unsuitable for certain areas such as mathematics and language learning. The input corpus is the textbook utilized by students. Textual analysis is performed with the spaCy library [5] using the CPU-optimized large model (en_core_web_lg). Generation employs both syntactic and semantic levels of understanding. This information is used for two main tasks: identifying the sentences that will be transformed into FITB questions and choosing appropriate words within those sentences to serve as the answer blanks. Syntactic information, such as part-of-speech tagging and dependency parsing, is used in both sentence and answer selection. Semantic information is also used in detecting important content. An expert-developed, rule-based approach is used for the transformation procedure.
To identify important sentences, the textbook corpus is divided into sections of approximately 1,500 words each. This segmentation is determined by key textbook features like chapters and substantial headings, which are further subdivided when they exceed 1,500 words. Each section's sentences are then ranked using the TextRank algorithm [13]; those with higher rankings are employed for AQG. TextRank uses vector embeddings to compute sentence similarities, with the results depending on the specifics of the embedding process. A word2vec-based model [14] is used in spaCy, which creates embeddings by averaging the vectors of the text's constituent tokens. Before embedding, our AQG system discards stop words and tokens with no alphabetic characters (e.g., punctuation, numbers). In addition, sentences that are very short (under 5 words) or very long (over 40 words) are excluded, as these are less likely to be suitable for questions. The qualifying sentences in each textbook corpus section are then evaluated using TextRank.
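As an illustration of this ranking step, the sketch below implements TextRank-style sentence ranking with spaCy embeddings and networkx's PageRank. It follows the filtering rules described above, but the function names and graph-construction details are our own simplification under stated assumptions, not the production system's code.

```python
# Minimal sketch of TextRank sentence ranking over one textbook section.
# Assumes: pip install spacy networkx numpy
#          python -m spacy download en_core_web_lg
import itertools

import networkx as nx
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

def sentence_vector(sent):
    """Average word2vec token vectors, skipping stop words and non-alphabetic tokens."""
    vectors = [t.vector for t in sent if t.is_alpha and not t.is_stop and t.has_vector]
    return np.mean(vectors, axis=0) if vectors else None

def rank_sentences(section_text, min_len=5, max_len=40):
    """Rank a section's sentences with TextRank; returns (score, sentence) pairs."""
    doc = nlp(section_text)
    # Keep only sentences in the length range used for question generation.
    sents = [s for s in doc.sents if min_len <= len(s) <= max_len]
    vecs = [sentence_vector(s) for s in sents]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sents)))
    for i, j in itertools.combinations(range(len(sents)), 2):
        if vecs[i] is not None and vecs[j] is not None:
            # Cosine similarity between sentence embeddings as the edge weight.
            sim = float(np.dot(vecs[i], vecs[j]) /
                        (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]) + 1e-12))
            if sim > 0:
                graph.add_edge(i, j, weight=sim)
    scores = nx.pagerank(graph, weight="weight")
    return sorted(((scores[i], sents[i].text) for i in scores), reverse=True)
```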
The other major step in generating cloze questions is choosing the word in each sentence for the answer blank. Our system takes into account a variety of factors when selecting answer words, such as corpus frequency distribution and presence in the textbook's glossary. The most significant of these, however, is part of speech: only nouns and adjectives are considered as answer candidates. Analysis of data from natural learning contexts has shown that students tend to rate these questions more favorably than those with other parts of speech as answers [7].
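A minimal sketch of this part-of-speech filter is shown below, assuming spaCy tokens as input. The glossary preference is reduced to a simple boolean and the corpus-frequency weighting is omitted, so this is an illustration of the candidate filter rather than the production scoring.

```python
def answer_candidates(sentence, glossary_terms=frozenset()):
    """Collect noun and adjective tokens from a spaCy sentence as candidate blanks.

    Glossary membership is shown as a boolean preference (an assumption);
    corpus-frequency weighting is omitted for brevity.
    """
    candidates = []
    for token in sentence:
        if token.pos_ in ("NOUN", "ADJ") and token.is_alpha:
            candidates.append((token.lemma_.lower() in glossary_terms, token))
    # Prefer glossary terms; Python's stable sort keeps sentence order on ties.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [token for _, token in candidates]
```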

Automatic Feedback Generation
Feedback generation is carried out following question generation, likewise using the textbook corpus as the source. Context feedback is generated as follows. When the paragraph containing the question's sentence has at least three sentences (i.e., the question sentence and at least two more), the sentences immediately preceding and following the question sentence in the paragraph are used for context feedback when possible (as illustrated in the second example in Figure 2). When the question sentence is the first or the last sentence in its paragraph, the following or preceding two sentences in the paragraph are used, respectively. If the paragraph does not contain at least three sentences, the same procedure is followed, expanding the scope to the adjoining paragraphs within the same textbook corpus section.
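This selection logic can be sketched as follows, assuming the paragraph has already been split into sentences; the fallback to adjoining paragraphs is noted but not implemented here.

```python
def context_feedback(paragraph_sents, q_idx):
    """Pick the sentences around the question sentence for context feedback.

    paragraph_sents: list of sentence strings in the paragraph.
    q_idx: index of the question's source sentence within the paragraph.
    Returns None when the paragraph is too short (the system then widens
    the search to adjoining paragraphs in the same section; not shown).
    """
    if len(paragraph_sents) < 3:
        return None  # fall back to adjoining paragraphs (not shown)
    if q_idx == 0:
        picks = paragraph_sents[1:3]               # first sentence: take the next two
    elif q_idx == len(paragraph_sents) - 1:
        picks = paragraph_sents[q_idx - 2:q_idx]   # last sentence: take the previous two
    else:
        picks = [paragraph_sents[q_idx - 1], paragraph_sents[q_idx + 1]]
    return " ".join(picks)
```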
For common answer feedback, the nearest occurrence of the question's answer word in a different sentence is found. Common answer feedback is created by making this same answer word into a blank in the additional sentence as well. Sentences in the same textbook corpus section as the question sentence are searched first, then sentences in the immediately preceding section. If the nearest occurrence is not unique, the occurrence preceding the question sentence is taken. If no other occurrence of the answer word can be found in the current or preceding section, common answer feedback is not generated. Lemmatization is used so that, e.g., singular and plural forms are treated as equivalent. For example, in the question shown in Figure 1 the correct answer is "protons." In this case, "protons" occurred in the common answer feedback sentence as well, but "proton" would also have been considered acceptable. For both context and common answer feedback sentences, the same selection criteria described above for question generation must be satisfied.
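A simplified version of this search, using spaCy lemmatization and restricted to a single section, might look like the following; the preceding-section fallback and the sentence-quality checks are omitted.

```python
def common_answer_feedback(nlp, answer, sentences, q_idx):
    """Find the nearest other sentence containing the answer's lemma and blank it.

    sentences: the section's sentences in document order.
    q_idx: the question sentence's position in that list.
    """
    answer_lemma = nlp(answer)[0].lemma_.lower()
    # Order candidates by distance from the question sentence; on ties,
    # the smaller index wins, i.e., the occurrence preceding the question.
    order = sorted((i for i in range(len(sentences)) if i != q_idx),
                   key=lambda i: (abs(i - q_idx), i))
    for i in order:
        doc = nlp(sentences[i])
        for token in doc:
            if token.lemma_.lower() == answer_lemma:
                # Blank out the matched word (e.g., "protons" or "proton").
                return (doc.text[:token.idx] + "_____" +
                        doc.text[token.idx + len(token.text):])
    return None  # no occurrence found: common answer feedback is not generated
```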
Outcome feedback (i.e., simple KR feedback, the first example in Figure 2) can always be generated. In particular, outcome feedback can be used in cases where neither of the two AG feedback types could be generated.

Study Design
Beginning in January 2023, the feedback type was randomly assigned at the question level in textbooks built with the practice feature, with the intention of creating equivalent question groups with respect to feedback type. For each question, feedback was randomly selected from the types that could be generated for that question (this resulted in a higher proportion of questions with outcome feedback, since it is not always possible to generate the other two types). The dataset for analysis was constructed from all student-question interaction data on these randomly assigned feedback conditions from January 26, 2023 through September 27, 2023, using the following selection criteria. First, only students who had incorrectly answered at least one question for each feedback type, and thus had received all three types of incorrect answer feedback, were included. This was an additional way of balancing the feedback conditions in the dataset, since all three conditions then consisted of the same set of students. Furthermore, as only questions that were answered incorrectly can give insight into the performance of the feedback, only questions that were incorrectly answered by at least one student were included. The data were grouped into student-question sessions, consisting of all actions of an individual student on an individual question, ordered chronologically. Sessions in which more than ten minutes elapsed between the student's initial incorrect answer and the student's next action (if any) were removed to account for the possibility of the student leaving the textbook between the first and second actions, potentially affecting the impact of feedback; this accounted for less than 0.5% of sessions. This resulted in a dataset for analysis containing 6,647 students, 35,006 unique questions from 959 textbooks, and 144,225 student-question sessions.
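The assignment and session-filtering steps can be illustrated with a short sketch; the question representation, type names, and timestamp format here are hypothetical.

```python
import random

FEEDBACK_TYPES = ("outcome", "context", "common_answer")
TEN_MINUTES = 600  # seconds

def assign_feedback(question: dict) -> str:
    """Randomly pick among the feedback types generated for this question.

    `question` maps each type name to its generated text or None. Outcome
    feedback is always available, which is why that condition ends up larger.
    """
    available = [ft for ft in FEEDBACK_TYPES if question.get(ft) is not None]
    return random.choice(available)

def keep_session(action_timestamps: list) -> bool:
    """Apply the ten-minute filter: drop sessions where more than ten minutes
    elapsed between the initial incorrect answer and the next action."""
    if len(action_timestamps) < 2:
        return True
    return action_timestamps[1] - action_timestamps[0] <= TEN_MINUTES
```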
To assess the extent to which the questions in the feedback conditions were equivalent, the mean question difficulty for the first answer attempt was calculated for each condition, shown in Table 1. Note that, as first attempts, these mean scores are not related to feedback performance; rather, the idea is that if the mean scores on the first attempt in each condition are equivalent, then it is more plausible that differences observed across feedback conditions can be attributed to the feedback type rather than to differences in the questions themselves. Relative to the outcome feedback condition, the context feedback condition had a mean score 2.5% lower on the first attempt, while the common answer condition had a mean score 0.9% higher. Though a two-proportion two-tailed z test showed that both differences are statistically significant (p ≪ .001 and p = .0043, respectively), it will be seen in the Results section that these differences are smaller than the differences observed that are relevant to feedback type, such as mean score on the second answer attempt.
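For reference, the two-proportion z test used here can be computed as follows. This is a standard textbook formulation with a pooled proportion estimate; the inputs would be the per-condition counts of correct first attempts, which are not reproduced here.

```python
from math import erf, sqrt

def two_prop_z(successes1, n1, successes2, n2):
    """Return (z, two-tailed p) for H0: p1 == p2, using the pooled estimate."""
    p1, p2 = successes1 / n1, successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p from the standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```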

RESULTS
As seen in Figure 2, when students answer a question incorrectly, there are multiple options for their next action. Besides simply abandoning the question, which students rarely do [21], students could retry the question or reveal the correct answer. Therefore, the first investigation is to analyze students' next action for each feedback type. Table 2 shows the distribution of actions immediately following an initial incorrect answer for each feedback type. A few interesting trends stand out. While revealing the correct answer was the most frequent next action, the percentage of students who revealed the answer was lowest for common answer feedback and highest for outcome feedback, a nine-point difference.
The differences in answer reveal rate relative to outcome feedback were statistically significant for both AG feedback types with p ≪ .001. Common answer feedback also had the highest percentage of correct answers as the next action, while outcome feedback had the lowest correct answer percentage, more than six points less. Incorrect answers as the second action were within a one-point spread across all feedback types, and abandoning the question was a small, similar percentage across all types.
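The distribution in Table 2 can be reproduced from session-level data with a short pandas sketch; the frame and column names below are hypothetical stand-ins for the platform's interaction records.

```python
import pandas as pd

def next_action_distribution(sessions: pd.DataFrame) -> pd.DataFrame:
    """Percentage of each next action within each feedback type.

    Assumes columns: feedback_type, and next_action in
    {"retry_correct", "retry_incorrect", "reveal", "abandon"}.
    """
    return pd.crosstab(sessions["feedback_type"], sessions["next_action"],
                       normalize="index") * 100
```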
Considering the benefits of feedback known in the literature, the next inquiry is: how do students do on their second attempt for each feedback type? To examine this, we narrowed the dataset to the second attempt after an incorrect first attempt for students who did not reveal the answer. The results in Table 3 show a difference in second attempt correctness across feedback types. Common answer feedback has the highest percentage of correct second attempts (53.8%), with context feedback several points below (49.1%) and outcome feedback the lowest (42.7%). The differences between each AG feedback type and outcome feedback are likewise significant with p ≪ .001. It is also interesting to compare these to the differences in first attempt correctness (Table 1), which reveals a relative shift of approximately ten points in favor of the AG feedback types. This trend is consistent with what is reported in the literature: elaborative feedback is more beneficial for students than outcome feedback alone.
To expand on this finding, timestamp data were used to determine how long students spent between receiving the feedback and making a second attempt. Table 4 shows the mean and quartiles for time elapsed between the first and second attempts for each feedback type (quartiles are reported to the nearest second, as this was the resolution of the system's timestamp data). This reveals another interesting pattern. For both common answer and context feedback, students spend longer between receiving the feedback and their next attempt. Although students are not guaranteed to read the feedback presented, this logically corresponds to the additional reading required by the longer feedback statements. While outcome feedback had the lowest elapsed times, as expected, it is interesting that common answer feedback elapsed times were not much higher, especially considering the substantially better performance observed on the second attempt.

Another interesting question is: what is the relationship between the second attempt mean score and the amount of time students took between receiving feedback and making their second attempt? Several hypotheses could be proposed. Students who spend a very short period of time might not be considering the feedback, yielding lower second attempt mean scores with less observed difference between feedback conditions. Inversely, students who spend longer might have higher mean scores, as they may be reflecting on or researching the correct answer. Figure 3 plots the second attempt mean score as a function of time to answer after receiving feedback, partitioning the time to answer into quartiles for each feedback type.

These results show a surprising trend. Students who take the shortest amount of time to answer show the greatest differences in second attempt accuracy by feedback type. In the lowest answer time quartile, outcome feedback has the lowest mean score, 31.6%, at a median time of 7 seconds. Context feedback has a higher mean score, 39.0%, in the shortest answer time quartile; the median time to answer in the lowest quartile was longer for context feedback than outcome feedback, at 11 seconds. Common answer feedback, however, has a much higher mean second attempt score of 50.9%, with a median time to answer of 10 seconds. This is a surprising finding for the shortest answer time group, as it was expected that shorter answer times could correspond to students answering with less consideration of the feedback, leading to less difference among feedback conditions. One possible explanation is that the common answer feedback efficiently triggers students to identify the correct answer term, resulting in a short duration with a higher success rate than the other feedback types. As also seen in Figure 3, second attempt mean scores increase with time spent in all conditions, and the differences in mean score between conditions become smaller. Notably, though, the mean scores after common answer feedback are higher than for outcome feedback in all quartiles, even though the answer times are reasonably similar, in contrast to the longer times observed after context feedback.
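The quartile analysis behind Figure 3 can be sketched in pandas as follows, computing quartiles within each feedback type since the answer-time distributions differ across conditions; the frame and column names are hypothetical.

```python
import pandas as pd

def score_by_time_quartile(attempts: pd.DataFrame) -> pd.DataFrame:
    """Mean second-attempt score (%) per answer-time quartile, by feedback type.

    Assumes columns: feedback_type, seconds_to_second_attempt, second_correct (0/1).
    """
    parts = []
    for ftype, grp in attempts.groupby("feedback_type"):
        # Quartile boundaries are computed per feedback type.
        quartile = pd.qcut(grp["seconds_to_second_attempt"], 4,
                           labels=["Q1", "Q2", "Q3", "Q4"])
        parts.append(grp.assign(time_quartile=quartile))
    combined = pd.concat(parts)
    return (combined.groupby(["feedback_type", "time_quartile"], observed=True)
                    ["second_correct"].mean().unstack("time_quartile") * 100)
```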

DISCUSSION
The availability of microlevel data enabled an analysis of student actions that gave empirical insight into the impact of feedback types, the kind of evidence called for by Morris et al. [15]. Results from this analysis suggest that the type of AG feedback provided to students did have an impact on both their next action and the correct response rate on the second attempt. Of the three feedback types, outcome feedback had the least beneficial effects: it had the highest rate of revealing the answer and the lowest second attempt accuracy rate. These results are consistent with what is generally known about outcome feedback in the literature, and this analysis contributes firm empirical evidence of the disadvantages of outcome feedback compared to other types. Context feedback fell in the middle of the range for both the reveal answer action and the correct second attempt rate. This result is beneficial in that it should always be possible to generate context feedback as an option that provides an advantage over outcome feedback. Finally, common answer feedback had the lowest reveal answer rate of the three while also having the highest rate of correct second attempts. This type of feedback was intended to provide students with a scaffold to reach the correct answer, so it was encouraging to find it had the best second attempt accuracy of the three types. Prior to the randomization and investigation of the AG feedback on formative practice, it was uncertain whether there would be empirical evidence that feedback type impacted student behavior. In fact, the microlevel data revealed that feedback types did have an effect on student actions and outcomes.
Another interesting finding from this investigation is the relationship between time to answer and second attempt accuracy. At shorter durations between receiving the feedback and answering the question again, there was a large difference in mean score according to feedback type. Outcome feedback had the lowest mean score, while common answer feedback had the highest. However, as the time to answer increased, all three feedback types became closer in mean score. This may indicate that for students who do not have an immediate second attempt in mind, the type of feedback matters less to their answering strategy; longer times to answer may reflect students choosing to reread the textbook content, for example, or other strategies in addition to, or instead of, considering the feedback provided.
The results of this analysis have meaningful implications for the type of feedback delivered to students, both in this learning context and at large. The difference in student behaviors alone is reason to change the feedback types delivered to students: prioritize common answer feedback, then context feedback, and avoid using outcome feedback alone. These results are also meaningful to the field of AQG research, as researchers should consider the results of this AG feedback analysis when developing an AQG system for student use in natural learning contexts. As the first known large-scale study of AG feedback performance, these results provide an initial benchmark for future AG feedback work. It was shown here that AG feedback can improve student performance on formative practice questions; a logical next step is to investigate the impact of AG feedback during practice on summative learning outcomes. Future research should also consider additional directions for AG feedback development. Given the superior performance of common answer feedback over outcome feedback, could other types of AG feedback provide even more learning support for students? The advancement of large language models may lead to new opportunities to support error-sensitive, personalized feedback based on student answers. Continued advancements in artificial intelligence and automatic question generation will lead to new avenues of feedback generation that should be studied in a similar manner to understand their impact on the learning experience.

Figure 1: An example FITB formative practice question in a chemistry textbook.

Figure 2: Examples of outcome, context, and common answer feedback for FITB questions.

Figure 3: Second attempt mean score by answer time quartile for each feedback type.

Table 1: Question data and first attempt mean score by feedback type.

Table 2: Next student action after incorrect answer by feedback type.

Table 3: Second attempt mean score by feedback type.

Table 4: Time elapsed in seconds between first and second answer attempts by feedback type.