Improving Student Learning with Hybrid Human-AI Tutoring: A Three-Study Quasi-Experimental Investigation

Artificial intelligence (AI) applications to support human tutoring have the potential to significantly improve learning outcomes, but engagement issues persist, especially among students from low-income backgrounds. We introduce an AI-assisted tutoring model that combines human and AI tutoring and hypothesize this synergy will have positive impacts on learning processes. To investigate this hypothesis, we conduct a three-study quasi-experiment across three urban and low-income middle schools: 1) 125 students in a Pennsylvania school; 2) 385 students (50% Latinx) in a California school; and 3) 75 students (100% Black) in a Pennsylvania charter school, all implementing analogous tutoring models. We compare learning analytics of students engaged in human-AI tutoring with those of students using math software only. We find human-AI tutoring has positive effects, particularly on students' proficiency and usage, with evidence suggesting lower achieving students may benefit more than higher achieving students. We illustrate the use of quasi-experimental methods adapted to the particulars of different schools and data-availability contexts so as to achieve the rapid data-driven iteration needed to guide an inspired creation into effective innovation. Future work focuses on improving the tutor dashboard and optimizing tutor-student ratios, while maintaining costs of approximately $700 per student annually.


INTRODUCTION
The main obstacle to improving math performance among middle school students lies in ensuring fair and equal access to effective learning opportunities [7,20]. While economically disadvantaged and historically underserved students have the potential to excel when given the same resources as their peers (cf. [18]), they often face learning gaps due to limited access [7]. Individualized instruction via tutoring can have consistent and significant positive impacts on student achievement and learning [12,20], particularly when deployed among middle and high school grades in math and during the school day, as opposed to after school [25], and when delivered by trained tutors attending to students' socio-motivational needs and relationship building [6,19]. However, low-income students lack access to in-school tutoring programs and well-trained tutors, evidenced by the 16 million low-income children on the waitlist for high-quality afterschool programs [4]. Further exacerbating the opportunity gap among students from low-income families is the high cost of $2,500 or more per student for private, individual tutoring [19]; most families simply cannot afford it. Relying solely on human tutoring cannot adequately address current educational needs, which have intensified in the face of pandemic-era learning losses [32]. The gravity of the situation is evidenced by declines in middle school math scores reported in 2022, the steepest since the inception of state assessments [31]. The lack of accessibility faced by students arises from a range of factors, such as inadequate access to basic inputs like digital devices and internet connectivity. Additionally, there are shortcomings in inclusivity: the diverse needs of students, including English language learners and those with disabilities, are inadequately addressed [32]. The challenges facing math learning related to access, equity, fairness, and inclusion have fostered collaborative and focused efforts on AI-assisted human-technology ecosystems that increase learning opportunities for all students [7,30].
High-dosage human tutoring has been shown to double student learning [25,26]. However, providing intensive, individualized human tutoring to the millions of students deemed in need of support is costly and requires more human capital than society is likely to mobilize [20]; hence the emergence of AI as a complement to human tutoring. Advances in AI and learning analytics have created opportunities to increase tutoring efficiency and lower tutor-to-student ratios [2,7]. AI-assisted (also known as AI-augmented, AI-in-the-loop, or AI-supported) human tutoring works by having the AI provide the adaptive math instruction, which is heavily researched and well known to be effective [25], with humans providing as-needed socio-motivational intervention and relationship-building support [7]. An example in practice is a real-time AI-driven dashboard that supports effective use of tutors' time during tutoring and differentiates student support by allocating tutor time to the students who need it most. There is evidence for this approach among teachers, demonstrating that students learn more when teachers and AI work in synergy [15]. This study among teachers is the only one known of its kind and raises the question: what about human tutors? In [7], among 70 participating students from predominantly low-income backgrounds in a human-AI synergistic tutoring program, an observed doubling of learning was demonstrated compared to a matched control group. However, this prior study took place within an after-school program that spanned the COVID pandemic. The current work focuses on this novel AI-informed approach to tutoring and occurs during the school day.
Informally referred to as hybrid human-AI tutoring and described as human tutors and AI-based tutoring software working in tandem, this method provides just-in-time assistance to students and is scalable, allowing a single tutor to work with more students. Hybrid human-AI tutoring shows promise as a means of doubling student learning at a fraction of the cost of high-dosage human tutoring [2,7]. However, its impact on student learning processes and on the intermediate outcomes that lead to improved achievement remains unknown. Fig. 1 provides a conceptual pathway connecting participation in human-AI tutoring to higher achievement, shown in the following series of steps: 1) student participation in human-AI tutoring increases exposure to learning opportunities; 2) increases in learning opportunities yield more successful lesson (i.e., module, unit) completions; and 3) more lesson completions yield higher achievement. Research supports that students of all ability levels, ranging from those several grade levels behind to accelerated learners, can benefit from tutoring [18]. However, there is currently a lack of research on how hybrid human-AI tutoring benefits students across varying baseline abilities. The present tutoring approach is specifically designed to attend to students from low-income backgrounds, who constitute over 90 percent of students in the study.
One key challenge is that the assessment of effectiveness remains a protracted endeavor. Randomized controlled trials (RCTs) are universally accepted as the ideal methodology for assessing intervention efficacy and impact [23]. However, conducting RCTs necessitates a substantial investment of time, human capital, and resources, often spanning multiple years to yield conclusive results [13]. Furthermore, the implementation of RCTs raises ethical quandaries, requiring certain students to be arbitrarily denied beneficial interventions that, under ordinary circumstances, are most likely superior to no intervention at all. Quasi-experimental studies may be underutilized as a means of evaluating early program effectiveness, particularly when RCTs are expensive and take a long time to complete. In addition, RCTs alone are not a "holy grail" of effectiveness research, as they often provide insight into "what works" but not "why it works," with other methods often needed to pinpoint the causal mechanisms [10, p. 2]. As AI-supported tutoring programs are a new approach, there simply is not enough published empirical research leveraging RCTs to gauge early program impact on student learning. While rapid quasi-experimental studies (e.g., pre-post with non-equivalent control groups) offer a more agile and resource-efficient means of assessing intervention outcomes, they are not without limitations, such as the lack of random assignment introducing potential biases and confounding factors that may impact internal validity [23]. Strategies to mitigate biases, such as matching techniques, can remove some potential bias, providing useful, rapid, and relatively low-cost analysis of impact [7,23].
Given this, we focus on enhancing the efficiency and rigor of experimental trials of the hybrid human-AI tutoring approach by conducting rapid-cycle quasi-experimental studies, which can streamline and expedite the evaluation of program effectiveness. The implementation of several rapid-cycle studies across different school sites and student populations can create multiple lines of evidence which combine to generate a more comprehensive evaluation of efficacy. Our three-study quasi-experimental investigation aims to determine the initial impact of a hybrid human-AI tutoring program that takes place during the regular school day and focuses on low-income students. In providing hybrid human-AI tutoring support to students who need it most within three ethnically and geographically diverse U.S. middle schools, we strive to answer the following research questions:
RQ 1: What differences exist in learning processes and intermediate outcomes between students participating in hybrid human-AI tutoring and students engaging solely with math software?
RQ 2: Does the impact of hybrid human-AI tutoring vary by students' baseline ability? Specifically, if there are positive effects on student learning processes, is the effect equally distributed across students of varying baseline abilities?

Human-delivered High-Impact Tutoring
High-impact tutoring is a targeted and intensive form of academic support designed to provide personalized assistance to students who may be struggling academically or require additional help beyond the capabilities of regular classroom instruction [1]. It is characterized by several key features, including: provision during the school day three or more times per week; delivery by well-trained tutors; usage of student data to guide instruction; and limitation of teacher-to-student ratios to 1-to-3 or lower [1]. Across eight meta-analyses (see, e.g., [25]) encompassing over 150 studies, tutoring has been shown to substantially improve student learning, especially among underrepresented and marginalized students, though the effect size can vary with different program characteristics. In an RCT consisting of 2,600 9th- and 10th-grade students attending Chicago Public Schools and using Saga Education's high-dosage tutoring model, researchers found tutoring increased math test scores by 0.16 standard deviations (SD), with later replications reporting 0.37 SD gains [12].
Not surprisingly, given the substantial investment in human capital and the time skilled tutors spend with students, high-dosage tutoring is costly, with Saga's model averaging $3,500 to $4,300 per student annually [12].

Impact of AI-assisted Math Software
Given the high cost of high-impact tutoring delivered exclusively by humans, researchers have long explored the less expensive alternative of creating intelligent tutoring systems (ITS) and related math software to personalize math instruction and boost learning for K-12 students. Specifically, AI-based technologies have been shown to support math software systems across various dimensions, including creating rich environments (e.g., multimedia applications), fostering the individual functionality of multiple components (e.g., learner model, domain model, pedagogical model) within a modular architecture and their communication, and delivering adaptive instruction grounded in educational and psychological principles [8]. The overarching impact of AI on mathematical learning involves almost all scales, ranging from "physical intelligence" to "digital intelligence." A large number of review studies and meta-analyses suggest large learning gains can be achieved when students use these technologies [11], but there can be substantial variability in usage of educational technology, and such variability in usage can be correlated with differing learning outcomes. This view receives strong support from important new evidence: 1.3 million observations of student usage of learning software (including intelligent tutoring systems) across multiple grade levels and subjects point to remarkably similar rates of student learning as a function of practice opportunities, despite large differences in the students' family income levels, ethnicity, and prior measured academic achievement. These results suggest that educational achievement gaps are driven by differences in learning opportunities [18]. Intelligent tutors show great promise in terms of providing appropriately calibrated learning opportunities to large numbers of students in a cost-effective way, but for this promise to be realized students must actually use the technology. A cascade of enabling conditions, including physical and emotional health, access to technology, a growth mindset, and adequate motivation, must be in place; otherwise students will not use the technology, even when it is low-cost and increasingly widely available. We seek to expand access to these promising technologies by using a hybrid human-AI tutoring model, in which we can increase engagement and use of AI math software by the students who need it most.

Defining Hybrid Human-AI Tutoring
While intelligent tutoring systems can support tutoring at scale, they are not equipped to provide the relational and motivational support some students may need to engage with these systems [7]. Hybrid human-AI tutoring, defined above as human and AI tutors working in tandem, aims to leverage the power of AI-driven adaptive math software to provide personalized instruction for students and to allow human teachers and tutors to focus more of their effort on the relationship building and socio-emotional support many students need to successfully engage with the AI [14,16]. Moreover, humans can provide additional content support beyond the capability of the AI software system, such as supporting students who are unproductively struggling [17]. In recent research and educational practice, human-AI systems have taken on a range of forms. [7] and [24] have examined schedule-driven systems in which students are connected with a human tutor through pre-scheduled lessons, often occurring outside regular math classes and featuring a long period of repeated interactions between a student and the same human tutor. Schedule-driven tutoring can be effective, but the costs are high, limiting access and scalability. At the other extreme are student-driven systems, in which the students themselves determine whether they need help and what sort, and initiate engagement. This approach can be low-cost, but many students who may benefit will not seek support on their own, and the efficacy of these systems for the least advantaged students remains unproven [3].
Tutor-driven and dashboard-guided systems represent a middle ground. Students spend part of their regular math class time engaged in personalized practice using math software, enhanced with algorithms that can identify student learning progress and struggle that is likely to be unproductive. These signals are displayed on a dashboard to inform teachers or tutors, who still determine which students to support. Alternatively, the dashboard could suggest, in real time, that the classroom teacher and/or virtual tutors support specific struggling students. Either of these systems is likely to be less costly than schedule-driven systems, but more likely than student-facilitated systems to support the learning of the least advantaged and least confident math learners. Holstein et al. [17] showed how this kind of system for classroom teachers tripled students' learning gains in a short-term research study.
The system measurably redirected the teacher's attention towards the students who had lower initial knowledge and away from the students most likely to seek assistance (who are often the more confident and adept math learners). Fig. 2 distinguishes between (a) a scenario in which the initiative taken by more confident students deflects teacher/tutor attention away from the students who need attention the most; (b) a scenario in which a "round robin" approach tries to ensure an equal distribution of teacher/tutor attention; and (c) our target scenario, in which an intelligent dashboard directs human teacher/tutor attention to the students with the lowest levels of initial math preparation, even when these students are not actively seeking assistance. We seek to study tutor- and dashboard-driven human-AI tutoring systems in which off-site tutors can initiate remote sessions with students through teleconferencing software during the regular school day. The availability of these off-site tutors could, by itself, improve student outcomes even if dashboards neither provided much useful information to the tutors on student learning challenges nor directed them to support particular students. Our research design therefore includes human-AI tutoring interventions with and without "dashboards," in order to help us better understand the impact generated by use of a dashboard that provides actionable information and direction.
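The three attention-allocation scenarios distinguished in Fig. 2 can be illustrated with a minimal sketch. This is not the logic of any dashboard used in the study; the `Student` fields, the `pretest` scale, and the function names are hypothetical and chosen only to make the contrast concrete.

```python
from dataclasses import dataclass


@dataclass
class Student:
    name: str
    pretest: float        # hypothetical prior-knowledge score; lower means greater need
    requested_help: bool  # True if the student actively asked for assistance


def help_seekers_first(roster):
    """Scenario (a): students who ask for help absorb attention first."""
    # sorted() is stable, so roster order is kept within each group.
    return [s.name for s in sorted(roster, key=lambda s: not s.requested_help)]


def round_robin(roster):
    """Scenario (b): visit every student in a fixed roster order."""
    return [s.name for s in roster]


def need_first(roster):
    """Scenario (c): lowest initial preparation first, whether or not help was requested."""
    return [s.name for s in sorted(roster, key=lambda s: s.pretest)]


# Illustrative roster (made-up values, not study data).
roster = [
    Student("Ana", pretest=6.0, requested_help=False),
    Student("Ben", pretest=8.0, requested_help=True),
    Student("Cy", pretest=4.5, requested_help=False),
]
print(help_seekers_first(roster))  # ['Ben', 'Ana', 'Cy']
print(round_robin(roster))         # ['Ana', 'Ben', 'Cy']
print(need_first(roster))          # ['Cy', 'Ana', 'Ben']
```

In this toy roster the confident help-seeker is served first under (a), while (c) moves the student with the lowest preparation to the front even though that student never asked for help.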

METHODS
We piloted a hybrid human-AI tutoring model in three U.S. middle schools located in the Midwest, the East, and the West respectively, with students receiving the intervention one day per week during the school day. At the schools, denoted Sites 1 to 3, students interacted with the following math software, respectively: IXL, a comprehensive K-12 math curriculum; i-Ready, a K-8 learning program created by Curriculum Associates; and Carnegie Learning's MATHia (formerly Cognitive Tutor, [28]), an adaptive one-on-one math learning platform designed for grades 6 to 12. For Sites 1 and 2 (tutor-driven), weekly student learning process data was collected to determine students' individual needs, but tutors used a "round robin" approach, ensuring they provided support to every student. For Site 3 (dashboard-driven), real-time learning process and usage log data was collected using MATHia's LiveLab, a live tutor-facing dashboard that provides tutors with real-time learning process data, such as idle time and progress, to assist tutors with prioritizing and differentiating student support. Table 1 summarizes the implementation characteristics across sites. Throughout the school year, IXL-reported process data regarding skill usage and proficiency was collected weekly and made available to teachers. In addition, Renaissance Learning's Star diagnostic assessment was administered several times to assess students' baseline comprehension and estimated grade level performance.

Site 2 Method
Site 2 was located in a large, urban California school district (n=15,700 students), at one of its three middle schools. At the school where Site 2 was conducted, the 385 7th grade student participants were 90% low-income, 80% Black or Latinx, and 49% female. For each student, i-Ready learning process data recorded time on task and time spent completing a lesson, which were viewed as step 1 measures, and lessons passed, which were viewed as a step 2 measure quantifying higher achievement (see Fig. 1). Students engaged in i-Ready Personalized Instruction as a weekly math intervention, which took place one day per week for 50 minutes, and completed the i-Ready Diagnostic, a computer-delivered diagnostic assessment [9]. Once students completed the i-Ready Diagnostic, which was administered in fall 2022, a personalized lesson plan was developed based on each student's needs, though sometimes teachers also assigned specific lessons separate from the diagnostic-suggested lessons. Then, in the last three weeks of the school year (weeks of May 15th, May 22nd, and May 29th, 2023), students engaged in hybrid human-AI tutoring, using i-Ready and receiving motivational and cognitive support from a remote human tutor. The tutoring intervention took place with tutor-to-student ratios of 1:4. There were 25 active weeks where students engaged in i-Ready only (with an active week defined as 150 or more lesson completions across the population). We removed 13 other inactive weeks that involved sporadic or inconsistent usage resulting from shorter weeks, holidays, state testing interruptions, and other atypical events. Note that analyzing only the active weeks favors the outcomes of the non-tutored period, so our results would likely be even stronger had we included the inactive weeks. In weeks 40, 41, and 42, the final three weeks before the partial week 43 (removed as inactive), students were supported with an early prototype version of a hybrid human-AI tutoring treatment.
In this early rough approximation of our vision, remote tutors did not have a dashboard to guide them.Instead, the tutors were advised to visit students through Zoom in a round-robin cycle while students were using i-Ready.
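The active-week rule described above (150 or more lesson completions across the population) amounts to a simple threshold filter. The sketch below assumes a hypothetical mapping from week number to total lesson completions; the variable names and the sample totals are illustrative and are not study data.

```python
ACTIVE_WEEK_THRESHOLD = 150  # lesson completions across the population, per the text


def split_weeks(completions_by_week, threshold=ACTIVE_WEEK_THRESHOLD):
    """Return (active, inactive) week numbers given total completions per week."""
    active = sorted(w for w, n in completions_by_week.items() if n >= threshold)
    inactive = sorted(w for w, n in completions_by_week.items() if n < threshold)
    return active, inactive


# Made-up weekly totals for illustration only.
weekly_totals = {10: 420, 11: 149, 12: 150, 13: 35}
active, inactive = split_weeks(weekly_totals)
print(active)    # [10, 12]
print(inactive)  # [11, 13]
```

A week with exactly 150 completions counts as active because the rule is stated as "150 or more".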

Site 3 Method
Site 3 took place at a small, urban charter school in Pennsylvania (100% low income and Black), enrolling only males.
None of the 75 students in grades 6-8 were reported to have reached proficiency on state math assessments in the prior academic year (2021-2022), suggesting a low number of prior opportunities for math practice. For the first semester of the school year students engaged in a business-as-usual math curriculum, participating in daily math instruction delivered by their classroom teacher. Weekly MATHia practice is an integrated part of this curriculum. Students began engaging in the treatment of hybrid human-AI tutoring one day per week, in lieu of that day's standard math instruction, at the beginning of the second semester, January 5, 2023. This treatment concluded on May 31, 2023. Hybrid human-AI tutoring was implemented with students engaging in MATHia, while human tutors working remotely provided content-specific and relationship-building support to students. Site 3 was the only site in the current investigation where tutors had access to a real-time tutoring dashboard. MATHia's LiveLab provided tutors real-time information on student engagement and usage by indicating measures of productivity (i.e., workspaces completed) and activity (i.e., idle time).

Site 1 Results
IXL-reported learning process data on skill usage and proficiency were provided weekly: time spent, questions answered, and skills proficient. Referencing the conceptual pathway connecting learning process data from human-AI tutoring to changes in student achievement (see Fig. 1), we classify time spent and questions answered as step 1 measures and skills proficient as a step 2 measure. In addition, students completed Renaissance Learning's Star diagnostic assessment [21] up to nine times in the academic year, with the first four administrations occurring in September, December, February, and March. Descriptive statistics for the September Star diagnostic indicating mean grade level placement and standard deviation are as follows: Control, M = 7.06, SD = 0.133; Delayed Treatment, M = 7.04, SD = 0.070. Both IXL and Star data were collected for math and reading. Table 3 displays the site timeline and descriptive statistics of weekly IXL process data for Control and Delayed Treatment groups by time frame (i.e., fall, early spring, and late spring), demonstrating the interrupted time series design [23]. The aim was to determine whether there was a statistically significant difference between the groups on each of the learning process measures (dependent variables) in each time frame. Table 4 displays the results of the independent t-tests. Aligning with our hypothesis that hybrid human-AI tutoring has an effect on student learning, there is a statistically significant difference between the Control and Delayed Treatment groups in early spring and late spring, when the Delayed Treatment students were engaging in the EdTech+MathTeacher and EdTech+MathTeacher+Tutoring conditions, respectively. There were no statistically significant differences between the Control and Delayed Treatment groups in the fall on any learning process measure. Thus, the preliminary independent t-tests confirm statistically significant differences between the use of math software only (EdTech_Only), the addition of a math teacher (EdTech+MathTeacher), and the further addition of hybrid human-AI tutoring (EdTech+MathTeacher+Tutoring). Longitudinal plots displaying student learning process data for mean weekly time spent and the number of skills for which the student gained proficiency (skills proficient) are shown in Fig. 3a and 3b, respectively. We observe an increase in both measures in early spring for the Delayed Treatment with the addition of a MathTeacher and in late spring with Tutoring. To test whether Tutoring has a positive and significant effect relative to MathTeacher, or whether there was a decline, we conducted a mixed effects linear regression.
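For readers unfamiliar with the group comparisons summarized in Table 4, the following is a minimal sketch of an independent two-sample t statistic. It assumes the pooled-variance (Student) form; the text does not say whether equal variances were assumed, and the function name and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's two-sample t statistic and degrees of freedom (equal variances assumed)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled sample variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df


# Made-up weekly minutes for two small groups, for illustration only.
control = [1.0, 2.0, 3.0]
treatment = [2.0, 3.0, 4.0]
t_stat, df = pooled_t_statistic(control, treatment)
print(round(t_stat, 4), df)  # -1.2247 4
```

The statistic would then be compared against a t distribution with `df` degrees of freedom to obtain a p-value.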
As shown in Table 5, we found that hybrid human-AI tutoring had a positive and statistically significant effect on students' time spent practicing with IXL (β = 0.202, 95% CI: [0.057, 0.347], t(4235) = 2.734, p < .001). Further, the interaction of the pretest and hybrid human-AI tutoring on time spent, pretest:Tutoring, demonstrated a negative and statistically significant effect (β = -0.212, 95% CI: [-0.325, -0.096], t(4235) = 2.734, p < .001). This interaction indicates that the benefits of tutoring were higher for lower pretest students and decreased for higher pretest students. Perhaps not surprisingly, the addition of a math teacher in the classroom also had a statistically significant positive effect on time spent (see the MathTeacher row in Table 5), and there was a positive interaction with pretest, pretest:MathTeacher. This interaction indicates the MathTeacher impact on time was greater for higher pretest students than for lower pretest students, consistent with the left side in Fig. 2. By contrast, tutors tend to attend more to low-pretest students (or at least to all students uniformly in round robin), as indicated by the negative interaction term, pretest:Tutoring.
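To make the interaction concrete, the sketch below combines the two reported coefficients (0.202 for Tutoring and -0.212 for pretest:Tutoring) into a conditional effect of tutoring at a given pretest value. The scaling of the pretest variable is not stated in this section, so treating it as a centered, standardized score is an assumption, and the function name is hypothetical.

```python
B_TUTORING = 0.202       # reported main effect of Tutoring on time spent
B_INTERACTION = -0.212   # reported pretest:Tutoring interaction


def tutoring_effect(pretest):
    """Conditional effect of tutoring on time spent at a given pretest value.

    Assumes pretest is centered and standardized (an assumption, not stated in the text).
    """
    return B_TUTORING + B_INTERACTION * pretest


for z in (-1.0, 0.0, 1.0):
    print(f"pretest {z:+.1f}: effect {tutoring_effect(z):.3f}")
# pretest -1.0: effect 0.414
# pretest +0.0: effect 0.202
# pretest +1.0: effect -0.010

# Pretest value at which the estimated tutoring effect reaches zero.
print(f"{-B_TUTORING / B_INTERACTION:.3f}")  # 0.953
```

Under this assumed scaling, the estimated benefit roughly doubles for a student one unit below the mean pretest and vanishes for a student about one unit above it.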
A final significant effect of note (see the Late_spring row in Table 5) is that students' time spent dropped substantially in the late spring as is also apparent in the Control group lines in Fig. 3.
Effects of the same analysis (see Equation 1) for questions answered and skills proficient as the dependent variable were generally consistent with matching significant results for all but the Tutoring vs. MathTeacher comparison.
Referencing RQ1, we found statistically significant positive effects of the MathTeacher for all measures and an additional benefit of human-AI tutoring for time spent. Referencing RQ2, we found that hybrid human-AI tutoring benefited lower pretest students more than higher pretest students for all measures, including skills proficient. Conversely, we found that having a math teacher present while students use EdTech benefited higher pretest students more than lower pretest students for all measures.

Site 2 Results
In the 25 active weeks prior to engaging in the tutoring treatment, students took an average of 32 minutes to complete a lesson and spent 24 minutes on task each week. During treatment, students took 36 minutes, on average, to complete a lesson and spent 33 minutes on task each week. The increase in student participation due to hybrid human-AI tutoring (33 minutes, up from 24 minutes before treatment) approaches the i-Ready data-based recommendation of 45 minutes, even for these historically lower performing students [22].
We also analyzed whether tutoring support was being distributed to the students with the greatest need. Each student's individual need level was operationalized by averaging the grade level of all lessons assigned and completed (both passed and not passed), as determined by the i-Ready Diagnostic. Given these diagnostics were computed throughout the school year starting in the fall, the 7th graders at this site are essentially at grade level if they are entering with 6th grade level skills. Fig. 4 displays the average number of lessons passed per week according to students' individual need level, with those below grade level in the left three pairs of bars and those at or above grade level in the rightmost pair. We see that students who were below grade level passed more lessons per week during the last 3 weeks, when tutoring was implemented, than in the 25 weeks before tutoring, whereas the opposite was observed for students at or above grade level. As at Site 1, the Tutoring treatment occurred in late spring; recall that there we saw a substantial decline in Control students' engagement in late spring. We did not have such a control at Site 2, but if students there also experienced a late spring decline, the positive effects of tutoring would be greater than the differences shown in Fig. 4. While we cannot adjust for a possible late-spring decline, we can test for the statistical reliability of the differences shown in Fig. 4.
We again followed the general pattern of an interrupted time series analysis [23]. We fitted a mixed effects linear regression model (see Equation 2) with the number of lessons passed per week as the dependent variable and containing the fixed-effect predictors: average grade level across all lessons completed by a student (lesson_grade); and whether the week was before or during treatment (during_treatment). Individual student (studentID) is a random effect, accounting for the correlation between lessons passed by a student across weeks. The lesson_grade serves as a proxy of students' prior math knowledge expressed in grade level (e.g., a student with an average of 4.5 is estimated to have math knowledge comparable to a student halfway through 4th grade). Parameter estimates are shown in Table 6.
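One plausible written-out form of the model described in this paragraph is sketched below. It is a reconstruction from the prose, not the paper's own Equation 2: the subscripts (student i, week t), the coefficient symbols, and the inclusion of an interaction term are assumptions, the last of these inferred from the grade-level-by-treatment interaction reported alongside the Table 6 estimates.

```latex
\begin{aligned}
\text{lessons\_passed}_{it} ={}& \beta_0
  + \beta_1\,\text{lesson\_grade}_{i}
  + \beta_2\,\text{during\_treatment}_{t} \\
 &+ \beta_3\,\bigl(\text{lesson\_grade}_{i} \times \text{during\_treatment}_{t}\bigr)
  + u_{i} + \varepsilon_{it},
\qquad u_{i} \sim \mathcal{N}(0, \sigma_u^2),\;
       \varepsilon_{it} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

Here $u_i$ is the random intercept for studentID, which captures the correlation among a student's weekly observations.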
There was a positive and statistically significant main effect for the number of lessons students passed per week during the treatment, β = 0.244, CI: [0.106, 0.381], t(10300) = 3.47, p < 0.001. Students completed, on average, more lessons each week during the three weeks of the human-AI tutoring treatment. There was also a negative and statistically significant interaction between students' estimated grade level and tutoring treatment, demonstrating that the students most in need passed more lessons while participating in the tutoring treatment. While students in general benefited from tutoring (and effects may be even bigger if there is a general late spring decline), students at or above grade level did not benefit. The round-robin tutoring approach we used, given dashboard guidance was not yet available, may have led to unneeded interruptions that distracted already-engaged students from making progress in the software.

Site 3 Results
Our early attempts at remote tutoring started at Site 3 and, as a consequence of our early-stage efforts, the implementation evolved over the course of the Spring semester. Our analysis thus compares the impact of the early phase of implementation with that of the later phase. The intervention changed most substantially in two ways. First, time spent tutoring students increased from about 25 to 50 minutes per session; however, this also coincided with a shift in frequency of tutoring from once a week to once every other week such that available tutoring time over two weeks was the same. Second, the number of tutors available per class increased with tutor to student ratios changing from

DISCUSSION
Human-AI tutoring has positive impacts on learning process data and student outcomes. Across all three sites, our findings support the hypothesis that hybrid human-AI tutoring increases student engagement with AI software and learning progress compared to student AI-software use alone (RQ1). In Site 1, we find statistically significant increases in average time spent and skills proficient among students participating in EdTech+MathTeacher and EdTech+MathTeacher+Tutoring compared to the Control (EdTech_Only) group (see Table 4). Mixed-effects linear models confirm that the EdTech+MathTeacher treatment increases engagement as measured by time spent, and that the EdTech+MathTeacher+Tutoring treatment provides a statistically significant (though smaller) additional boost (see Table 5). The EdTech+MathTeacher treatment also increases the number of lessons completed in the math software, relative to the control group, but the additional boost provided by the EdTech+MathTeacher+Tutoring treatment is not statistically significant. In Site 2, we find human-AI tutoring increased engagement with AI in terms of time spent from 24 to 33 minutes per week and also boosted lesson completion rates for students whose skills assessments suggested greater need. Interestingly, human-AI tutoring may have actually slowed the progress of students who were at or above grade level (see Fig. 4). This finding hints at the possibility that the "round-robin" approach that enforced equal participation in human tutoring among all students could be productively replaced by a system that directs tutor effort toward students with greater need (e.g., where the tutor decides whom to help, based on information on a dashboard).
In Site 3, we took a step in this direction by piloting the introduction of a live tutoring dashboard into a human-AI tutoring intervention.Variations of the human-AI intervention were provided to all students, so this site lacks a Control group, per se.We found students completed more work spaces per hour of participation (compared to working with the math learning software without a human tutor) in the intervention with tutor-student ratios of 1:4 rather than 1:8.
Demonstrating benefits of hybrid tutoring using intermediate measures like time spent or skills proficient within the math software is a valuable step. Nevertheless, we also want to know how hybrid AI-human tutoring impacts students' performance on external assessments of knowledge, such as state standardized tests. Referring to the conceptual pathway in Fig. 1, linking human-AI tutoring to higher achievement, our current results provide strong support for step 1 (increased learning opportunities) and somewhat more tentative support for step 2 (increases in math software lesson completion). We do not present new results on step 3 (increases in achievement). Instead, we draw upon results from prior studies linking increases in learning opportunities and lesson completion achieved through recommended use of our software tools to realized increases in state test scores associated with that recommended level of use. We then use these past correlations to predict the achievement gains that could result from the increased utilization of software tools induced by our human-AI tutoring intervention. Table 7 displays an overview of the software applications used in this study, the documented past correlations between "recommended" use of this software and subsequent test score gains, and the increased utilization of the software associated with the human-AI tutoring interventions demonstrated in this study.
For Site 1 using IXL, the average weekly skills proficient for both groups participating in math software use only was 0.88 (see 1st row, 4th column in Table 7), whereas students engaging in hybrid human-AI tutoring as treatment averaged 1.03 skills proficient per week (5th column). To put this finding into perspective, a prior IXL-conducted study indicates students who reach proficiency on one IXL math skill every other week demonstrate statistically significant increases in performance on the Pennsylvania System of School Assessment (PSSA) after a two-year period [27]. At our study site, treatment students attain, on average, at least one skill proficient every week. Evidence from Site 1, when considered alongside prior IXL research [27], tentatively suggests that hybrid tutoring may contribute to enhanced student performance, indicating the potential for accelerated achievement gains. At Site 2, which used i-Ready, all students engaged in i-Ready for the majority of the academic year, averaging 24 minutes per week (see 2nd row, 4th column in Table 7). In the later three weeks of hybrid human-AI tutoring treatment, average usage increased to 33 minutes per week. Curriculum Associates reports 45 minutes of usage per week is associated with significant improvements on the SBA [22] and suggests 30-49 minutes of time-on-task weekly with at least 70 percent of lessons passed for the year [5]. Our results suggest that human-AI tutoring generates substantial progress towards that recommended level of utilization.
Site 3 used MATHia but always combined with remote human tutoring; thus we cannot compare human-AI tutoring to software only. However, we did see an impact from our improved remote tutoring implementation, which particularly changed in going from a tutor-student ratio of 1:8 to 1:4. This change led to students passing 0.36 more workspace lessons per hour than they did earlier (3rd row, 5th column in Table 7). This increase in workspace pass rate (above the unknown software-only rate) likely puts students well within MATHia's recommended 0.5-0.75 workspaces passed per hour (1st column). MATHia also recommends 60 minutes of use per week. We hope to encourage Site 3 to use human-AI tutoring every week instead of every other week to better approach this recommendation.
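The benchmark comparisons above reduce to simple arithmetic, which the following minimal Python sketch makes explicit. The numbers are those quoted in the text; the function and constant names are illustrative, and the Site 3 calculation assumes that weekly software time equals the tutoring session time averaged over the weeks between sessions, which the text does not state.

```python
# Figures quoted in the text above; names and layout are illustrative only.
IXL_BENCHMARK_SKILLS_PER_WEEK = 0.5    # one skill proficient every other week [27]
IREADY_RECOMMENDED_MINUTES = (30, 49)  # suggested weekly time-on-task range [5]
MATHIA_RECOMMENDED_MINUTES = 60        # weekly use recommended for MATHia


def ratio_to_benchmark(observed: float, benchmark: float) -> float:
    """How many times the benchmark rate the observed rate represents."""
    return observed / benchmark


def within_range(value: float, bounds: tuple) -> bool:
    """True if value lies inside the inclusive (low, high) range."""
    low, high = bounds
    return low <= value <= high


def average_weekly_minutes(session_minutes: float, weeks_between_sessions: int) -> float:
    """Assumption: usage equals session length spread over the session interval."""
    return session_minutes / weeks_between_sessions


# Site 1 (IXL): weekly skills proficient relative to the prior-study benchmark.
print(ratio_to_benchmark(0.88, IXL_BENCHMARK_SKILLS_PER_WEEK))  # software only
print(ratio_to_benchmark(1.03, IXL_BENCHMARK_SKILLS_PER_WEEK))  # human-AI tutoring

# Site 2 (i-Ready): is weekly usage inside the suggested range?
print(within_range(24, IREADY_RECOMMENDED_MINUTES))  # software only
print(within_range(33, IREADY_RECOMMENDED_MINUTES))  # human-AI tutoring

# Site 3 (MATHia): 50-minute sessions every other week versus every week.
print(average_weekly_minutes(50, 2), average_weekly_minutes(50, 1),
      MATHIA_RECOMMENDED_MINUTES)
```

Under these assumptions, moving Site 3 from biweekly to weekly 50-minute sessions would double average weekly minutes and bring usage much closer to the 60-minute recommendation.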
Students with the greatest needs benefit more from hybrid human-AI tutoring than students meeting grade level. Although we find hybrid human-AI tutoring to have positive effects on students' engagement in learning, these effects are not uniform across students of varying abilities (RQ2). Our mixed-effects linear model analysis at Site 1 identified an interaction between pretest and the impact of adding human tutoring support in conjunction with EdTech use. For all measures of time spent, questions answered, and skills proficient, human-AI tutoring produced significantly better outcomes for lower pretest students than for higher pretest students. We found a similar result at Site 2. For students below grade level (as identified by i-Ready's diagnostic score), the addition of human-AI tutoring raised their lesson completion relative to EdTech use alone; however, for students at or above grade level, the addition of human-AI tutoring may have slowed their lesson completion. Note that because we could not estimate a late-spring decline in Site 2, it may be that human-AI tutoring also helped students at or above grade level. Even if the human-AI tutoring effect at Site 2 is generally higher than estimated, the interaction effect remains. In these predominantly low-income student populations, our analysis indicates that lower-performing students, identified by lower pretest scores, may gain more than higher-performing students. Human-AI tutoring is benefiting the students most in need of it. Why might this be?
This result is just what we expect from dashboard-driven tutoring (right side in Fig. 2), but we were surprised to see it as a consequence of the round-robin tutoring approach used at Sites 1 and 2. While tutors did not visit needier, lower pretest students more often, they may have stayed longer with them. A second possible explanation is that student awareness that a tutor is coming has a positive impact on engagement, particularly for students who have a history of lower engagement. A third is that students with lower pretests are likely to need and benefit more from the relationship building and human tutor interaction provided. These results hint at the possibility that a dashboard-directed system that pushes tutors away from a "round-robin" engagement with all students to even greater interaction with students below grade level could yield even stronger aggregate results (cf. [17]) and achieve greater educational equity.
Math teacher support during EdTech use appears beneficial but may not be ideally equitable. Referencing the findings for RQ2, math teachers play a significant role in student engagement and learning when working in conjunction with AI-driven math software (cf. [17]). The importance of their role is evidenced in Site 1, where the EdTech+MathTeacher treatment yields statistically significant positive effects on math time spent and skills proficient relative to the Control condition without a math teacher. Perhaps the supervisor available in the Control group throughout and available in the Delayed Treatment in the fall did not provide any extra tutoring support beyond the software, whereas the math teacher did. Our analyses further indicated that having a math teacher present while students engage in using EdTech benefited higher pretest students more than lower pretest students. This result may be a consequence of teachers being inclined to support students who request help, and students with higher prior achievement are more likely to do so. Better understanding of these teacher effects on student engagement and educational equity is an interesting direction for future work.
Rapid quasi-experiments provide early evaluation guidance. We suggest using quasi-experimental methods to collect rapid and reasonably reliable evidence, both to guide iterative design in early program development and to create a robust "web of validity" as a program matures. These methods can not only address intervention effectiveness but also delve into the mechanisms behind its effect. Our results support the notion that human-AI tutoring can increase engagement with math software and some intermediate measures of student learning, that this effect exists especially for students below grade level, and that fewer tutors per student (a ratio of 1:8 rather than 1:4) dampens impact. This approach provides useful data early in a project without the high costs, in time and money, required by traditional RCTs.

LIMITATIONS, IMPLICATIONS, FUTURE WORK, & CONCLUSION
While these quasi-experiments shed light on the effect of hybrid human-AI tutoring during the school day, they also possess important limitations. The absence of random assignment introduces the possibility of selection bias and limited control over extraneous variables not present in true experiments [23]. Small sample sizes also reduce statistical power, hindering analysis of more complicated interactions. Implementation fidelity is harder to determine in these regular math classroom-based field studies than in after-school program settings over which researchers might be able to exercise more control. Each site differed in implementation details (see Table 1), which may limit the external validity of our results, as school-specific factors and demographics may influence the human-AI tutoring intervention's effectiveness differently in other settings. Finally, we employed disparate measures across sites, which may raise issues concerning validity and reliability. The measures used reflect the constraints and recommendations (see Table 7) of the different ed-tech tools. Where possible, we used consistent measures (e.g., time). Using different math software enhances the generalizability of the findings, reflecting the reality of implementing interventions aimed at helping students who need it the most. Remaining software-agnostic enhances equity by increasing access to our intervention among students and schools. Despite these limitations, our findings provide valuable insights on how to develop hybrid human-AI tutoring models, which will guide subsequent efforts to confirm and extend our findings.
This work suggests many productive areas for future research, including developing refined models of hybrid human-AI tutoring to direct attention to those who can benefit most (such as through AI dashboard support), and conducting controlled trials to understand the impact of hybrid human-AI tutoring on external assessments. Another key question for broader impact is how to further refine this model to increase its cost-effectiveness. Future work will also investigate the heterogeneity of treatment effects across student-level demographics.
Averaging across our three sites, our marginal cost per student was below $750, with a range from $597 to $1170 per year. The variability in annual cost is related to software licensing fees and site-based administration costs. Our work to date thus suggests that a marginal cost of about $700 per student, which is a small fraction of the recorded $3,500-$4,300 per student costs associated with other high-impact tutoring programs [12], is attainable. The powerful combination of potentially high efficacy and low marginal costs suggested by our early-stage results holds out the exciting possibility that this line of research could eventually yield high social returns.
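To make the "small fraction" claim concrete, the short Python sketch below divides the approximate $700 marginal cost by each end of the $3,500-$4,300 range cited from [12]. The function name is illustrative, and the calculation uses only the figures quoted above.

```python
def cost_share(our_cost: float, comparison_cost: float) -> float:
    """Our per-student cost as a fraction of a comparison program's cost."""
    return our_cost / comparison_cost


OUR_MARGINAL_COST = 700            # approximate annual cost per student
COMPARISON_RANGE = (3500, 4300)    # other high-impact tutoring programs [12]

# The share is largest against the cheapest comparison program.
highest_share = cost_share(OUR_MARGINAL_COST, COMPARISON_RANGE[0])
lowest_share = cost_share(OUR_MARGINAL_COST, COMPARISON_RANGE[1])

print(f"{lowest_share:.0%} to {highest_share:.0%}")  # prints "16% to 20%"
```

On these figures, the hybrid model costs roughly one-sixth to one-fifth as much per student as the comparison programs.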

Fig. 2. Tutor- and dashboard-driven human-AI tutoring can redirect teacher/tutor effort to support the students with the lowest level of prior math opportunities.

Fig. 3. Longitudinal plots displaying mean weekly (a) time spent and (b) skills proficient, across fall, early spring, and late spring for Control (blue) and Delayed Treatment (red).

A mixed-effects linear model was fitted containing the following fixed-effect predictors: student first Star diagnostic score, taken on or before December 2022 (pretest); time period (fall = 00, early spring = 10, late spring = 01); and student group, Control or Delayed Treatment (student_group). The following interaction terms were added: tutoring:MathTeacher; pretest:tutoring; pretest:MathTeacher. Individual student (studentID) and weeks, numbered 1-40 spanning study duration (week), are random effects with fixed means capturing unexplained variability that is unique to each student and week. The mixed model was fit by Restricted Maximum Likelihood (REML) using R's lme4 package (see Equation 1). The model was fitted across all three dependent variables: time spent, questions answered, and skills proficient, with all dependent variables and pretest scores normalized.

Fig. 4. Average lessons passed per week by estimated grade level. Notice that during the hybrid human-AI tutoring treatment, the neediest students (indicated by assigned lessons being several grades behind) passed more lessons, on average, compared to before treatment.

about 1:8 to about 1:4. Given this shift, we considered how increases in tutor availability and longer sessions may have affected student progress in MATHia. We measured student progress as the average number of MATHia workspace lessons completed per hour. Fig. 5 illustrates the total workspace lessons completed compared to the total time using MATHia for students while engaging in the treatment, involving tutor-to-student ratios of 1:8 (blue) and 1:4 (red). The rate of progress increases for students upon shifting from tutor-to-student ratios of 1:8 to 1:4. Students completed an additional 0.36 (95% CI: [0.02, 0.70]) workspaces per hour of MATHia engagement after the shift when they had greater access to hybrid human-AI tutoring, t(131.15) = 2.24, p = .03, 0 = 1.96.
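The fractional degrees of freedom reported above (131.15) are consistent with an unequal-variance (Welch) t-test, although the text does not name the test. As a minimal sketch under that assumption, the Python function below computes the mean difference, the t statistic, and the Welch-Satterthwaite degrees of freedom for two independent samples. The sample values are hypothetical placeholders, not study data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_test(sample_a, sample_b):
    """Return (mean difference, t statistic, Welch-Satterthwaite df)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean, using sample (n - 1) variances.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    diff = mean(sample_a) - mean(sample_b)
    t_stat = diff / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; generally not an integer.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return diff, t_stat, df


# Hypothetical workspaces-per-hour rates for the two phases (not study data).
rates_ratio_1_to_4 = [1.2, 0.9, 1.5, 1.1, 1.3, 1.0]
rates_ratio_1_to_8 = [0.8, 0.7, 1.0, 0.6, 0.9]

diff, t_stat, df = welch_t_test(rates_ratio_1_to_4, rates_ratio_1_to_8)
print(f"difference = {diff:.2f}, t({df:.2f}) = {t_stat:.2f}")
```

Unlike the pooled-variance t-test, this form does not assume equal variances in the two phases, which is why its degrees of freedom need not be a whole number.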

Fig. 5. Total workspaces (i.e., lessons) completed compared to the total time each student spent using MATHia while engaging in the treatment involving tutor-to-student ratios during the Pilot Treatment phase (1:8, blue) and Maturing Treatment phase (1:4, red). The projected student progress while engaging in the treatment with tutor-to-student ratios of 1:4 is indicated by the dotted red line.

Table 1. Summary of implementation characteristics across sites.
3.1 Site 1 Method
Site 1 took place in an urban Pennsylvania school district at a middle school enrolling 490 students, of which 99% were economically disadvantaged and 96% scored below math proficiency. Student demographics were 73% Latinx, 18% Black, 3% White, 1% Asian, and 5% multiracial. The entire 7th grade population, excepting nine students due to data loss (n=125), was arbitrarily assigned by school administrators to either a Control (n=73) or a deferred treatment, in which students received the same condition as the Control initially and then participated in the Delayed Treatment (n=52). During fall semester, defined as August 22, 2022 through January 18, 2023 (16 weeks), Control and Delayed Treatment groups rotated through special classes with supplemental math support, defined as students using IXL math software with a non-math (i.e., social studies, science, English/language arts) teacher facilitating the session but providing no further guided math support (EdTech_Only). Beginning in early spring, January 19, 2023 through April 6, 2023 (6 weeks), the Delayed Treatment students received supplemental math support from their math teacher, who provided guided help while they engaged in IXL (EdTech+MathTeacher). In late spring, April 7, 2023 through May 22, 2023 (7 weeks), the Delayed Treatment students continued to receive supplemental math support from their math teacher, but they were also assisted by tutors initiating remote sessions with individual students through teleconferencing software during support time (EdTech+MathTeacher+Tutoring). Tutor-to-student ratios averaged 1:4. Table 2 summarizes the interrupted time series design comparing the Control group to the three conditions within the Delayed Treatment group: math software use only; math software use with a math teacher; and math software use with a math teacher and tutors.

Table 2. The Control used EdTech throughout (EdTech_Only). The Delayed Treatment group added math teacher supervision (EdTech+MathTeacher) in early spring and remote human tutoring (EdTech+MathTeacher+Tutoring) in late spring.

Table 3. Site 1 timeline of conditions for Control and Delayed Treatment groups with descriptive statistics displaying the math learning process data. The average weekly mean is shown with standard deviation indicated in parentheses.
Independent t-tests were performed comparing IXL learning process data measures for the Control and Delayed Treatment groups across each time frame (i.e., fall, early spring, late spring) for three purposes: preliminary data exploration, initial hypothesis testing, and informing development of the mixed-effects linear regression model.

Table 4. Independent t-tests comparing Control and Delayed Treatment groups in fall, early spring, and late spring for each average of learning process data per student per week.

Table 5. Mixed-effects linear model predicting math time spent.

Table 6. Mixed-effects linear model predicting lessons passed per week.

Table 7. Overview of software used, with the research-supported recommendation correlating with predicted outcomes on state standardized tests, and our findings for student software use only and hybrid human-AI tutoring. *Pennsylvania System of School Assessment, two-year study [27]. **Smarter Balanced Assessment Consortium Math Assessment [22]. ***CTB/McGraw Hill Acuity Series [26,29]; additional correlational studies associate MATHia usage with state test outcomes in Virginia and Florida [33].