An Empirical Evaluation of Educational Data Mining Techniques in a Dynamic VR Application

What makes an Expert+ Beat Saber player? In the field of Educational Data Mining (EDM), there are various techniques for estimating latent skill mastery, such as Bayesian Knowledge Tracing (BKT) and Item Response Theory (IRT). While these techniques can estimate a student's skill level and even predict their performance, they have yet to be applied to dynamic, embodied motor tasks or immersive environments. In this work, we explore how these techniques may be used for VR learning applications and apply them to estimate latent skill mastery and task difficulty in a VR game similar to Beat Saber. We conducted a pilot study (n = 24) and a full study (n = 75) to collect empirical data from players of different skill levels in the VR game. While the EDM techniques lacked accuracy, they offered other benefits, such as helping to identify flaws in the learning system design and the skill modelling. Through scrutinizing our methodology, we identify five challenges in applying these techniques to VR, offer insights into developing robust assessment systems for VR learning environments, and set an agenda for future research.


INTRODUCTION
Estimating a user's skill level, or mastery, is a central challenge in the realm of digital training systems [8]. Understanding this intricacy is critical as it directly impacts our ability to design customized learning experiences tailored to the development of students' skills [8]. While VR training and learning systems are becoming more common [38], there is a lack of effective methods to reliably measure and evaluate learners' skill development [7,34,52,64]. For example, VR systems, such as a virtual laparoscopic surgery training system or a virtual fire-fighting training system, should be capable of both measuring users' skills and identifying challenging tasks or less-developed skills for users.
Traditionally, VR researchers have predominantly focused on measuring performance in individual tasks (e.g., number of hits in tennis) as opposed to the mastery of hidden or latent (e.g., agility, coordination) and higher-level skills [65]. For example, Gray et al. [18] used two consecutive successful attempts at a task to indicate a level of mastery in baseball and a failed attempt to indicate a lack of mastery. However, this simplified approach overlooks the underlying skills required to master a sport like baseball. For example, in a VR game of tennis, hitting a forehand involves more than a single skill; it requires users to have a certain level of agility, coordination, technique, and other latent skills. As such, digital learning systems should be able to effectively measure latent skills to inform the overall performance of users rather than focusing only on individual task performance.
Algorithms for explicitly modelling latent skill development have been proposed in the Educational Data Mining (EDM) literature, including Bayesian Knowledge Tracing (BKT) [8] and Item Response Theory (IRT) [23]. These algorithms use observable task performance to infer latent skill mastery, ability, and task difficulty [32]. They have been successfully used to (i) identify cognitive skills or knowledge components (e.g., learning algebra) that users struggle with individually and as a cohort, (ii) determine task difficulty, and (iii) help evaluate the effectiveness of a learning system [8,49]. Nevertheless, these EDM algorithms have been designed for, and only applied to, estimating cognitive skills (e.g., maths) rather than perceptual-motor skills (e.g., tennis). Applying these EDM techniques to VR may reveal the potential to develop more robust assessment processes and facilitate the future development of adaptive VR learning systems.
In VR training, learning is embodied, situated, and simultaneously conceptual and procedural [50,58]. Users usually need to combine both cognitive and perceptual-motor skills to succeed in dynamic VR experiences. As such, it is challenging to model skills in VR. When modelling traditional classroom-based cognitive skills, EDM practitioners have the advantage of using discrete, independent tasks, such as individual exam questions (e.g., algebra questions). That is not possible in continuous physical tasks. Equivalent chunking in continuous motor tasks presents numerous novel challenges. For example, prior motion impacts subsequent trajectories: in a game of tennis, a player's position when hitting the previous ball impacts their ability to hit the next, so we cannot treat the ability to hit each ball independently of the other. Many VR training systems involve physical dynamic movements. Thus, new ways of defining and modeling skills in VR must be explored.
In this study, we built a game similar to the popular VR game Beat Saber as our test case for modelling perceptual-motor skills. Beat Saber is a full-body game requiring tight spatiotemporal movement coupling for success. It has an inbuilt scoring system and a rich community of players who speculate about the best training strategies. By requiring players to hit targets in specific locations, directions, and times, Beat Saber mimics the skills needed for many racket and bat sports. Hence, Beat Saber is a strong case for exploring skill development in dynamic VR applications. Understanding skill acquisition in Beat Saber may offer broad implications for sports training and medical rehabilitation applications.
We conducted a pilot study of 24 participants followed by a full study of 75 participants to collect user performance data in our dynamic VR game. We then used two EDM techniques (IRT and BKT) to model and estimate the skill mastery and task difficulty of our participants. Our goal was to identify candidate skills, determine when users have mastered a skill, identify tasks and skills they struggle with, compare their performance to other users (the cohort), and, finally, use these results to evaluate the learning system. To the best of our knowledge, our study is the first to empirically explore the use of EDM algorithms for modelling perceptual-motor skills in VR. If VR skills can be modelled effectively, then these algorithms can be used to estimate mastery and predict future task performance. Subsequently, VR researchers can use these algorithms to evaluate VR learning systems and optimize systems for learning outcomes.
However, our models show relatively low prediction accuracy. To better interpret these surprising results, we thoroughly revisited our system design and scrutinized flaws in our skill modelling. Reflecting on our methodology, we then identify and discuss five major challenges that must be solved in order to accurately apply these techniques to assess users' skill mastery in VR. These challenges are: (i) under-defined skill modeling approaches in VR, (ii) the discrete versus continuous nature of tasks in VR, (iii) the impact of the number of tasks on model accuracy, (iv) the need to factor in coefficients such as fatigue and boredom, as well as guess and slip, when estimating perceptual-motor skills, and (v) the challenge of integrating sensor data with traditional task performance data.
In this paper, we make the following contributions:
(1) We explore the use of two EDM techniques in VR to estimate latent skill mastery and task difficulty, as well as to identify system design and skill modelling flaws.
(2) We propose and discuss five major challenges in applying these techniques to VR learning systems and set an agenda for future research.
Our investigation offers valuable insights and new approaches for evaluating VR users and learning systems. This work provides important implications for future research in the development of adaptive VR learning systems, where estimating skill mastery and task difficulty are vital.

BACKGROUND
In this section, we explain the relationship between task difficulty and skill estimation, and how one cannot effectively estimate skill mastery without an understanding of task difficulty. We discuss the challenges in estimating task difficulty and how these must be overcome before skill mastery can be reliably estimated. Finally, we discuss how Bayesian Knowledge Tracing (BKT) and Item Response Theory (IRT) address the challenges associated with estimating task difficulty and, subsequently, skill mastery.

Task difficulty and skill estimation
Motor learning depends on internal and external constraints, which can inhibit or support it [43]. There are three levels of constraints: the individual's internal physical or structural constraints, the behavioural or perceptual constraints, and the external environment and task constraints [22]. In this study, we focus on task constraints, which include the task's level of difficulty, and the functional constraints, which are perceived by the individual and are behavioural in nature. People's perceptions are impacted by their self-perceived skill level. For example, skilled athletes may perceive a ball (e.g., in a game of tennis) to be bigger, slower, and easier to hit, whereas amateur players tend to perceive the ball to be smaller, faster, and more difficult to hit. This perceptual phenomenon also occurs in cognitive skill development (e.g., maths): low-achievers are more likely than high-achievers to perceive a task as hard [42].
Thus, estimating perceived task difficulty is essential for assessing the performance or mastery of a student's skills. Additionally, it is important to note that it is not possible to measure a student's mastery of latent skills directly. Instead, latent skills can only be estimated from observable task performance [8]. Therefore, task difficulty and skill estimation are deeply intertwined. To illustrate, consider a student who succeeds in easy tasks but cannot consistently perform well in hard tasks: they may not have mastered the skill even if they have a high "score" or number of correct answers. It is therefore important to understand different task difficulties, as they can impact the ability to accurately estimate a student's skill mastery [23]. Regardless of the pedagogy or approach used for coaching motor skills or training cognitive skills, it is necessary to accurately measure task difficulty [49]. Furthermore, quantifying this subjective measure (task difficulty) for cohorts of students is essential when designing learning systems that aim to optimise learning outcomes.
Beyond task difficulty, there are other factors that should be considered when estimating skill mastery. Traditionally, when designing assessment strategies, researchers tend to use observable task performance as a measure of skill development. For example, when students successfully solve a given task, they are considered to have developed some level of mastery. Similarly, when students cannot solve a task correctly, they are considered to be 'struggling' [41]. However, response correctness alone has been shown to not fully reflect students' skill development [23]. For example, a correct response may be the result of a lucky guess: when the student does not know a skill, yet responds correctly. Correspondingly, an incorrect response may be the result of a slip: when the student knows a skill but gives an incorrect answer [4]. Therefore, assessing students' skill development based on response correctness alone is insufficient.
Thus, we need methods that can robustly estimate task difficulty in order to more accurately assess skill development. Quantifying task difficulty and estimating latent mastery are essential to designing effective learning systems. Recently, a plethora of VR educational and training systems have become popular. Despite the promise of using VR for training, there is little work exploring ways to quantify task difficulty and estimate skill mastery in VR.

Challenges in Estimating Task Difficulty
While the development of VR learning and training systems has witnessed growth and prominence, the approaches to skill assessment within these digital systems are still underdeveloped [20,27]. In order to accurately assess skills, it is first important to accurately estimate task difficulty. Task difficulty is different from task complexity: task difficulty is based on the perceived or actual behaviours of learners, while task complexity is defined by the instructor.
Although related, task difficulty and task complexity are different constructs [20,27]. Task difficulty is a subjective measure based on students' perception of a given task, whereas complexity is an objective measure determining how simple or complicated a given task is in terms of the required interactions and activities [5,16,53,54,60]. Current methods of assessing student performance in VR generally do not take into account the inherent difficulty of different tasks, only complexity, as defined by educators [65].
Zahabi and Abdul Razak (2020) conducted an in-depth literature review of 69 adaptive simulator training systems. While 23 systems used scenario complexity (e.g., an experience believed to have a certain level of complexity defined by the researchers) as an adaptive variable to challenge users and prepare them for more complex tasks, none of the reviewed systems evaluated task difficulty when assessing performance. Interestingly, the majority of systems used kinematic measures such as speed, accuracy, range of movement, force, or muscle activity to measure performance. Some researchers have used physiological responses to measure a trainee's cognitive load, stress level, and emotions that could correlate with perceptions of task difficulty [14,30,37]. While researchers have used their own judgements of task complexity to design their systems, they have not directly measured or inferred task difficulty based on students' performance. Without an understanding of task difficulty, the estimation of students' skill mastery may not be accurate.
Additionally, for learning to be effective, where students stay engaged, the learning content needs to be personalized and matched to each student's skill level [51,65]. For example, within a simulation-based learning environment, different learner profiles should be created so that users are offered learning content that exactly matches their skill level [6]. This concept is commonly recognized as flow, a highly focused mental state that can be achieved when the task difficulty is within an acceptable range for the student, i.e., it matches the student's skill level [11]. Existing VR learning systems, however, typically lack such customisation. Assessment systems in VR score users based solely on their task performance (e.g., successfully hitting a ball) [9,19,35], without any understanding of the difficulty of that particular task.
One way to 'measure' task difficulty is to use self-reported task difficulty through surveys. However, self-reported data may not be reliable, as students' perceptions of task difficulty vary and could also suffer from social desirability bias [31]: as suggested in prior literature, people tend to report responses that they consider more 'favourable' to society or to the researchers. Additionally, surveys are mostly conducted at the start and end of studies and thus may not be able to trace students' learning over time.
One way to automatically and robustly estimate task difficulty and skill mastery is through the application of EDM techniques such as Bayesian Knowledge Tracing (BKT) and Item Response Theory (IRT). In the next section, we discuss these models in more detail and explain how they address the challenge of estimating task difficulty and skill mastery.

Psychometric and Knowledge Tracing Techniques
Recently, algorithms that can explicitly model latent skill mastery, such as Bayesian Knowledge Tracing (BKT) [8] and Item Response Theory (IRT) [23], have been gaining popularity in digital learning environments. These algorithms use observable task performance to infer latent traits, such as student mastery, ability, and task difficulty [32].
BKT is primarily focused on modeling the knowledge state of a student and how it changes over time as the student interacts with learning materials. The model parameters in BKT include: (i) initial knowledge: the probability that the student has mastered a skill initially; (ii) learning rate: the probability that the student transitions from a state of not having mastered a skill to having mastered it after an opportunity to practice; (iii) guess: the probability that a student answers correctly by guessing; and (iv) slip: the probability that a student answers incorrectly despite having mastered the skill.
Task difficulty can be indirectly incorporated into BKT through the guess and slip parameters. A more difficult task might have a higher slip rate (since even knowledgeable students might make mistakes) and a lower guess rate (since guessing correctly is harder). However, BKT does not explicitly model task difficulty as a separate parameter.
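As an illustration (not part of our system), the standard BKT update implied by these four parameters can be sketched in a few lines; the parameter values below are hypothetical, chosen purely for exposition:

```python
# Minimal sketch of the standard BKT update. P_L is the probability the
# student has mastered the skill; it is revised after each observed response
# using the guess (g), slip (s), and learning-rate (t) parameters.

def bkt_update(p_l, correct, g=0.2, s=0.1, t=0.15):
    """Return P(mastered) after observing one response (hypothetical params)."""
    if correct:
        # Posterior given a correct response: mastered-and-did-not-slip vs. guessed.
        cond = p_l * (1 - s) / (p_l * (1 - s) + (1 - p_l) * g)
    else:
        # Posterior given an incorrect response: mastered-but-slipped vs. did-not-guess.
        cond = p_l * s / (p_l * s + (1 - p_l) * (1 - g))
    # Account for learning during this practice opportunity.
    return cond + (1 - cond) * t

p = 0.3  # initial knowledge P(L0), hypothetical
for outcome in [True, True, False, True]:
    p = bkt_update(p, outcome)
print(round(p, 3))
```

Note how the paragraph above plays out in the code: a harder task would be modelled by a larger `s` and a smaller `g`, which damps the evidence a single correct response provides.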
In contrast, Item Response Theory (IRT) explicitly models task difficulty. IRT is a psychometric technique used in digital learning systems to assess latent skills and task difficulty [4,23]. IRT (using a Rasch model) is commonly implemented in computer adaptive testing to quickly determine a student's ability on a common scale defined by their peers. IRT involves modeling the relationship between a student's latent ability and the probability of correctly answering each item on a test.
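For concreteness, the Rasch model places ability and item difficulty on the same logit scale, so the two are directly comparable. A minimal illustrative sketch (the values are arbitrary examples, not study data):

```python
import math

# Rasch (1PL) model: probability that a student with ability `theta`
# answers an item of difficulty `b` correctly.
def rasch_p_correct(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A student whose ability equals the item difficulty succeeds 50% of the time.
print(rasch_p_correct(0.0, 0.0))  # 0.5

# Harder items (larger b) lower the success probability for the same student.
print(rasch_p_correct(0.0, 1.5) < rasch_p_correct(0.0, -1.5))  # True
```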
Beyond estimating individual student performance indicators, these techniques provide an understanding of how an entire cohort performs and which problems and skills they find more challenging [47,48]. These EDM techniques help education practitioners model cognitive mastery, assess learners, and design systems that optimize for learning outcomes [49].
BKT is predominantly used to model cognitive skills and has not been used to model perceptual-motor skills. These EDM approaches have not been used in embodied immersive virtual environments, where the nature and structure of tasks and skills are quite different. Thus, applying them to embodied skills in VR introduces numerous challenges, such as skill modeling.

Challenges in Skill Modeling
A good example of the complexities associated with modelling and assessing competencies in a skill such as "problem-solving" is presented by Shute and Almond [28]. They highlight a number of challenges with assessing problem-solving skills, such as the lack of a clear definition, the multidimensionality and ambiguity of the construct, and flawed self-report measures [57]. Additionally, many tasks may be linked to multiple skills, further complicating the skill modelling process [28]. We refer to this literature to highlight the skill modelling challenges raised by researchers exploring 2D game-based environments.
Moving to 3D spaces, modeling skills in VR training presents unique challenges due to the immersive and dynamic nature of the learning environment. Unlike the traditional settings where BKT and IRT are typically applied, in which tasks are discrete and independent, VR involves continuous physical tasks, making it challenging to define clear boundaries for skill assessment. The interdependence of movements in VR, where one action affects the next, adds complexity to skill modeling [50,58]. This necessitates the development of innovative modeling approaches. Researchers are tasked with exploring new methods to accurately assess and represent the diverse range of cognitive and perceptual-motor skills involved in VR learning experiences. Some, like Mislevy, argue that the telemetry harnessed from interactions in immersive games can provide rich evidence for making inferences about complex skills, opening new avenues for assessing skill level [45]. We argue that validated methods like BKT and IRT can offer insights from simple task performance data, helping to estimate skills and evaluate learning system designs in addition to sensor data.

The Future of EDM in VR
Abrahamson et al. discuss the importance of integrating research in embodied design and multimodal learning analytics (defined as a "methodological approach to analyzing data from multiple measures of students' actions and sensations") [1]. They also specifically refer to the technical considerations of early integration of Learning Analytics (LA) when designing embodied systems such as a VR learning system. Some examples include understanding students' engagement level, gaming behaviours, ability to reflect, and sense of awareness in learning environments [15,21,39]. We wonder how task difficulty influences these factors and, in return, how these perceptual differences impact students' perceptions of task difficulty.
VR learning analytics literature has more specifically looked at affective states, anxiety, mind-wandering, and the sense of agency, all of which may impact learning [25,26,29,40,59,63]. Once again, research on task difficulty and the zone of proximal development (ZPD) has shown that students' perceptions of their own ability impact their perception of task difficulty, and vice versa. Therefore, when measuring these affective states, it is important to have a measurable understanding of perceived task difficulty. We see our work as an initial stepping stone in integrating powerful techniques such as BKT and IRT into creating personalized embodied learning systems for the future of VR.
Thus, in this study, we empirically evaluate the application of EDM techniques in VR. We use them to estimate task difficulty and skill mastery, and to assist in evaluating the learning system design. We highlight the value of using these techniques in VR while simultaneously setting an agenda for future research, given the complexity of integrating EDM into embodied domains.

METHOD
In this study, we assess participants' perceptual-motor skills in a movement-based VR environment. We integrated the game with a data collection module to collect participants' performance-based data (i.e., their scores) and their movement-based parameters. We used the data to estimate skill mastery and task difficulty. All procedures received approval from our institution's ethical review panel.

VR Experience Design
We used the Unity 3D engine to build a game similar to the popular VR game Beat Saber as our test case in this study. In the recreated game, the participant holds two swords (red and blue) and sees a series of approaching cubes of two colours, with four rotations, arriving from four positions, resulting in 32 different types of cubes. These cubes are randomly spawned at a distance and fly towards the participant. Each cube has three attributes: colour (red or blue), rotation (east, west, north, or south), and quadrant (top-right, top-left, bottom-right, bottom-left). The game requires participants to "cut" the incoming cube with the sword of the same colour, in the direction indicated by a marker (see Figure 1). If these two rules are met, the game records a successful hit, and the cube is destroyed. Otherwise, the cube is recorded as a miss. In contrast to the original Beat Saber game, our game does not include a music track, in order to control for the effect of beat misalignment. Instead, the number of cubes spawned per minute and the speed at which the cubes approach the participant are set directly by us for each experience.

Participants and Procedure
We recruited participants from our university community (through online channels and notice boards) with the criterion of having full upper and lower body mobility. We initially recruited 24 participants for a pilot study and then recruited 75 participants for the full study. The 75 participants included 40 women, 34 men, and 1 other, aged between 18 and 55. Participants from the pilot were not allowed to participate in the full study.
In the initial pilot study, participants were given different levels of game difficulty. They were given a 1-minute training scenario, followed by 4-minute scenarios that gradually increased in difficulty (speed of approach and number of cubes spawned per second). We found that the data suffered from a ceiling effect, where most participants successfully hit 80-90% of cubes. Our ability to quantify the task difficulty distribution helped us identify that the game was too easy (see the note on CTT task difficulty in the discussion section). This allowed us to make important changes for the full study, as the skewed data set resulted in artificially high accuracy numbers.
Thus, in the full study, all participants were given only one 3-minute session in the game at a high difficulty level (number of cubes faced per minute and speed of approach). All participants were given a brief description of the study and were informed that they would receive a gift voucher for participating in the experiment.
The Meta Quest 2 headset was used for the study with its original controllers. Participants did not have to press any buttons on the controllers; they simply moved the controllers to "cut" with the swords. Participants were connected via a Meta Link cable to the PC running the game and saving their data. All participants started standing at the same spot, where they were calibrated before starting the game. No walking was required during the game.

Measures
The application included an integrated data collection module. The module collected participants' hand and head positions and orientations at every frame, the cube attributes, the position and orientation of hits, and successful hits, as shown in Table 1. This data was recorded at every frame of the game and was stored in a private GitHub repository. The data was used to identify which cube was hit (from the 32 types) and with which controller.

Task Definitions & Modeling.
A task was defined as two consecutive cube hits, as opposed to a single cube hit. This was due to our belief that a participant's ability to hit a cube depends on the previous cube they faced. A successful task was therefore defined as two correct hits of a pair of consecutive cubes. Given that there were 32 cube types, this resulted in 1024 pair combinations, as explained in Figure 2. Two consecutive cubes can be the same colour or different colours, in different quadrants, or in different directions. We assumed that different combinations of cubes (tasks) required different skills. For example, hitting back-to-back cubes of the same colour may require different skills than hitting two consecutive cubes that change colour, as the participant not only has to consider the direction of the cut but must also remember which hand to use, since the sword colour in each hand must match the colour of the cube. Therefore, the skill required may change with each combination. For further exploration, we categorised these tasks into skills, as described below in the Skill Definitions section.
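The task space described above can be enumerated directly. This short sketch (illustrative only; the attribute labels follow our cube definition) confirms the counts of 32 cube types and 1024 pair combinations:

```python
from itertools import product

# Each cube is defined by three attributes, as described in the game design.
colours = ["red", "blue"]
rotations = ["north", "south", "east", "west"]
quadrants = ["top-right", "top-left", "bottom-right", "bottom-left"]

# 2 colours x 4 rotations x 4 quadrants = 32 cube types.
cubes = list(product(colours, rotations, quadrants))

# A task is an ordered pair of consecutive cubes: 32 * 32 = 1024 combinations.
tasks = list(product(cubes, cubes))

print(len(cubes))  # 32
print(len(tasks))  # 1024
```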

Skill Definitions & Modeling.
As the goal of this study was to estimate skill mastery, we grouped tasks representing different skills into different categories. This way, we can assess each participant's performance on different tasks and also infer their mastery of the associated skills. There are many assumed latent skills in a game like Beat Saber, such as target selection and accuracy, reaction time, and bimanual coordination, among others. Many of these skills are central to many VR experiences and are valid ways to define a skill. However, these skills are challenging to operationalize for a number of reasons.
Operationalizing skills like target selection, accuracy, reaction time, and bimanual coordination in VR experiences like Beat Saber presents a unique set of challenges. The multifaceted nature of these skills, combining cognitive and motor abilities, complicates the isolation and measurement of individual skills. Additionally, the continuous and interdependent nature of actions in VR makes defining clear, isolated tasks difficult. Lastly, the inherent complexity of VR environments complicates the creation of controlled conditions for isolating and measuring specific skills. Researchers, therefore, must innovate and devise context-specific methods to navigate these complexities.
We chose to model the skills as follows: given the lack of literature on embodied skill modeling, we categorized skills in our VR game based on suggestions by expert Beat Saber players in online forums and by comparing them to a game of tennis [44]. Using the described methodology, we grouped tasks (cube combinations) under skills, with an example presented in Figure 2. For example, one of those skills is "Flexibility - Color & Side Changing Cubes". We believe this skill is comparable to "flexibility", defined as the capacity of a joint to move freely through a full range of motion [24]. Participants needed the ability to switch hands and reach for an opposing-colour cube, requiring a full range of motion.
The second skill upon which we assessed the participants was "Agility - Occluded Cubes". Agility is defined as a rapid whole-body movement in which a change in direction or velocity occurs in response to a stimulus [56]. Generally, as a motor ability or skill, agility can influence achievement where a quick change of direction is needed. Researchers have suggested that agility is a multifaceted skill that may depend on several other abilities, such as power, speed, and balance [55]. Within this study, we defined "Agility - Occluded Cubes" as participants' ability to hit two cubes close to each other, i.e., back-to-back cubes. Hitting such cubes successfully requires sufficient speed from participants. Finally, a strategy for approaching cube combinations far from each other would be described as "technique" in the sports training literature [44], where the player must centre their body after each hit to be the shortest possible distance from the next hit.
This approach, while exploratory, provides a foundational framework that leverages existing knowledge from physical sports, offering a tangible starting point for understanding and analyzing skills in the relatively nascent domain of embodied VR experiences. This method is pragmatic in that it utilizes available expertise and established concepts, albeit from different contexts, to navigate the complexities of skill modeling in VR.

BKT & IRT Modeling.
In order to estimate mastery of each of the aforementioned skills and to estimate task difficulty, the data was modelled using IRT and BKT. IRT and BKT were selected for our analysis because they are two of the most widely utilized models in Educational Data Mining (EDM), each offering unique insights and capabilities for assessing and tracing student knowledge and performance.

Item Response Theory.
We began by using a Classical Test Theory (CTT) approach to determine the difficulty of each task (cube combination) [13]. CTT simply ranks all tasks based on how many participants got them correct; for example, if only a few participants answered a task correctly, it is assumed to be difficult. Using this estimate, we derived each participant's overall IRT-derived ability for each skill, i.e., we estimated the likelihood of correctly hitting combinations of target cubes of different difficulties.
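A minimal sketch of this CTT ranking step. The task names and response data below are hypothetical, used only to illustrate the computation, not our study data:

```python
# CTT "p-value" of a task = fraction of participants who completed it
# correctly; a LOWER p-value indicates a HARDER task.

responses = {
    # task id -> list of 0/1 outcomes, one per participant (made-up data)
    "same-colour pair": [1, 1, 1, 0, 1],
    "colour-change pair": [1, 0, 0, 1, 0],
    "occluded pair": [0, 0, 1, 0, 0],
}

p_values = {task: sum(r) / len(r) for task, r in responses.items()}

# Rank tasks from hardest (fewest correct) to easiest.
ranked = sorted(p_values, key=p_values.get)
print(ranked)
```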
In educational settings, IRT is often used in real-time online testing to understand a person's ability or skill level at a specific moment in time. However, in the context of our study, we did not model participants in real time, so we adjusted this technique to estimate ability after the game was completed and all the data had been collected. When used in real time, IRT functions as a psychometric test: a test that tries to measure mental abilities or hidden psychological traits by constantly challenging the student with progressively harder tasks to pinpoint their ability.
Even though the VR game was not initially designed as a psychometric test, we created a grading method to turn the game experience into one after the fact. We chose to grade only specific tasks that presented new difficulty levels to the participants, ignoring tasks that were at the same difficulty level as ones they had seen before. This approach also intentionally lowered the participants' success rates, which helped to overcome a problem (the ceiling effect) identified in the pilot study, where too many participants scored near the top, making it hard to differentiate between their abilities.
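The post-hoc grading rule can be illustrated with a short sketch (the data format and values are assumptions for illustration): only the first attempt at each new difficulty level is graded, and repeats of an already-seen level are skipped.

```python
# Illustrative sketch of grading only tasks that present a new difficulty
# level, ignoring repeats of levels the participant has already seen.
def grade_new_difficulty_only(attempts):
    """attempts: list of (difficulty_level, correct) pairs in play order."""
    seen, graded = set(), []
    for level, correct in attempts:
        if level not in seen:  # first time this difficulty level appears
            seen.add(level)
            graded.append((level, correct))
    return graded

attempts = [(1, 1), (1, 1), (2, 0), (1, 1), (3, 1)]
graded = grade_new_difficulty_only(attempts)
```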
We chose to use Classical Test Theory (CTT) and Item Response Theory (IRT) to analyze the VR game data for several nuanced reasons. We chose CTT to rank the difficulty of tasks because we did not have real-time feedback about which tasks participants found hard while playing the game. We assumed that task outcomes were independently and identically distributed (IID) and that the same mix of easy and hard tasks was given to all participants, whether they were high- or low-skilled. So, using CTT, we expected that, on average, higher-skilled participants would correctly complete more of the harder tasks.
Then, we used a specific IRT model, the Rasch model, to estimate the hidden (latent) ability levels of participants. The Rasch model is well suited to this application: the 1PL/Rasch model uses only the difficulty parameter and does not account for guessing effects or for items being more discriminating than others, which makes it appropriate for small sample sizes. Thus, we used CTT to rank tasks based on difficulty and IRT to understand how skilled the participants were, making assumptions to manage the fact that we did not have real-time feedback and that not all participants saw all tasks.
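The 1PL/Rasch response function can be sketched directly: the probability of a correct attempt depends only on the difference between ability and difficulty (ability and difficulty values below are illustrative):

```python
import math

# Rasch/1PL sketch: P(correct) is a logistic function of ability - difficulty.
def rasch_p_correct(ability, difficulty):
    return 1 / (1 + math.exp(-(ability - difficulty)))

# A participant whose ability equals the task difficulty succeeds half the
# time; higher ability raises the probability of a correct attempt.
p_equal = rasch_p_correct(0.0, 0.0)
p_skilled = rasch_p_correct(2.0, 0.0)
```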

Bayesian Knowledge Tracing (BKT).
Bayesian Knowledge Tracing (BKT) is a dynamic, probabilistic model widely utilized in educational data mining to predict students' learning and mastery of specific skills as they engage with educational content. In our study, we employed BKT to estimate participants' mastery levels across the defined skills within the VR environment. The model operates by iteratively updating the probability of skill mastery as participants progress through tasks, thereby providing a real-time, evolving estimate of their learning trajectory. BKT is particularly renowned for its ability to model the latent (unobservable) knowledge of learners by observing their performance on various tasks. We utilized pyBKT, a Python library [3], to implement the BKT model. The model yields not only mastery estimates but also Guess and Slip coefficients, which are pivotal in understanding the nuances of participant performance. A "guess" refers to an accidental correct response without mastery, while a "slip" denotes an incorrect response despite having mastered the skill. The detailed logic behind these algorithms is further described in the appendix. Furthermore, a comparative analysis was conducted, contrasting the BKT modeling against the IRT modeling and evaluating them on several metrics, including accuracy, recall, precision, F1 score, and Mean Squared Error (MSE), to discern and discuss the relative efficacy and applicability of each model in our study context.
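The iterative update at the heart of BKT can be illustrated with a minimal pure-Python sketch. This is the standard textbook update, not pyBKT's internals, and the parameter values are illustrative (chosen to echo the guess/slip magnitudes reported later):

```python
# One BKT step: condition the mastery prior on the observed attempt,
# then apply the learning transition to obtain the next prior.
def bkt_update(p_mastery, correct, guess, slip, learn):
    if correct:
        evidence = p_mastery * (1 - slip)
        posterior = evidence / (evidence + (1 - p_mastery) * guess)
    else:
        evidence = p_mastery * slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - guess))
    return posterior + (1 - posterior) * learn

# Trace a hypothetical participant through four attempts.
p = 0.2
for outcome in [True, True, False, True]:
    p = bkt_update(p, outcome, guess=0.4, slip=0.2, learn=0.1)
```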

Validation
We validate our mastery and task difficulty estimates by testing the predicted task response against the real result (correct or incorrect). We used binary classifier measures to assess prediction quality: precision, recall, F1 score, accuracy, and MSE.
We used leave-one-out cross-validation, which required training a model on all participant data except for one participant, then scoring that participant using the trained model. The participants had mastery scores calculated for each task and skill. This process was repeated for all the participants. The detailed methods and accuracy results are further explained in the supplementary materials.
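The leave-one-out loop can be sketched as follows. The toy `fit` and `score` functions below stand in for training and scoring the actual BKT/IRT models; all names and data are illustrative assumptions:

```python
# Leave-one-out cross-validation: fit on everyone except one participant,
# then score the held-out participant with the fitted model.
def leave_one_out(participants, fit, score):
    results = {}
    for pid in participants:
        train = {k: v for k, v in participants.items() if k != pid}
        model = fit(train)
        results[pid] = score(model, participants[pid])
    return results

# Toy stand-ins: "fit" predicts 1 when the training cohort's success rate
# exceeds 0.5; "score" is the held-out participant's prediction accuracy.
participants = {"p1": [1, 1, 0, 1], "p2": [0, 1, 1, 1], "p3": [0, 0, 0, 1]}

def fit(train):
    attempts = [a for seq in train.values() for a in seq]
    return 1 if sum(attempts) / len(attempts) > 0.5 else 0

def score(prediction, held_out):
    return sum(a == prediction for a in held_out) / len(held_out)

accuracy = leave_one_out(participants, fit, score)
```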
We attempted to estimate the cohort's mastery of each skill and predict their successful and unsuccessful task performance using the metrics presented in the results section. Accuracy is the ratio of correctly predicted task attempts to the total task attempts. Precision is the ratio of correctly predicted "correct" attempts to the total predicted "correct" attempts. Recall is the ratio of correctly predicted "correct" attempts to the total actual "correct" attempts in the skill. The F1 score is the harmonic mean of precision and recall. These metrics ensure that the model's high accuracy is not due to a skewed data set, where most attempts were correct and thus easily predictable. Finally, Mean Squared Error (MSE) measures the differences between the predictions and the actual "correct" attempts observed; the lower the number, the better. Additionally, we compared the accuracy of the BKT and IRT models to a random baseline. A "random baseline" serves as a comparative benchmark representing the accuracy of uninformed guessing: if a model were to predict outcomes without any informed logic, essentially guessing at random, what would its accuracy be? We evaluated three guessing strategies: guessing randomly, guessing all tasks as wrong, and guessing all tasks as right. We then selected the strategy with the highest accuracy, which was guessing all tasks as right, as our random baseline.
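The metrics above can be computed with a short sketch, here evaluated for the "guess all tasks as right" baseline strategy (the attempt data is illustrative):

```python
# Binary-classification metrics as defined in the validation section.
def binary_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    mse = sum((t - p) ** 2 for t, p in pairs) / len(pairs)
    return accuracy, precision, recall, f1, mse

y_true = [1, 1, 0, 1, 0]          # actual attempt outcomes (illustrative)
baseline = [1] * len(y_true)      # predict every attempt as correct
acc, prec, rec, f1, mse = binary_metrics(y_true, baseline)
```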

RESULTS
In the methods, we identified candidate skills; our goal was to estimate task difficulty, determine when participants had mastered these skills, and use these results to evaluate the learning system. Using these techniques, we were able to identify tasks participants struggled with and compare their performance to that of other participants (the cohort).
The results indicate that BKT performed only marginally better than a random baseline and IRT performed worse (Accuracy BKT: 0.69, IRT: 0.54, Random Baseline: 0.65). The random baseline accuracy of 0.65 means that a model predicting task performance by merely guessing would be expected to be correct 65% of the time. These results can be seen in Table 2 and in more detail in Figure 3 and Appendix B.

BKT Latent Skill Estimation
Figure 4 illustrates the latent skill estimation for participant number 40 for the "Agility" and "Flexibility" skills. In this figure, the red dots represent two consecutive successful (1) or unsuccessful (0) cube hits (only combinations that represent the skill are included). Based on these attempts and BKT's estimation of the participant's probability of guess, slip, and prior knowledge, the participant's mastery of the skills is estimated and represented by the blue line.
From Figure 4, we can see a difference in performance between the two skills (top and bottom). In the top figure, the participant never quite masters the skill, whereas in the bottom figure, after about 20 cubes, the participant has mastered the skill. The model assumes any misses after this point to be "slips", i.e., accidental misses. However, in the top figure, we can see that the model assumes mastery multiple times early on (see peaks). This reflects flaws in the skill modeling or the system design (see the Challenges section). Ultimately, these models help us evaluate both learners and the system.
Additionally, through this type of modelling, we can observe each participant's learning rate: the rate at which the system believes the participant went from their prior skill level (the very first prior is the global skill level of every participant in the system for that specific skill) to mastery of the skill. The learn rate distribution in our system ranged from participants learning the skill in under 1 minute to participants not learning the skill even after 3 minutes. Some participants had previous VR experience and tended to perform slightly better, but this was rare; the majority of our participants had no prior experience. Thus, experience did not explain why some did not learn.
Along with the estimate of mastery of each skill, BKT provides the probability of guess, slip, and a "correct" attempt. We demonstrated the accuracy of each task's "correct" predictions in the validation section above. High accuracy represents highly predictive task performance probabilities and can be optimised by improving the skill modelling. The guess and slip parameters can also be used to determine the granularity of the modelled skills, in other words, whether the grouping of tasks is too broad. Guess and slip parameters are presented in Table 3. We had a high (around 0.4) probability of guessing and a low (around 0.2) probability of slipping. It is important to note that the cohort's overall skill mastery, determined through the IRT modeling, is significantly impacted by these task difficulty predictions. Additionally, a participant who performs poorly among a group of high performers would be penalized more than if they were among similar-performing peers.
Figure 6 predicts skill mastery for a participant using IRT. We can see that the ability estimate rises and falls with the task difficulty estimate. It also shows that the model did not grade easier tasks (explained in the IRT section of the methods).

DISCUSSION
The BKT model performed marginally better at estimating mastery and predicting task performance (Accuracy BKT: 0.69, IRT: 0.54), as shown in Table 2 and in more detail in Figure 3 (and Appendix B). The marginally better accuracy of the BKT model over a random baseline (giving the most popular answer every time) highlights flaws in our skill modeling and system design. We argue that the models' ability to highlight these issues is why we should be using them. By helping identify flaws, researchers and practitioners can focus on improving those aspects of the system to make more accurate predictions of user performance. One way these models support researchers and practitioners is through the Guess and Slip coefficients. We were able to use the Guess and Slip coefficients to evaluate our skill modeling and learning system. We found the Guess coefficients were high (> 0.4). This may be an indicator that our skill definitions were "too big": more granular skills have been shown to produce better predictions with lower Guess and Slip coefficients [48]. These results are important in evaluating how well experts or system developers have defined the skills. More granular skill modeling improves the Guess and Slip coefficients and the model's prediction accuracy.
The IRT/CTT modeling, on the other hand, helped us determine latent task difficulty. Figure 5 illustrates the distribution of task difficulty (all tasks across all skills). A task with a low "probability of a correct attempt" can be seen as a task with high difficulty (the lower the probability of correct, the higher the task difficulty for that individual). IRT and CTT help us robustly pinpoint individual tasks that participants and cohorts struggle with. This can be used by system developers and domain experts to sequence tasks using techniques such as scaffolding (ordering tasks from easy to hard). But perhaps more importantly, it helped us recognize the limitations in the design of our system. The pilot study (top of Figure 5) initially highlighted that most of our tasks were too easy for most of our participants, creating a ceiling effect. This helped us design the full study with new parameters that increased difficulty. In the motor learning literature, optimal learning requires 25 per cent of tasks to be above the participant's mastery level [61]. If this percentage is not achieved, the training is either too hard (too many challenging tasks) or too easy (too few challenging tasks), leading to sub-optimal learning and poor learn rates. Task difficulty estimation thus helped us identify that we were not meeting this target in our initial pilot study. Researchers can explore the impact of varying task difficulties on the learning rate.
Furthermore, by estimating latent skills, we were able to observe each participant's learning rate (time taken to achieve mastery), as represented in Figure 4. The learning rate is represented by the slope of the mastery estimate curve. System developers or domain experts can use the distribution of learning rates among a cohort to optimise the learning rate for individuals or the cohort. For example, A/B testing among participants can be conducted to identify which sequence or difficulty of tasks leads to optimal learning rates for individuals and the cohort.
From a learner perspective, the models can help identify the skills participants mastered and those they struggled with (see Figure 4). While our implementation of these models did not accurately predict performance on future unseen tasks, with a higher-accuracy model we could pinpoint exactly when a participant should stop practising one skill and move to another without exposing them to every task. The models helped us compare participant performance to the performance of the cohort. Thus, we can identify the skills individual participants and the cohort struggle with (see Figure 7) and focus on the necessary changes to improve these skills. For example, we found participant 18 struggled with similar skills to the rest of the cohort, as we can compare them to the prior beliefs about the cohort's mastery level set out in Table 3. Overall, this participant was a poor performer compared to the cohort. In addition to enhancing learning system design, these insights allow us to share relevant feedback with learners.

Challenges and Opportunities
We have identified the following challenges and opportunities afforded by implementing EDM techniques in VR.
Lack of literature on modelling embodied skills in VR. To estimate skill mastery using BKT, we need tasks to represent higher-level embodied skills. This is referred to as skill modelling or skill definition. VR creates a complex environment where defining and isolating participants' skills may not be a straightforward process, especially when the existing literature on skill composition or modelling in VR is under-defined [12,49]. Performing perceptual-motor skills involves motor abilities as well as motor and cognitive processes; thus, defining independent skills becomes a more complex task. On the contrary, the knowledge components required for cognitive skills have been well defined over the years in the field of Educational Data Mining [43,49,62]. In a traditional classroom setting, the grouping of learning tasks and activities may be easier (e.g., individual exam questions). But the same may not be true for motor skills, as the equivalent chunking of continuous physical tasks presents novel challenges where participants' prior motion can impact their subsequent trajectories. While there are challenges associated with the lack of literature on defining the embodied skills the EDM models need in order to work, the EDM models themselves provide avenues for helping define the skills. As mentioned, the prediction accuracy of the models, as well as the Guess and Slip coefficients, supports researchers in defining skills. The model accuracy and coefficients provide feedback on how granular or representative the tasks are of the associated skills.
Discrete versus continuous nature of tasks in VR, and thus the need to intelligently discretize continuous motor tasks. As briefly touched on in the previous point, movements are continuous physical tasks. This raises the question: does discretizing them into correct and incorrect attempts make sense? Will this allow us to correctly model embodied skills? Or should we not pursue this path and instead focus on movement-related parameters such as reaction time, jitter, and speed?
We argue that if the skill modeling is done well, the issue of discretizing can be overcome (using methods like those used in this study), resulting in the EDM models providing important insights into how students and the learning system are performing. BKT and IRT may provide user and learning system performance data where sensor data cannot.
Impact of the number of tasks on model accuracy. From the EDM literature, we know that if we treat the entire game as one skill, or if our skills are not adequately refined, it can negatively impact the BKT model's accuracy [46]. It may similarly be concluded that the large number of cubes faced by each participant negatively impacted the models' accuracy. However, this is counterintuitive, as more data from each participant should help the model's predictive power. Given that so many cubes arrive in a short amount of time, each cube pair should have only a minimal impact on any overall assessment of skill. This requires further research, as the model may need to be modified. In typical use, BKT is applied to data with a small number of tasks per participant (e.g., 12 algebra questions) but a large number of participants (e.g., thousands); therefore, it is unclear how the model performs when participants face large numbers of tasks (60+ in our case).
Factors such as fatigue and/or boredom impacting skill estimation in VR. Beyond guess and slip, we can potentially see boredom and/or fatigue playing a role in the performance of some participants in this dynamic game. This is evident towards the end of Figure 8, where it is possible that this participant, who has in fact mastered the skill, has become tired or bored. While BKT helps us model guess, slip, and forgetting (when a participant previously mastered a skill but has now forgotten it), these are not time-bound coefficients and therefore do not equate to boredom, fatigue, or other effects that may occur during one learning session in a dynamic VR game. It may be necessary to create new coefficients or modify these models to account for phenomena specific to motor skill acquisition.
The complexity of integrating sensor data with traditional task correctness data into multimodal assessment techniques. While we collected significant amounts of sensor data (movement and interaction data about participants), there is little prior literature on how to integrate it with task performance data. Sensor data is highly relevant and valuable data gained in multimodal learning environments. However, the generalizability of this data and how it can be integrated into multimodal assessment techniques remains an open question. The research community needs to create frameworks for assessment techniques that use sensor data, traditional response correctness, and parameters such as task difficulty. For example, in the learning literature, the time a participant takes to answer a set of questions may be correlated with "gaming" the system: clicking through the material without engaging with the content. This gaming behaviour is often integrated with other parameters to understand the user's behaviour and performance.

Study Implications
Through our investigation, our work provides implications for the assessment of latent skills in VR. For digital learning systems, the findings of this work inform a potential path to the personalisation of learning content.
We also provide insights into both the online and offline design of adaptive learning systems and VR learning management systems. Bayesian Knowledge Tracing and Item Response Theory provide individual-user and cohort-level performance parameters that can be used to determine the optimal sequence of tasks, assess the difficulty of tasks to achieve an optimal learning rate (prior knowledge to mastery), and much more.
Finally, identifying students who are struggling due to their skewed perceptions of task difficulty (mentioned in the Background section) allows us to help such students by positively manipulating their perceptions. It may be possible to improve a student's performance by changing the characteristics of the task (size, speed, etc.), allowing the student to become more aware of their perceptual abilities [10]. It could also be used for talent identification, referred to as the Moneyball approach [33], where educators can quantitatively evaluate talent based not just on external constraints (e.g., physical attributes that impact game play) but also on perceptions, strategies, and techniques [17].
The methodologies and findings presented in this paper have potential applications in evaluating and tracking the progress of individuals engaged in various therapeutic interventions. For instance, the skill modeling and assessment approaches utilized in our study could be adapted to create VR-based therapeutic interventions for individuals recovering from motor impairments or undergoing physical rehabilitation. The VR environment can offer a safe, controlled, and engaging space where patients can perform targeted motor tasks, and our modeling approaches can provide detailed, data-driven insights into their progress and performance. Moreover, the application of models like BKT and IRT in such contexts could facilitate the development of personalized therapeutic pathways, dynamically adapting to the evolving abilities and needs of each individual. This could enable therapists to tailor interventions more precisely and potentially enhance the efficacy and efficiency of therapeutic outcomes. Future work could explore the adaptation and validation of our approaches in such therapeutic contexts, potentially opening new avenues for leveraging VR and data-driven models in rehabilitation and recovery processes.

Future Work
BKT is designed for cognitive skills; therefore, concepts such as fatigue are not included as parameters in the model, although Guess and Slip could potentially represent fatigue, or rather accidental hits, in a VR setting. In active VR learning settings, such as VR sports, fatigue may negatively impact user performance. We believe that fatigue needs to be modelled separately with time dependence. An additional parameter, Forgets, has also been used in the literature, but it only plays a role when a participant returns to a task after a long period of absence and is modelled into the algorithm from the start of the session. The "Forgets" parameter does not impact the modelling over time, so unless changes to the model are made, this parameter cannot be mapped to fatigue. This is just one example of where these models may need refinement to be better suited to an active, movement-based learning environment like VR; others are discussed in the challenges section above.
Perhaps most importantly, the learning analytics and EDM communities will greatly benefit from identifying new approaches to skill modelling in dynamic VR environments by addressing the challenges identified in the discussion section.

LIMITATIONS
We believe the guess and slip parameters, as shown in Table 3, suggest unrefined skill modelling [36], which resulted in poor model accuracy. However, it is important to note that the model was not empirically degenerate, as all the constraints set out for knowledge tracing models were met (such as no negative learning transitions and no guess or slip probabilities greater than 1) [2].
We did not use a domain expert to help us model the identified skills; an expert may have approached the modeling differently. Additionally, our tasks were not truly independent (i.e., one cube combination belonging to only one skill). Pure BKT expects tasks to represent only one skill or to be categorized under both skills.
We did not simultaneously assess other parameters, such as participants' movement patterns. This multimodal approach could have shed light on some of the individual behaviours seen in the data. For example, it appeared that participants' perceptions of the speed of the game changed as they played longer: initially they perceived the game to be much faster than it actually was and moved in an erratic manner before slowing their bodies down.
Given the under-defined area of VR skill modeling, our work is a preliminary exploration into classifying and modelling these skills. VR skills modelling will be a big area of future work for us and other researchers.

CONCLUSION
In conclusion, we demonstrate that through the use of EDM techniques we can highlight system design and skill modelling flaws. We were able to use these methods to estimate user skill levels and evaluate the design of the system. By systematically quantifying latent task difficulty, we were able to identify a ceiling effect in our pilot study and a sub-optimal task distribution for learning gains, which we improved in the full study. We were able to use the model coefficients (Guess and Slip) as well as accuracy to identify whether our system and our understanding of the skills were accurate or needed further refinement. We encourage the use of these techniques in VR environments to better understand and evaluate users, "VR skills", and learning system design. To develop accurate assessment techniques for VR environments, we identify challenges that need to be addressed: under-defined skill modeling in VR, the discrete versus continuous nature of tasks in VR, the simultaneous use of cognitive and motor skills in VR, the impact of the number of tasks on model accuracy, and the need to factor in coefficients such as fatigue and boredom, as well as guess and slip, when modeling students in VR.

A RESEARCH METHODS
A.0.1 Bayesian Knowledge Tracing. We used a Python package called pyBKT to train the Bayesian Knowledge Tracing (BKT) models. The package trains on historical task completion data from previous participants to fit global slip, guess, and learn probability parameters. The parameters are fit for each of the four skills using an Expectation Maximization optimization algorithm.
To score a new participant, the fitted slip, guess, and learn probabilities are combined with a prior probability of mastery to calculate a posterior probability of mastery given the observed task outcome. For a correct attempt, the Bayesian update is

P(L_t | correct) = P(L_t)(1 - P(S)) / [P(L_t)(1 - P(S)) + (1 - P(L_t)) P(G)]

and the prior for the next task incorporates the learning transition:

P(L_{t+1}) = P(L_t | obs) + (1 - P(L_t | obs)) P(T)

where P(L_t) is the prior probability of mastery, P(S) the slip probability, P(G) the guess probability, and P(T) the learn probability. Each participant has a different trajectory of mastery tracing based on these updates to the prior on mastery. As the participant continues to receive tasks and get them correct or incorrect, the model continually updates its prior belief until the mastery updates converge on a mastery probability. Based on this probability of mastery, the model predicts that the next task will be completed correctly if the probability of mastery is greater than 50%.
A.0.2 Item Response Theory. We built a probabilistic Item Response Theory (IRT) model that fits difficulty parameters for each task, then uses a Maximum Likelihood Estimate (MLE) to calculate the probability of mastery given the correct and incorrect tasks given to the participant. The first step is to determine each task's global difficulty from the historical participants. To do this, we simply find the percentage of attempts on a task that were completed successfully by all participants and subtract this percentage from 1:

d_i = 1 - (number of correct attempts on task i) / (total attempts on task i)

Since the difficulty distribution is often skewed in one direction or another, a Box-Cox transformation is performed on all the difficulties to make them behave more like a normal distribution. This is important for establishing a more evenly spaced ordinal relationship between tasks of different difficulty.
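The normalization pipeline described here can be sketched as follows. The Box-Cox parameter is hard-coded for illustration (in practice it is chosen to maximize distribution normality), and the raw difficulty values are hypothetical:

```python
import math

def boxcox(x, lam):
    """Box-Cox transform of a single positive value."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def minmax(values):
    """Re-scale values to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [0.1, 0.3, 0.8]                       # hypothetical CTT difficulties
normalized = [boxcox(d, 0.5) for d in raw]  # lam = 0.5 for illustration only
scaled = minmax(normalized)                 # difficulties rescaled to [0, 1]
```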

Figure 1 :
Figure 1: The cubes approach the participant head-on, as shown by the travel direction arrow. The participant must cut the cube on the face indicated by the orientation and position of the blue coloured marker. Any other direction would be an incorrect hit and would not record a score in the game. As the cube rotates, the participant must ensure that they are cutting in the correct direction and with the matching sword colour.

Figure 2 :
Figure 2: The 1024 tasks are defined as hitting 2 cubes correctly. These two cubes can be the same colour, different colours, in different quadrants or in different directions. The challenge changes with each combination. In the left image, the participant has to cut in one direction and completely change trajectory in time to hit the next incoming cube as the rotations have changed. Finally, in the right image, the participant has to not only worry about the direction of the cut but also remember which hand to use as the colours and quadrants are also changing.

Figure 5 is the histogram of the distribution of task difficulties as predicted by CTT based on the outcomes of the entire cohort (all participants). The figure illustrates a comparison of task difficulty estimates between the pilot study and the full study, revealing a notable shift toward increased task difficulty in the full study.

Figure 3 :
Figure 3: Accuracy & MSE for BKT (left side) and IRT (right side) for the skill Agility -Occluded Cubes

Figure 4 :
Figure 4: BKT estimating two skills, Flexibility (top) and agility (bottom), for participant number 40 indicates that this participant masters one skill but not the other.

Figure 5 :
Figure 5: CTT distribution of task difficulty from the pilot study (top) and full study (bottom). Task difficulty is defined as the "probability of a correct attempt". This led to the redesign of the study to remove the ceiling effect, highlighting the importance of quantifying latent task difficulty in the system design process.

Figure 6 :
Figure 6: IRT ability and difficulty plots for participant number 35

Figure 7 :
Figure 7: Example of individual participant results

Figure 8 :
Figure 8: We can see performance drops in this amateur participant as they become fatigued or bored, which leads the model to readjust the mastery estimation, potentially inaccurately.

The Box-Cox transformation is

d_i' = (d_i^λ - 1) / λ,

where λ is the optimal parameter for distribution normality. To re-scale the distribution of normalized difficulties to be between 0 and 1, we use a min/max scaler that performs the following operation:

d_i'' = (d_i' - min_j d_j') / (max_j d_j' - min_j d_j').

Once we have calculated the difficulties for all the tasks, we can use these values to generate Item Characteristic Curves (ICC) for tasks that are both correct and incorrect for a standard 1PL Rasch model:

P(X_i = 1 | θ) = e^(θ - d_i'') / (1 + e^(θ - d_i''))   (ICC if task is correct)

P(X_i = 0 | θ) = 1 - P(X_i = 1 | θ)   (ICC if task is incorrect)

Figure 9 :Figure 11 :
Figure 9: Precision, Recall, F1 Score, and Accuracy for BKT (left side) and IRT (right side) for the skill Agility -Occluded Cubes

Table 1 :
Summary of Collected Data.
LCWrongHit — Left Controller Hit Cube ID, Wrong Hits
RCHitPos, RCHitAngle — Right Controller Hit Position, Angle
RCHitCubeID, RCWrongHit — Right Controller Hit Cube ID, Wrong Hits

Table 2 :
Model Accuracy

Table 3 :
Skill Coefficients and Prior Beliefs on Mastery Level. We can see that the hardest skill was Flexibility: an estimated 17% of students had mastery at the start of the training.