Tunnel Runner: A Proof-of-Principle for the Feasibility and Benefits of Facilitating Players' Sense of Control in Cognitive Assessment Games

Cognitive assessment games attempt to improve cognitive assessment's experience and data quality by implementing game-like features, e.g., points and narratives. However, cognitive games maintain the repetitiveness and restricted control common in traditional cognitive assessment tasks, which thwart players' sense of control and impair their motivation and experience, leading to only modest improvements over traditional tasks. To demonstrate the value of designing cognitive games that facilitate a sense of control, we created and evaluated the infinite runner game Tunnel Runner. In two studies (n1 = 117, n2 = 121), we assessed the validity of the game's cognitive measurements (inhibitory control, decision-making) against traditional cognitive tasks. Our results demonstrate Tunnel Runner's valid and reliable cognitive measurements alongside substantial improvements to players' experience and sense of control compared to the cognitive tasks, showcasing the feasibility and benefits of cognitive games designed to facilitate players' sense of control.


INTRODUCTION
Cognitive functions enable individuals to adapt to the uncertainty and ever-changing demands of the modern world [26], and are therefore assessed in a wide variety of applications, ranging from targeted persuasion [96] to personnel selection [75] and health care [72]. Cognitive functions are often measured with computerized cognitive assessment tasks, such as flanker [33] and stop-signal [58,59] tasks. Unfortunately, cognitive tasks tend to disengage and demotivate participants and often provide biased [69,108] and unreliable [27,43,76] measurements that are unlikely to correspond to meaningful real-life outcomes [19,32,78].
The mismatch between the demands cognitive tasks impose on participants and the experience they provide is a major obstacle to obtaining high-quality cognitive measurement. Cognitive tasks require focused, engaged, and motivated participants who continuously exert cognitive effort and whose behaviors consistently reflect specific cognitive functions. The mismatch arises because participants' sense of control over an interaction is crucial for their motivation [21,23] and engagement [31,73,74], while cognitive tasks restrict participants' control to ensure that their responses reflect specific cognitive functions. Cognitive tasks exert tight experimental control over the interaction to ensure measurement validity [87], resulting in static environments that repetitively provide participants with narrow affordances and are unaffected by participants' responses. This practice diminishes participants' sense of control and harms their experiences, leading to bored [6,44] and inattentive [10,11,28] participants whose cognitive functioning can be misrepresented by the task [69,108].
Cognitive assessment games attempt to improve the experience and data quality of cognitive assessment by implementing game-like features such as points and narratives into cognitive assessment [24,61,104]. Significant steps have been taken towards using games for cognitive assessment: cognitive games have been validated as cognitive measurements [36,61,104] and showed the ability to provide markers of psychopathology [45,67], while proof-of-concept studies established that performance on commercial games can be sensitive to cognitive impairments [1,37]. However, cognitive games maintain tight and task-like experimental control and thus show disappointingly modest benefits to players' experience compared to cognitive tasks [e.g., 36,62,63,71,93,105], as cognitive games overlook a core feature of good games: giving players a sense of control over the interaction. At the same time, reduced experimental control could risk invalidating the resulting cognitive assessment, as players may respond in ways that do not reflect specific cognitive functions, cannot be modeled, or prevent comparisons with other players. Thus, it is important to first establish the feasibility and benefits of cognitive games that facilitate players' sense of control by reducing experimental control, alongside practical approaches to making these games. To this end, we developed an infinite runner game called Tunnel Runner, designed as a proof-of-principle for the feasibility and benefits of cognitive games that facilitate players' sense of control.
We developed Tunnel Runner by integrating cognitive challenges to inhibitory control [26] and decision-making under uncertainty [91] into the game. The challenges were inspired by Simon [60,88], flanker [33], stop-signal [58,59], and Iowa gambling tasks [5], which are recognized by the National Institute of Mental Health's Research Domain Criteria and have been extensively used to study individual differences in cognitive function and psychopathology [5,26,91,92]. When integrating the challenges, we aimed to maintain continuous and game-like player control alongside a less repetitive experience. To maintain player control while integrating these cognitive challenges, the challenges were aligned with the game's core gameplay, i.e., continuously moving through a tunnel while avoiding obstacles.
To assess the feasibility and benefits of cognitive games that facilitate players' sense of control, we conducted two online studies (n1 = 117 and n2 = 121) designed to meet the following research goals:
RG1: Assess whether players show expected behavioral patterns in response to Tunnel Runner's cognitive challenges (internal validation).
RG2: Assess whether the game's measurements correlate with other relevant measurements (external validation).
RG3: Examine the reliability of Tunnel Runner's cognitive measurements.
RG4: Compare players' experiences of autonomy, motivation, and engagement when playing Tunnel Runner with standard cognitive tasks.
We met these goals by assessing the validity of Tunnel Runner's measurements both internally and externally (RG1-2): against flanker and stop-signal tasks, and against self-report measures of impulsivity and cognitive reflection, which are related to decision-making under uncertainty [89,90,106]. Furthermore, we evaluated the reliability of the game's measurements (RG3) and compared players' experience of the game and the flanker and stop-signal tasks (RG4). We found that Tunnel Runner's cognitive challenges evoked different response patterns than the cognitive tasks. Yet, all the game's cognitive challenges, except the Simon-based challenge, enabled valid and reliable cognitive measurements. Furthermore, despite taking longer and requiring more effort than each cognitive task, Tunnel Runner provided far greater experiential improvements than the modest benefits observed with comparable cognitive games. Thus, we establish Tunnel Runner as a powerful cognitive measurement system and a successful proof-of-principle for the feasibility and benefits of facilitating players' sense of control during cognitive assessment. As such, we contribute a justification and a foundation for, along with a first glimpse into, what should be the next step in the evolution of cognitive games: games that provide high-quality cognitive measurements while facilitating players' sense of control over the interaction.

Task-based cognitive assessment
A good cognitive measurement is reliable and valid, which means that it consistently and accurately measures a specific cognitive function. Cognitive tasks can be divided into conflict and non-conflict tasks. Non-conflict tasks ask participants to repeatedly respond to a single set of stimuli and associated decision rules, where the quality of participants' decisions is used to calculate a score that reflects cognitive functions such as decision-making under uncertainty in the Iowa gambling task [5], or working memory in memory span tasks [17]. Conflict tasks measure cognition by measuring differences between responses to regular and conflict trials, where conflict trials require participants to use specific cognitive functions to overcome the conflict evoked by, for example, surrounding the target stimulus with misleading and similar-looking distractors [33]. Conflict tasks include most inhibitory control measures, such as flanker [33], Simon [60,88], and stop-signal tasks [58,59].
Using multiple trial types enables the calculation of difference scores that reflect changes between responses in regular and conflict trials. These difference scores enable conflict tasks to separate cognitive functions of interest from irrelevant factors, e.g., participants' general tendency to respond quickly or accurately, which are accounted for by the regular trials. However, the reliability of difference scores decreases as the correlation between responses in regular and conflict trials increases [107], often resulting in unacceptably low measurement reliability for conflict tasks [19,43,76].
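This trade-off can be made precise with the classical test theory formula for the reliability of a difference score D = X − Y, where X and Y denote scores in regular and conflict trials (a standard result, restated here for illustration):

```latex
r_{DD'} = \frac{\sigma_X^2\, r_{XX'} + \sigma_Y^2\, r_{YY'} - 2\, r_{XY}\, \sigma_X \sigma_Y}
               {\sigma_X^2 + \sigma_Y^2 - 2\, r_{XY}\, \sigma_X \sigma_Y}
```

For example, with equal variances and component reliabilities of .80, a between-condition correlation of .70 leaves the difference score with a reliability of (.80 − .70)/(1 − .70) ≈ .33, well below conventional standards.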
Reliability reflects how much of a measurement's total variability is attributable to individual differences as opposed to measurement error [82,107]. Reliability is a crucial measurement property, as unreliable measurements cannot accurately assess individual differences in cognitive function [107]. Many cognitive tasks, particularly conflict tasks [27,43,76], have low reliability and, therefore, often fail to correspond to meaningful real-life outcomes [19,78].
Cognitive tasks use many repetitions to reduce measurement error while exerting tight experimental control to ensure that participants' responses reflect specific cognitive functions. Thus, cognitive tasks, particularly conflict tasks, provide long, repetitive, disengaging, and demotivating experiences that nevertheless often fail to produce reliable measurements [19,32,43]. Furthermore, the tediousness of cognitive tasks impacts their measurement validity by inducing boredom [6,44] and inattentiveness [10,11,28], which can lead participants to respond in ways that do not reflect their cognitive functioning [69,108].
Facilitating participants' motivation and engagement should help alleviate the reliability and validity problems that plague cognitive tasks. Cognitive tasks often show low variability between individuals [43], which should be enhanced by motivating participants to perform at their best. At the same time, measurement noise can be reduced by facilitating greater engagement, which should lead participants to respond more consistently. Higher motivation and engagement should also prevent boredom and inattentiveness and, therefore, alleviate measurement validity threats. Thus, participants' motivation and engagement should be prioritized in efforts to create better cognitive assessment tools.

Game-based cognitive assessment
Video games exemplify the use of good player experience to engage and motivate participants to complete repeated cognitive challenges [24,25,84], and therefore serve as inspiration for the field of game design for cognitive assessment. Game designs for cognitive assessment typically take a gamification approach [61,104], where game-like features are integrated into non-gaming contexts [25], resulting in cognitive games that use features such as points, narratives, and improved aesthetics alongside task-like gameplay mechanics [e.g., 36,62,63,71,93,105]. Such cognitive games have been validated as cognitive measurements [36,61,104] and showed the ability to predict psychopathology [67]. However, the gamification approach leads cognitive games to maintain task-like experimental control, limiting players' sense of control over the interaction.
By limiting players' sense of control in favor of experimental control, cognitive games overlook a crucial driver of players' engagement and motivation. Intrinsic motivation [83], a state in which an activity is performed for its own sake, because it is fun, interesting, and meaningful, is facilitated by a sense of relatedness, competence, and autonomy [98], where autonomy is driven by one's sense of control [22,47,81]. Furthermore, engagement, understood as the depth of investment when interacting with a system, is facilitated by factors such as aesthetic appeal, rewards, and users' sense of control [74,77]. Exceptionally high engagement involves the experience of flow [73], which is characterized by high concentration, loss of self-consciousness and sense of time, and a feeling of purposefulness; a desirable and motivating experience that is facilitated by users' sense of autonomy [31,49,95]. Although players' sense of control plays a central role in their motivation and engagement [24,81,84], it is nevertheless overlooked in cognitive games.
Work on cognitive games has made significant steps towards showing that cognition can be measured in more engaging and motivating game-like environments [36,61,104]. However, compared to cognitive tasks, cognitive games tend to show limited benefits to players' experience [e.g., 36,63,71,93,105], improved data quality in cognitive games has not yet been established, and common gamification techniques were not found to reduce attrition in longitudinal cognitive testing [62].
Gamification techniques typically use game elements such as points, leaderboards, badges, improved visuals, and narratives [63,85,105] to improve aesthetic appeal while giving players a sense of competence, reward, and meaning. However, most gamification techniques do not target players' sense of control [85] and can sometimes undermine it by overemphasizing outcomes [9,24]. An exception is avatar customization, which gives players control over the appearance of their in-game avatar and has been shown to enhance players' autonomy, enjoyment, and engagement [7,8]. Since players' sense of control is central to their experience, giving players more control should, as shown by studies of avatar customization, meaningfully improve their engagement and motivation.
Since the experiential benefits of cognitive games are severely limited by their tightly controlled and task-like gameplay mechanics, there is a need to reconsider how game-based cognitive assessments are designed. Players' experiences should benefit from replacing constrained task-like mechanics with more open and game-like gameplay mechanics. To achieve this, we argue that cognitive games should stop imitating cognitive tasks and instead use cognitive tasks as inspiration for gameplay mechanics. However, shifting towards more game-like mechanics would require cognitive games to relinquish considerable experimental control over the interaction, which raises serious concerns. Can such games enable valid and reliable cognitive assessment? If so, will the games still meaningfully improve players' experiences? These concerns call for a proof-of-principle for the viability and benefits of cognitive games that facilitate players' control at a cost to experimental control. To provide this proof-of-principle, we developed Tunnel Runner, a cognitive game that measures inhibitory control and decision-making under uncertainty while allowing players to nearly continuously rotate their avatar.

Cognitive functions: inhibitory control and decision-making under uncertainty
We designed Tunnel Runner to measure inhibitory control and decision-making under uncertainty, drawing from cognitive challenges defined by the flanker, Simon, stop-signal, and Iowa gambling tasks. These tasks have been extensively used to study individual differences in cognitive function and psychopathology [5,26,91,92], and are recognized by the National Institute of Mental Health's Research Domain Criteria.

Inhibitory control.
Inhibitory control refers to the ability to prevent behavior, cognition, and affect from being captured by inappropriate habits and irrelevant stimuli [26]. Inhibitory control enables individuals to flexibly engage in goal-directed behaviors rather than being driven by inappropriate habits and irrelevant external stimuli. Inhibitory control is divided [26] into interference control, the shielding of behavior from irrelevant stimuli, and response inhibition, the suppression of potent and inappropriate behavioral responses. Interference control, a crucial part of selective attention, is often measured with the flanker and Simon tasks [26]. Flanker tasks [33] typically use a central stimulus, such as an arrow pointing to the left, surrounded by arrows that can match or mismatch the central stimulus by pointing in the opposite direction. Although participants are asked to respond only to the central stimulus, mismatches with the flanking arrows consistently increase participants' reaction times and the prevalence of response errors, a phenomenon known as the flanker effect [70]. In typical Simon tasks [88], by contrast, participants are asked to use their left arm in response to a red shape and their right arm in response to a green shape. However, in some trials, the shape appears on the side opposite to the arm that should be used, creating stimulus-response incompatibility. This interference increases response times and the prevalence of response errors, a phenomenon known as the Simon effect [80]. The magnitude of these conflict effects serves as an indicator of interference control.
Response inhibition, a crucial part of self-control, is commonly measured with the stop-signal task [58,59]. In a typical stop-signal task, participants are always presented with a go signal, such as an arrow pointing left or right, which is followed by a delayed stop-signal in up to 25% of trials [100]. Participants' stop-signal reaction time (SSRT), a measure of response inhibition, is then inferred based on the horse race model [58,103]. This model assumes that stop and go responses depend on a competition between two independent cognitive 'runners', a go runner and a stop runner, with the 'winner' determining whether a response will be performed or withheld.
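Under the horse race model, SSRT is commonly estimated with the integration method: the nth fastest go RT, with n set by the probability of responding despite a stop signal, approximates the finishing time of the stop runner, and subtracting the mean stop-signal delay yields SSRT. A minimal sketch in Python (the function name and input conventions are illustrative, not from the paper):

```python
import numpy as np

def ssrt_integration(go_rts, ssds, stop_responded):
    """Estimate stop-signal reaction time via the integration method.

    go_rts: reaction times (s) on go trials.
    ssds: stop-signal delays (s) used on stop trials.
    stop_responded: booleans, True where the player failed to stop.
    """
    p_respond = np.mean(stop_responded)            # P(respond | stop signal)
    go_sorted = np.sort(go_rts)                    # go RT distribution
    n = max(1, int(np.ceil(p_respond * len(go_sorted))))
    nth_rt = go_sorted[min(n, len(go_sorted)) - 1] # nth fastest go RT
    return nth_rt - np.mean(ssds)                  # SSRT = nth go RT - mean SSD
```

For example, if a player responds on half of the stop trials, the median go RT minus the mean SSD approximates their SSRT.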

Decision-making under uncertainty.
Decision-making under uncertainty refers to the ability to choose between alternative behavioral options based on their expected consequences and potential risks [91]. A key distinction in decision-making research pertains to the information given to participants about the uncertain consequences of different options, leading to two key categories [91]: decision-making under risk and decision-making under ambiguity. Decision-making under risk refers to situations in which decision outcomes are uncertain and their probabilities are known. Decision-making under ambiguity refers to situations in which outcomes are uncertain and their probabilities are unknown, where ambiguity is overcome by exploring the available decision space and learning from feedback.
The Iowa gambling task [5] is a staple of decision-making research and has been extensively used with both clinical and non-clinical populations [5,91,92]. A typical version of the Iowa gambling task [5] asks participants to repeatedly decide between four cards, with two being 'good cards', which result in a net gain in the long run, and the other two being 'bad cards', which result in a net loss in the long run. Picking a good card always results in a moderate reward (50 points). One good card leads to a loss of 50 points 50% of the time, whereas the other good card leads to a larger loss (250 points) 10% of the time. Picking a bad card always gives a larger immediate reward (100 points) than the good cards. However, one bad card gives a large loss (250 points) 50% of the time, whereas the other bad card results in an even larger loss (1250 points) 10% of the time. Thus, participants are expected to overcome ambiguity by learning from repeated feedback, at which point the task shifts toward decision-making under risk.
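The payoff structure above can be verified with a few lines of arithmetic: each good card nets +25 points per pick in expectation, and each bad card nets -25, despite the bad cards' larger immediate rewards:

```python
# Expected value per pick for each Iowa gambling task card type (points).
# Tuples: (guaranteed reward, loss probability, loss size).
cards = {
    "good, frequent loss": (50, 0.50, -50),
    "good, rare loss":     (50, 0.10, -250),
    "bad, frequent loss":  (100, 0.50, -250),
    "bad, rare loss":      (100, 0.10, -1250),
}
for name, (reward, p_loss, loss) in cards.items():
    print(f"{name}: EV = {reward + p_loss * loss:+.0f} points per pick")
# good cards: EV = +25 per pick; bad cards: EV = -25 per pick
```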

TUNNEL RUNNER
Tunnel Runner is an infinite runner game built in Unity 3D using the Unity Experimental Framework (UXF) [12], in which players move their avatars through a tunnel. We picked an infinite runner game for its simple yet flexible core gameplay mechanics, involving a player-controlled avatar constantly moving through a virtual space. Infinite runner games give players continuous control over their avatars' movements, are challenging and responsive, and provide clear and immediate feedback in terms of rewards and penalties. Furthermore, infinite runners create time constraints and allow the presentation of stimuli in a controlled manner, thereby facilitating the measurement of response time and accuracy under different conditions. Thus, infinite runner games provide enjoyable experiences that leave room for cognitive assessment [36].
Tunnel Runner is premised on the escape of five rats from a mad scientist's lab via a tunnel filled with traps and obstacles, which are used to present cognitive challenges to inhibitory control and decision-making under uncertainty that are inspired by flanker, Simon, stop-signal, and Iowa gambling tasks. As the rats run through the tunnel, they encounter obstacles represented as colored rings divided into six equally sized sections. Two sections, the rats' starting section at the bottom of the tunnel and the 'dead zone' at the top of the tunnel, are always colored gray, while each of the four remaining sections is given a different color in each trial. This design ensures that there is always an optimal way to respond to a trial. We chose a control scheme that should be easily accessible to a broad population of participants, where players control the rats' movement direction by pressing the A key to rotate them to the left and the L key to rotate them to the right. To encourage accurate first responses, a trial's first button press results in a slightly faster rotation than later button presses. The game gives players continuous control over the rats' rotation for more than 80% of playtime, with the explicit goal of getting the central rat through the one section that matches its color.
Whenever the central rat passes through the correct section, the player receives one point and extra points corresponding to their streak: the number of correct sections they passed in a row in preceding trials. Passing through a wrong section resets the streak count. The streak mechanic is meant to facilitate a sense of reward and to promote correct responses. Correct and incorrect trial outcomes result in different visual feedback to inform players of mistakes, provide a sense of responsiveness, and facilitate correct responses. Passing through an incorrect section causes a red X mark to appear at the screen's center, accompanied by a short vignette effect meant to resemble a blink from pain, while passing through a correct section results in a green plus mark that appears next to the streak count, representing the points earned in the trial.
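The scoring rule, as described, can be sketched in a few lines of Python (the function name and return convention are our own):

```python
def score_trial(correct, streak):
    """Streak scoring: a correct pass earns 1 point plus a bonus equal to
    the number of correct passes in a row in preceding trials; an error
    earns nothing and resets the streak. Returns (points, new_streak)."""
    if correct:
        return 1 + streak, streak + 1
    return 0, 0
```

For example, a fourth consecutive correct pass earns 1 base point plus a 3-point streak bonus, while a single error drops the streak back to zero.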

Trial types
3.1.1 Regular trials. The game's core gameplay consists of regular (non-conflict) trials, depicted at the top left of Figure 1. In regular trials, the central rat and the four flanker rats, which surround the central rat, are assigned the same color at 433 and 350 milliseconds, respectively, after the obstacle ring is first presented. The rats can only be rotated after color assignment, leaving the player with 1,317 milliseconds to ensure that they pass through the correct section of the obstacle before the next trial starts. While the time until color assignment is static, the time until obstacle collision is adapted so that passing through the correct section reduces the time between the next color assignment and obstacle collision, whereas passing through the incorrect section increases the time until the next collision. We designed the adaptation algorithm to facilitate success rates of about 80% per player, which we expected to be challenging and to encourage correct first responses without feeling unfair.
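One common way to implement such an adaptation is a weighted up-down staircase, which converges on a chosen success rate by making the easing step after an error larger than the hardening step after a success. The sketch below is illustrative only; the paper does not specify Tunnel Runner's actual step sizes or bounds, so all parameter values here are assumptions:

```python
def adapt_collision_time(time_ms, correct, step_ms=25.0,
                         target_acc=0.80, min_ms=600.0, max_ms=2000.0):
    """Weighted up-down staircase converging on ~80% success.

    A correct pass shortens the time from color assignment to obstacle
    collision (harder); an error lengthens it by a step 4x as large
    (easier), since a step ratio of p/(1-p) converges on success rate p.
    Step size and bounds are illustrative assumptions, not Tunnel
    Runner's actual settings."""
    if correct:
        time_ms -= step_ms
    else:
        time_ms += step_ms * target_acc / (1.0 - target_acc)
    return min(max(time_ms, min_ms), max_ms)
```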
We designed Tunnel Runner to have a single set of regular running trials that are compared against multiple sets of conflict trials, with the purpose of decreasing Tunnel Runner's repetitiveness while increasing its efficiency compared to typical cognitive tasks and games. We aimed to have three types of conflict trials that challenge players' inhibitory control and intuitively fit the game's gameplay and narrative without being too demanding or confusing. First, we noticed the possibility of surrounding a central target with similar-looking yet misleading flankers to create a challenge to interference control similar to flanker tasks' flanker effect. Furthermore, we saw the possibility of creating stimulus-response incompatibility by reversing the game's controls on some trials such that, as in a Simon task, responding to one side of the screen would require players to press a button on the opposite side of their keyboard. Lastly, we noted that we could challenge players' response inhibition with a delayed stop-signal. Thus, we developed mismatching flanker trials to create flanker-like interference, ice trials to create stimulus-response incompatibility, and lava trials that use a delayed stop-signal. These cognitive challenges were designed to measure players' interference control and response inhibition.
3.1.2 Mismatching flanker trials. Players are told that the flanker rats are trying to help but often go wrong and should be ignored. In 50% of trials, the flanker rats' color matches the color of the section opposite the correct section, creating mismatching flanker trials, which are inspired by flanker tasks and depicted at the top right of Figure 1. Based on studies of the flanker task, we specified a time difference of 83 milliseconds between the color assignments of the flankers and the central rat, with the aim of enhancing individual differences [50] by increasing the challenge's difficulty [68]. The algorithm adapting the time players have from color assignment to collision is as sensitive to regular mismatching flanker trials as it is to regular matching flanker trials.

3.1.3 Lava trials.
Players are told that most of the tunnel can be covered by lava, which hurts the rats on touch. This results in lava trials, inspired by stop-signal tasks and depicted at the bottom right of Figure 1, where the rats are surrounded by lava (serving as a stop-signal) after color assignment. Touching the lava, which can only be prevented by not moving the rats from the starting point, leads to a loss of points and resets one's streak. Whenever the rats enter the lava, players keep losing points until they rotate the rats back to the starting point. Lava trials are independent of the assignment of flanker colors, so the colors of the central and flanker rats are equally likely to match or mismatch in lava trials. In the first lava trial, the stop-signal appears at a delay of 300 milliseconds after the central rat is assigned a color, following which the stop-signal delay (SSD) is adapted based on the player's performance. The SSD in matching flanker trials is adapted independently of the SSD in mismatching flanker trials. As in typical stop-signal tasks [100], moving the rats after the stop signal results in lava appearing 50 milliseconds earlier, giving players more time for response inhibition, whereas successful inhibition causes the lava to appear 50 milliseconds later, giving players less time for inhibition. This adaptation should lead players to avoid responding in around 50% of lava trials, which is desirable for inferring SSRT [57,100].
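This 1-up-1-down tracking rule, as described, amounts to the following (a minimal sketch; the function name and the lower bound of 0 ms are our own assumptions):

```python
def adapt_ssd(ssd_ms, inhibited, step_ms=50.0):
    """1-up-1-down stop-signal delay tracking: a successful stop pushes the
    stop signal (lava) 50 ms later (inhibition harder), and a failed stop
    pulls it 50 ms earlier (inhibition easier). This converges on ~50%
    successful inhibition. The lower bound of 0 ms is an assumption."""
    ssd_ms = ssd_ms + step_ms if inhibited else ssd_ms - step_ms
    return max(ssd_ms, 0.0)
```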

3.1.4 Ice trials.
Players are told that the tunnel can be covered by ice, which reverses the rats' rotation. During ice trials, inspired by the stimulus-response incompatibility effect in Simon tasks and depicted at the bottom left of Figure 1, the tunnel is filled with ice once the rats are presented with a new obstacle. The ice reverses the key mapping so that pressing A moves the rats to the right, whereas pressing L moves them to the left. This response reversal is meant to create stimulus-response incompatibility, as reaching the section on one side of the screen requires using a button on the opposite side of the keyboard. However, this challenge significantly deviates from typical stimulus-response incompatibility measures such as Simon tasks, since spatial features determine the correct choices in Tunnel Runner, whereas they are irrelevant in Simon tasks. The presentation of ice trials is independent of flanker trials, such that matching and mismatching flanker trials are equally spread across ice trials. On rare occasions, ice trials overlap with lava trials to create a sense that stop signals might occur during ice trials and, therefore, ensure that ice trials are comparable to regular trials, as lava trials may otherwise selectively affect response caution in non-ice trials.
3.1.5 Snacking trials. Tunnel Runner's fast pace and varied cognitive challenges make it highly demanding for players' inhibitory control, so we wanted players to take relatively long breaks from the game's running trials without wasting their time. Thus, we used these breaks to implement a slower cognitive challenge that is less demanding for players' inhibitory control. Consequently, during breaks from Tunnel Runner's running trials, a challenge to players' decision-making under uncertainty is presented as a series of snacking choices whose consequences are structured similarly to Iowa gambling tasks.
During snacking trials, depicted in Figure 2, players are told that they discovered a hidden room filled with boxes, and are repeatedly asked to decide between four boxes by pressing a number that corresponds to the box of their choice. The selected box can reveal one or two slices of melon, which rewards players with 5 or 10 points, respectively. Furthermore, the chosen box sometimes reveals 1, 5, or 25 traps, corresponding to a loss of 5, 25, or 125 points, respectively. Boxes 1-2 always reward 10 points, but cost 25 points on 50% of trials, or 125 points on 10% of trials; boxes 3-4 always reward 5 points, but cost 5 points on 50% of trials, or 25 points on 10% of trials. This creates a decision-making under uncertainty scenario where, as in Iowa gambling tasks, choices with larger immediate rewards are detrimental in the long run.

Differences between Tunnel Runner and standard cognitive tasks
The game's running trials require participants to match two stimuli, the obstacle and the central rat, instead of responding to a single stimulus as is typical in cognitive tasks. This increased complexity might affect participants' response times. Furthermore, Tunnel Runner allows participants to correct mistaken first movements, which may reduce response caution in players' first responses. Tunnel Runner is also about twice as long as traditional flanker, stop-signal, and Simon tasks and combines several cognitive challenges into a single game. Thus, the game could be more effortful and frustrating to complete. Lastly, players' decisions in the game's decision-making challenge influence a pool of points they worked hard to obtain, unlike the initial batch of points freely given in the Iowa gambling task. This difference in framing might influence decision-making patterns.

METHOD
Tunnel Runner's gameplay mechanics significantly differ from, and use less experimental control than, the cognitive tasks that inspired the game's cognitive challenges, making it far from certain that the game can provide good cognitive measurements. Furthermore, it is unclear whether the experiential benefits of letting players continuously rotate the avatars' position can outweigh the high effort the game demands from players. To address these concerns and meet our research goals, we empirically evaluated whether the game's cognitive challenges promote appropriate response patterns (RG1) and provide reliable measurements (RG3) that validate against other relevant measures (RG2). Furthermore, we compared players' experiences of the game with their experiences of flanker and stop-signal tasks (RG4).

Behavioral data collection
The assessment of the reliability and validity of the game's cognitive measurements required the collection of behavioral data from the game and from relevant cognitive tasks. The data enabled us to assess whether players' responses to the cognitive challenges met key expectations needed for internal validation, correlated with relevant task-based or questionnaire-based measurements as needed for external validation, and showed good measurement reliability.
4.1.1 Game-based measurements. Following 40 practice trials, participants completed 468 running trials (252 regular trials, 120 ice trials, 84 lava trials, and 12 ice-lava trials). All trials were equally divided across mismatching and matching flanker conditions, per 4 blocks, and per correct response direction. The timing and nature of responses were measured per trial. Relatively few trials were used per measurement, as pilot studies showed that this amount struck an optimal balance between measurement reliability, time efficiency, and player experience. 120 snacking trials were equally spread over three breaks from the game's main gameplay loop. We designed the game to take about 25 minutes. Players' response patterns across the different trial types were used for the following cognitive measurements:
• Flanker effects on first-response accuracy and reaction time (RT) of correct responses were calculated as the differences in accuracy and RT between matching and mismatching flanker trials, whereas balanced integration scores (BIS) were calculated as the difference between z-transformed accuracy and RT conflict effects [52,53].
• Similarly, ice effects were calculated as the differences in accuracy and RT, and their BIS, between ice and non-ice trials.
• SSRTs were calculated by applying the integration method [57,100] to player responses, separately for non-ice matching and mismatching trials.
• Players' decision-making under uncertainty was measured as the prevalence of good choices in snacking trials after the first 40 trials, which represent an initial exploration phase [5] in Iowa gambling tasks and are therefore excluded when assessing the prevalence of good choices [89,90].
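To make the BIS computation concrete, here is a minimal stdlib-only Python sketch of one plausible reading of the description above: each participant's accuracy and RT conflict effects are z-standardized across the sample and then subtracted [52,53]. The function names, toy data, and the exact directional coding of each effect are illustrative assumptions, not the study's code.

```python
from statistics import mean, stdev

def zscores(values):
    """z-standardize a list of per-participant effects."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def balanced_integration_scores(acc_effects, rt_effects):
    """BIS = z(accuracy conflict effect) - z(RT conflict effect)."""
    return [a - r for a, r in zip(zscores(acc_effects), zscores(rt_effects))]

# Toy per-participant conflict effects: accuracy drop and RT cost (ms).
acc_effects = [0.05, 0.10, 0.02, 0.08]
rt_effects = [30.0, 55.0, 20.0, 45.0]
bis = balanced_integration_scores(acc_effects, rt_effects)
```

Because both components are z-standardized across the sample, the resulting BIS values are centered on zero by construction.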
4.1.2 Flanker task. We built a traditional flanker task [16,33,46] in Unity and UXF [12]. As depicted in Figure 3, the task asked participants to press the A key when a central arrow pointed left and the L key when the arrow pointed right. The central arrow was always flanked by 4 arrows that were equally likely to match or mismatch its direction. Participants had 1 second to respond, with brief feedback depending on the response or lack thereof, leading to the next trial. We measured the nature and timing of each trial's first response, and measured flanker effects as the differences in accuracy and RT between matching and mismatching flanker trials, and their BIS. The task took 12 minutes and consisted of 40 practice trials and 416 test trials whose conditions and correct response directions were equally divided into 4 blocks.
4.1.3 Stop-signal task. We built a stop-signal task based on best practices [100] in Unity and UXF [12]. As shown in Figure 4, participants were asked to press the A key when a central green arrow pointed left and the L key when it pointed right, and to avoid responding if the arrow turned red. The stop signal appeared after the trial began in 25% of trials, starting with a stop-signal delay of 250 ms. Failed stops decreased the stop-signal delay by 50 ms, whereas successful stops increased it by 50 ms. Responses were expected within 1 second after the go signal, resulting in brief feedback depending on the response or lack thereof, followed by the next trial. We measured the timing of participants' first responses, which we used to calculate SSRTs with the integration method. The task took 12 minutes and comprised 40 practice trials and 400 test trials, including 100 stop-signal trials equally divided across 4 blocks.
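The SSRT estimation referenced above can be sketched as follows. This is a hedged, simplified illustration of the integration method [57,100]: the nth fastest go RT, with n set by the probability of responding despite a stop signal, minus the mean stop-signal delay. It omits refinements from the best-practice guidelines, such as replacing go omissions with the maximum RT; all names and data are illustrative.

```python
def ssrt_integration(go_rts, p_respond_given_signal, mean_ssd):
    """Integration-method SSRT: nth fastest go RT minus mean SSD."""
    ordered = sorted(go_rts)
    n = int(round(p_respond_given_signal * len(ordered)))
    n = min(max(n, 1), len(ordered))  # guard against degenerate inputs
    return ordered[n - 1] - mean_ssd

# Toy data: 50% failed stops and a mean stop-signal delay of 250 ms.
go_rts = [420, 380, 510, 460, 440, 400, 470, 430]
ssrt = ssrt_integration(go_rts, 0.5, 250)
```

With a tracking staircase that converges on a 50% stop rate, n lands near the median of the go-RT distribution, which is why the staircase design simplifies SSRT estimation.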

Questionnaire data collection
We aimed to compare players' experiences of the game and the tasks, with a focus on motivation, engagement, and autonomy. Our use of repeating questions that followed multiple cognitively demanding activities could have created data quality problems [6,69], which would have led to data loss when combined with our use of multiple attention checks. We tried to mitigate these concerns by using a minimal number of well-validated scales that could enable us to holistically assess participants' experiences.
We measured players' experience using 12 questions from the Player Experience Inventory (PXI) [2] equally divided across four scales (autonomy, curiosity, meaning, and enjoyment), allowing for seven response options, ranging from strongly disagree to strongly agree, including a neutral option. We emphasized players' enjoyment, curiosity, and sense of meaning, as these are deeply related to intrinsic motivation [83]. Furthermore, we treated the PXI's measurement of autonomy as a reflection of players' sense of control, since a sense of control is crucial for autonomy [22,47,81].
For the assessment of engagement, we used the User Engagement Scale Short Form (UESSF) [77], which has 12 questions evenly divided across four subscales (focused attention, usability, aesthetic appeal, and reward), allowing for five response options, ranging from strongly disagree to strongly agree, including a neutral option.
We measured players' task load, specifically perceived effort and frustration, using two questions from the NASA TLX [40], which were answered with a visual analog scale ranging from 1 (very low) to 21 (very high).
To avoid inducing biases in responses, all experience-related questions referred to the "last activity", as opposed to a game or a task.
To validate the game's measure of decision-making under uncertainty, we measured two correlates of the Iowa gambling task: cognitive reflection [89,90] and lack of premeditation [106]. Participants answered four questions from the short UPPS-P Impulsive Behavior Scale [18] measuring lack of premeditation with four response options, ranging from strongly disagree to strongly agree, excluding a neutral option. Participants also completed the 7-item cognitive reflection test [94], which presents 7 open questions whose answers are coded as correct or incorrect.

Procedure
We conducted two online studies to quantitatively assess Tunnel Runner players' experience, measurement validity, and reliability, and to compare Tunnel Runner's player experience and cognitive measurements with standard cognitive tasks. Despite the increased data quality issues observed in online studies compared to lab studies [36,50,105], we conducted online studies to reach a more diverse population and to assess Tunnel Runner's ability to measure cognition in real-life, rather than controlled, environments. Our institution's ethical review board approved both studies.
Before filling out the consent form, participants completed a brief performance check to verify that Tunnel Runner could be displayed at 50 frames per second or more, ensuring a good game experience and precise measurement. After consent, participants provided demographic information and completed measures of cognitive reflection and lack of premeditation. Participants then completed Tunnel Runner, which always preceded the cognitive tasks to prevent exhaustion, demotivation, and training effects due to previous cognitive tasks, which could threaten the evaluation of the game's cognitive measurements. After completing Tunnel Runner, participants were asked to answer 26 questions from the UESSF, PXI, and NASA TLX. The questionnaires were followed by the flanker task in study 1 and the stop-signal task in study 2. After completing a cognitive task, participants were asked to answer the same questions regarding their experience, engagement, effort, and frustration during the task.
Having players complete Tunnel Runner prior to the cognitive tasks is important for the evaluation of the game's measurements; however, the ordering may influence comparisons between participants' experiences of the game and cognitive tasks. Another study comparing a cognitive game with a cognitive task [36] did not find ordering to be consequential for relevant experiential measures, except for concentration, which was reduced for the second activity. Thus, participants' focused attention in the cognitive tasks might have been reduced due to ordering.

Sample description
Both studies were conducted via CloudResearch [55], recruiting only CloudResearch-approved Mechanical Turk users from the USA with at least 95% approval rate on at least 1,000 human intelligence tasks.These criteria were meant to avoid bots and ensure a high-quality participant pool [55].Participants' characteristics are described in Table 1.
As is common with behavioral tasks [36,43,50,108], we performed filtering at both the trial and individual levels. We created a scoring system to detect whether players misunderstood the game or did not take it seriously, in which specific response patterns incur one or two points, with two or more points leading to participant-level exclusion of behavioral data. The following criteria led to a participant's immediate exclusion, as they rendered their data invalid: average frames-per-second below 35 (study 1: 2 participants, study 2: 1 participant); first-response accuracy no more than 3 standard errors away from 0.5 (11, 18); and failure to respond in more than 10% of non-lava trials (7, 2). Meeting at least 2 of the following criteria also led to exclusion, as they imply that participants were inattentive or did not understand the game, overemphasized or ignored aspects of the game, or provided unreliable data: stopping rate above 0.7 or below 0.3 in lava trials (11, 8); correcting mistaken first movements in less than 30% of opportunities in non-lava trials (18, 13); correcting first movements in less than 30% of opportunities in lava trials (11, 9); failing to respond in more than 3% of non-lava trials (18, 17); and average frames-per-second below 45 (2, 2). This led to the loss of 22 participants in study 1 and 31 in study 2, whose responses were unlikely to be comparable to those of the remaining participants.
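The two-tier exclusion logic described above can be sketched as follows: hard criteria trigger immediate exclusion, while softer criteria accumulate points and two or more points lead to exclusion. Thresholds follow the text, but the field names are illustrative, each soft criterion here adds a single point, and the chance-level accuracy check is omitted for brevity.

```python
def should_exclude(p):
    """Return True if a participant's game data should be dropped."""
    # Hard criteria: immediate exclusion, data rendered invalid.
    # (The chance-level first-response accuracy check is omitted here.)
    if p["mean_fps"] < 35:
        return True
    if p["nonresponse_rate_nonlava"] > 0.10:
        return True

    # Soft criteria: each adds a point; two or more points exclude.
    points = 0
    if not 0.3 <= p["lava_stop_rate"] <= 0.7:
        points += 1
    if p["correction_rate_nonlava"] < 0.30:
        points += 1
    if p["correction_rate_lava"] < 0.30:
        points += 1
    if p["nonresponse_rate_nonlava"] > 0.03:
        points += 1
    if p["mean_fps"] < 45:
        points += 1
    return points >= 2

attentive = {"mean_fps": 60, "nonresponse_rate_nonlava": 0.01,
             "lava_stop_rate": 0.5, "correction_rate_nonlava": 0.6,
             "correction_rate_lava": 0.5}
suspect = {"mean_fps": 40, "nonresponse_rate_nonlava": 0.05,
           "lava_stop_rate": 0.8, "correction_rate_nonlava": 0.6,
           "correction_rate_lava": 0.5}
```

Here `suspect` passes every hard criterion yet accumulates three soft points (excessive stopping, frequent non-responses, low frame rate), so it would be excluded.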
We applied similar participant-level filtering in the flanker and stop-signal tasks. In study 1, behavioral data from 9 participants who completed the flanker task were removed for low accuracy or for too often failing to respond. In study 2, 10 participants who completed the stop-signal task were removed for low response accuracy, too often failing to respond on go trials, or too rare or too frequent inhibition of responses on stop-signal trials.
Our analyses of the game's flanker and ice challenges excluded trials with RTs lower than 300 ms or higher than 1,500 ms after color assignment, and trials more than 3 standard deviations away from a participant's mean per condition when RTs were the dependent variable. No trial-level filtering was applied to the game's stop-signal and decision-making trials. Similarly, our analyses of the flanker task ignored trials with RTs lower than 250 ms or higher than 1,000 ms after the arrow appeared, and trials more than 3 standard deviations from a participant's mean per condition when RTs were the dependent variable. No trial-level filtering was applied in the stop-signal task. Whenever a study's task-based measurements were directly compared to the game, we only considered participants who passed both game-based and task-based filters.
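A minimal sketch of the two-stage trial filter described above, assuming the absolute RT bounds are applied first and the 3-standard-deviation cut second, within one participant and condition. The bounds match the game analyses (300-1,500 ms); the function and data are illustrative.

```python
from statistics import mean, stdev

def filter_rts(rts, low=300, high=1500, sd_cut=3.0):
    """Keep RTs within absolute bounds and within sd_cut SDs of the mean."""
    in_bounds = [rt for rt in rts if low <= rt <= high]
    if len(in_bounds) < 2:
        return in_bounds
    m, s = mean(in_bounds), stdev(in_bounds)
    return [rt for rt in in_bounds if abs(rt - m) <= sd_cut * s]

# Toy data: an anticipation, a timeout, and one extreme in-bounds RT.
rts = [450] * 20 + [1490, 200, 1600]
kept = filter_rts(rts)
```

The 200 ms and 1,600 ms trials fall outside the absolute bounds, while the 1,490 ms trial survives them but is removed by the per-condition 3-SD cut.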
Three attention checks were used in the questionnaires, with failure in at least one check or nonsensical responses to the cognitive reflection test leading to the exclusion of a participant's questionnaire data.This approach led to the exclusion of 6 participants in study 1, and 9 in study 2. When possible, participants' behavioral data were analyzed regardless of attention checks, and self-report data were likewise analyzed regardless of participants' behavioral data.

Data analytic approach
We based statistical significance on two-sided hypothesis tests (α = .05; 95% CI) unless we had a clear directional expectation, in which case we used one-sided tests (90% CI). Since the validation of Tunnel Runner's cognitive measurements required positive results across multiple tests, false negatives could pose a major problem. Minimizing the risk of false negatives requires very high statistical power, which one-sided tests help achieve while maintaining the standard .05 false-positive rate.
We fit hierarchical regression models using the R packages lme4 [3] for model fitting alongside lmerTest [51] for hypothesis testing. We used hierarchical regression models [86] to account for interactions between cognitive challenges, i.e., overlaps between ice and flanker challenges, and the clustering of responses at the level of an individual participant. When estimating individual differences in RT, accuracy, and decision quality, we used (joint) maximally specified Bayesian hierarchical regression models. The models allowed many parameters (such as variance) to differ as a function of trial features and the individual respondent. By thoroughly modeling response patterns, Bayesian hierarchical regression models can considerably improve the estimation of individual differences and their correlates in cognitive tasks [15,38,39,56,82], and allow the estimation of measurement reliability directly from the models' posterior distributions.
Individual differences involving binary outcomes, e.g., accuracy and decision-making quality, were estimated with hierarchical Bayesian logistic regression; continuous outcomes, e.g., RT, were estimated with Gaussian or ex-Gaussian models. SSRTs were estimated via the integration method [57,100]. Model selection depended on the models' convergence and goodness-of-fit as assessed by the widely applicable information criterion [99] and visual posterior predictive checks. Ex-Gaussian regression models [42], which model responses as the sum of a Gaussian and an exponentially distributed component, were used because they have been fruitfully applied to the analysis of conflict tasks such as flanker [14] and Simon tasks [64]. Although our RT models allowed several parameters, such as the Gaussian variance and the exponential rate, to vary per trial type and individual, we focus our reporting on individual differences in the Gaussian central tendency (mean) parameters, as these are most often used to assess cognitive functioning. All Bayesian models were fit with the R package brms [13] using weakly informative priors.
Our methods for estimating reliability differed per measurement. For self-report scales, we estimated reliability using McDonald's ω [30], a more general and robust form of Cronbach's α [41]. Individual differences assessed with Bayesian models had their reliability estimated as the intraclass correlation coefficient obtained from the posterior distributions of relevant random effects [54]. When neither Bayesian modeling nor self-reports were used, we estimated reliability using an odd-even split-half procedure [29].
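The odd-even split-half procedure can be sketched as follows: per-participant scores are computed separately from odd and even trials, correlated across participants, and stepped up with the Spearman-Brown formula. The cited procedure [29] may differ in details such as the correction applied, so treat this stdlib-only Python sketch, with illustrative names and toy data, as one common reading.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation, stdlib-only."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def split_half_reliability(odd_scores, even_scores):
    """Odd-even split-half reliability with Spearman-Brown step-up."""
    r = pearson(odd_scores, even_scores)
    return 2 * r / (1 + r)

# Toy per-participant effects computed from odd vs. even trials.
odd = [0.1, 0.3, 0.2, 0.5, 0.4]
even = [0.15, 0.28, 0.22, 0.45, 0.38]
rel = split_half_reliability(odd, even)
```

The Spearman-Brown step-up compensates for each half containing only half the trials of the full measurement.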

Statistical expectations
Our studies were designed to compare players' experience playing Tunnel Runner to their experience completing a cognitive task, and to assess the reliability and validity of the game's cognitive measurements. The evaluation of players' experience and the measurements' reliability was exploratory, while the assessment of the validity of the game's measurements was driven by specific statistical expectations informed by prior literature.
To establish internal validity, game-based flanker and ice challenges should moderately increase RT and error rates [33,97], with the flanker effect having an initially positive delta function slope [97], i.e., a larger effect on RT as responses take longer, whereas the ice effect should have a negative delta function slope [97]. In lava trials, participants should inhibit responses in about 50% of trials [100], and their SSRTs should be slightly longer when lava trials are combined with mismatching flankers compared to SSRTs calculated from trials with matching flankers [101,102]. To meet the expectation that stop signals do not interfere with response processes, responses in lava trials should be faster than responses in comparable go trials and show a flanker effect on RT [100]. Lastly, in snacking trials, players' choices should show an overall preference for infrequent penalties [92], and the prevalence of good decisions should increase as the game progresses [92].
We only examined a measurement's external validity if it first met our internal validity benchmarks. Participants completed flanker and stop-signal tasks after playing Tunnel Runner, and their performance on these tasks was compared with and correlated to relevant game-based measurements. We expected positive correlations between game-based and task-based flanker effects on RT and error rates and their BIS, and between task-based and game-based SSRTs. Since asking participants to complete the Iowa gambling task after the game is likely to be affected by a transfer of learning from the game to the task, we instead validated the game's decision-making measurement by correlating it with two correlates of the Iowa gambling task: cognitive reflection [89,90] and lack of premeditation [106].

User experience
We compared players' game experiences against their experiences of cognitive tasks, with an emphasis on autonomy.We tested differences in engagement and experience metrics using two-sided paired samples t-tests.These comparisons are described in Table 2, while the measures' internal consistencies, which were generally good, are described in Table 5 in the appendix.
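As a minimal illustration of the comparison procedure, the following stdlib-only sketch computes a paired samples t statistic and a Cohen's d standardized by the SD of the paired differences. The paper does not specify its d variant, so that choice is an assumption, and all data are toy values.

```python
from statistics import mean, stdev
from math import sqrt

def paired_t_and_d(first, second):
    """Paired samples t statistic and Cohen's d on difference scores."""
    diffs = [a - b for a, b in zip(first, second)]
    d_mean, d_sd = mean(diffs), stdev(diffs)
    t = d_mean / (d_sd / sqrt(len(diffs)))
    cohens_d = d_mean / d_sd
    return t, cohens_d

# Toy data: the same participants rated the game and a task.
game = [5, 6, 7, 4, 6]
task = [3, 4, 5, 3, 4]
t_stat, effect = paired_t_and_d(game, task)
```

Pairing by participant removes between-person variance from the comparison, which is why the paired test is appropriate for the within-subjects design used here.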
All engagement and experience metrics, except usability, effort, and frustration, were better in the game than in the flanker and stop-signal tasks (p < .001 for each comparison), with large Cohen's d effect sizes that ranged from 0.85 to 1.51. Players reported much greater autonomy, curiosity, meaning, enjoyment, reward, aesthetic appeal, and focused attention in the game compared to the tasks. Usability was higher in the flanker task than in the game (p < .001), yet marginally higher in the game than in the stop-signal task (p = .056). Players reported higher frustration in the game than in the flanker task (p = .025), yet less frustration in the game than in the stop-signal task (p = .002). Players reported that the game was more effortful than both the flanker task (p < .001) and the stop-signal task (p = .008). Overall, players' experiences were substantially improved in the game compared to the tasks, although the game was more effortful than both tasks and more frustrating and less usable than the flanker task.
Examination of the distribution of autonomy scores, visualized in Figure 5, elucidates the differences between the game and the tasks. The large difference between the game and the tasks (d = 0.97 in both studies) was driven by low task autonomy, as 87% and 82% of answers to the three autonomy-related questions were unfavorable toward the flanker and stop-signal tasks, respectively, while 49% of responses were unfavorable toward the game. Since autonomy is driven by one's sense of control [22,47,81], our results suggest that players tended to experience a much greater, if still limited, sense of control during the game compared to the tasks.

Interference control
We evaluated Tunnel Runner's ability to measure inhibitory control and decision-making under uncertainty. For each cognitive construct, we evaluated its internal validity, i.e., whether the game's challenges evoked theoretically relevant behavioral patterns; its external validity, i.e., whether the game's measures correlated with relevant behavioral or self-report measures; and the reliability of the game's measures.

Internal validation.
To establish internal validity, game-based flanker and ice effects should show increased RT and error rates [33,97]. The flanker effect should show a positive initial delta function slope, i.e., a larger effect on RT as responses take longer [97]. In contrast, the ice effect should show a Simon-like negative delta function slope, i.e., a smaller effect on RT as responses take longer [97]. To assess the effects on RT and error rates per study, we fit Gaussian and logistic hierarchical regressions with random and fixed intercepts, alongside fixed ice effect, flanker effect, and ice-flanker interaction terms. Due to the expectation of increased RT and error rates in flanker and ice trials, we used one-sided hypothesis tests.
We calculated the delta functions [20,65] of the game's flanker and ice effects, expecting the flanker effect to have an initially positively sloped delta, and the ice effect to have a mostly negatively sloped delta. Delta plots, shown in Figure 6, suggested that the delta of the flanker effect in non-ice trials was indeed initially positively sloped, whereas the ice effect's delta was, unexpectedly, positively sloped for most of the RT distribution. This implies that the conflicts evoked by ice trials had different temporal dynamics than those evoked by Simon tasks. Overall, the game's flanker effect matched all our expectations, whereas the game's ice effect failed to meet our expectations. As such, only the flanker challenge received further validation.

Table 2: Means and standard deviations alongside Cohen's d effect sizes and p-values of differences between the game and the tasks on self-report questionnaires. p-values were calculated using two-sided paired samples t-tests. PXI stands for the Player Experience Inventory, UES-SF for the User Engagement Scale short form, and NASA-TLX for the NASA Task Load Index. * Significant difference.
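A delta function of the kind examined above can be computed by binning each condition's RTs into quantile bins and pairing each bin's mean RT with the condition difference; the sign of the early-bin slope is what the internal-validity check inspects. The stdlib-only sketch below makes simplifying assumptions (per-condition quantile bins rather than per-participant vincentization); names and data are illustrative.

```python
from statistics import quantiles, mean

def delta_points(congruent_rts, incongruent_rts, n_bins=4):
    """Pair each quantile bin's mean RT with the condition difference."""
    def bin_means(rts):
        cuts = quantiles(rts, n=n_bins)  # n_bins - 1 cut points
        bins = [[] for _ in range(n_bins)]
        for rt in rts:
            bins[sum(rt > c for c in cuts)].append(rt)
        return [mean(b) for b in bins]
    con, inc = bin_means(congruent_rts), bin_means(incongruent_rts)
    # x: overall bin mean RT; y: effect (incongruent minus congruent).
    return [((c + i) / 2, i - c) for c, i in zip(con, inc)]

# Toy data with a constant 30 ms effect across the RT distribution.
con = [400, 410, 420, 430, 440, 450, 460, 470]
inc = [rt + 30 for rt in con]
pts = delta_points(con, inc)
```

A positive slope across the resulting points means the conflict effect grows for slower responses (flanker-like), while a negative slope means it shrinks (Simon-like); the toy data above yield a flat delta.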

Reliability.
We assessed the reliability of ice and flanker effects on participants' RT, accuracy, and BIS. Because ex-Gaussian models of RT showed the best fit across studies, tasks, and conditions, individual differences were examined via maximally specified hierarchical ex-Gaussian and logistic models of RT and accuracy, which were fitted separately per game-based and task-based measures and per study. In both studies, game-based measures showed acceptable-to-good reliability of the flanker effect on RT (study 1: .75, study 2: .79), accuracy (study 1: .77, study 2: .63), and their BIS (study 1: .80, study 2: .76). The task-based flanker effect, in contrast, showed excellent reliability for RT (.92), but poor-to-acceptable reliability for accuracy (.45) and BIS (.64). Lastly, both studies' ice effects showed good-to-excellent reliability for RT (study 1: .84, study 2: .81), accuracy (study 1: .91, study 2: .90), and their BIS (study 1: .89, study 2: .86).

External validation.
To examine differences in response patterns, we jointly modeled game-based and task-based flanker effects on accuracy and RT with hierarchical logistic and Gaussian models with fixed and random intercepts alongside fixed game effect, flanker effect, and game-flanker interaction terms. We used two-sided tests due to a lack of expectations. The models showed that participants typically took 303 ms (95% CI: 300 - 306, p < .001) longer to respond in the game's regular trials than the 472 ms taken on average in the task's regular trials, while the task's flanker effect on RT was 8 ms (95% CI: 3 - 12, p < .001) larger than the game's flanker effect. Furthermore, mean accuracy in the game's regular trials was .86, which was lower than the task's .99 (OR = 0.07, 95% CI: 0.06 - 0.08, p < .001). Although the accuracy reduction of .08 in the game's flanker trials was not smaller than the task's .07 in absolute terms, the game's accuracy reduction was weaker in relative terms (OR = 5.26, 95% CI: 4.50 - 6.16, p < .001). Altogether, responses tended to be slower and less accurate in the game than in the task, suggesting that game trials were more difficult than task trials.
We assessed correlations between game-based and task-based flanker effects by correlating mean estimated individual-level flanker effects obtained from fully specified joint Bayesian hierarchical ex-Gaussian and logistic models. One-sided tests were used due to clear directional expectations and showed that game-based and task-based flanker effects on RT were weakly correlated (r = 0.18, 90% CI: 0.01 - 0.34, p = .045), game-based and task-based flanker effects on accuracy were strongly correlated (r = 0.78, 90% CI: 0.70 - 0.84, p < .001), a correlation that remained after the removal of two outliers with high leverage (r = 0.68, 90% CI: 0.56 - 0.77, p < .001), and the task's and game's BIS were moderately correlated (r = 0.46, 90% CI: 0.31 - 0.58, p < .001). Overall, while the game evoked different response patterns than the flanker task, the game's flanker challenge validly and reliably measured interference control.

Internal validation.
To establish the internal validity of the game's lava trials, participants were expected to inhibit responses in close to 50% of trials [100], and SSRTs were expected to be slightly longer when calculated from trials with mismatching flankers than from trials with matching flankers [101,102]. Furthermore, to meet the expectation that stop signals did not interfere with response processes, we expected responses in lava trials to be faster than responses in comparable non-lava trials and to show a flanker effect on RT [100].

Reliability.
We calculated the reliability of participants' game-based SSRT in three partitions: using only mismatching flanker trials, using only matching flanker trials, and by averaging the z-transformed SSRTs calculated per trial type. Averaged SSRT showed good-to-excellent reliability of .92 in study 1 and .83 in study 2, making it the game's most reliable SSRT measurement and, therefore, the one considered for external validation. Task-based SSRT, measured in study 2, had good reliability of .84.

External validation.
To facilitate comparison with task-based SSRT, we re-scaled the game's averaged SSRT to the distribution of game-based SSRT with matching flanker trials. A two-sided paired samples t-test comparing the two SSRT measurements showed that rescaled game-based SSRTs were longer than task-based SSRTs (m = 36, 95% CI: 26 - 47, p < .001). Nevertheless, game-based and task-based SSRTs showed a correlation of .53 (90% CI: .38 - .65, p < .001), suggesting that while the game's lava trials evoked longer SSRTs than the stop-signal task, they validly and reliably measured response inhibition.
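The rescaling step can be sketched as mapping z-scores onto the mean and standard deviation of the target distribution. This is a minimal sketch, assuming a simple linear (mean/SD) rescaling, with illustrative names and toy data.

```python
from statistics import mean, stdev

def rescale(z_scores, target):
    """Map z-scores onto the mean/SD of a target distribution."""
    m, s = mean(target), stdev(target)
    return [z * s + m for z in z_scores]

# Toy target distribution with mean 250 and SD 50.
matching_ssrts = [200, 250, 300]
rescaled = rescale([0.0, 1.0, -1.0], matching_ssrts)
```

After this mapping, the averaged SSRT shares the units and spread of the matching-trial SSRT distribution, making paired comparisons with the task-based SSRT interpretable in milliseconds.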

Decision-making under uncertainty
We only assessed study 2's decision-making measurement, as too many of study 1's players rarely made bad decisions after 40 trials, leading us to alter study 2's snacking trials.The decision-making pattern observed in study 1 limited the measurement's ability to distinguish between players and went against the response patterns observed in studies of decision-making under uncertainty [92].Changes to study 2's decision-making challenge resulted in a desirable change in response tendencies, as shown in Figure 7.

Internal validation.
To meet expectations, players' choices in gambling trials should show a preference for options with infrequent penalties [92], while the prevalence of good decisions with better long-term consequences should increase as the game progresses [92]. To examine whether participants preferred infrequent penalties, we fit a hierarchical logistic regression model with fixed and random intercept on all gambling choices divided into choices of options with frequent or infrequent penalties. As expected, players showed a preference for options with infrequent penalties (OR = 1.47, 90% CI: 1.24 - 1.74, p < .001). To assess learning over time, we added trial number divided by 10 (to facilitate model convergence) as a fixed predictor to the hierarchical logistic regression. This model showed that the prevalence of good choices increased per increment of 10 trials (OR = 1.13, 90% CI: 1.11 - 1.14, p < .001), supporting our expectation that participants improved their decisions as the game progressed. Overall, players' decision-making patterns met all our expectations for a measurement of decision-making under uncertainty.

Reliability.
A Bayesian hierarchical logistic regression with fixed and random intercept was fitted on all gambling choices, which were divided into good or bad choices. Only trials 40-120 were used, since earlier trials serve as an initial exploration phase and are therefore ignored when analyzing choices in the Iowa gambling task [5,90], resulting in excellent reliability of .95. The reliability metrics of all game-based cognitive measurements are summarized in Table 3.

External validation.
Asking participants to complete the Iowa gambling task after the game is unlikely to yield meaningful results because of transfer of learning from the game to the task, removing the element of ambiguity from task performance. Instead, we validated the game's decision-making measurement by correlating it with two correlates of the Iowa gambling task: cognitive reflection [89,90] and lack of premeditation [106]. We expected a positive association between decision-making quality and cognitive reflection and a negative association between decision-making quality and lack of premeditation. While participants' lack of premeditation (ω = .88) was not significantly correlated with their decision-making (r = -.07, 90% CI: -.25 - .12, p = .271), their cognitive reflection (ω = .67) positively correlated with their decision-making (r = .24, 90% CI: .06 - .40; p = .016). Altogether, the game's snacking challenge validly and reliably measured decision-making under uncertainty, although we cannot conclude whether it reflects players' lack of premeditation. The external validation of all game-based cognitive measurements is summarized in Table 4.

DISCUSSION
To meet our research goals of assessing the benefits and feasibility of cognitive games that facilitate players' sense of control, we compared Tunnel Runner players' experience against standard flanker and stop-signal tasks (RG4). We found substantial improvements to players' experience of autonomy, curiosity, meaning, enjoyment, reward, focused attention, and aesthetic appeal. Furthermore, we assessed the reliability of Tunnel Runner's cognitive measurements (RG3), finding that they showed generally good reliability. Lastly, our assessment of the internal and external validity of Tunnel Runner's measurements (RG1-2) showed that, except for the ice challenge, the game's measurements evoked expected behavioral tendencies and validated against task-based and self-report measurements. Thus, we establish Tunnel Runner as a cognitive assessment system that can obtain valid and reliable cognitive measurements along with large experiential benefits over traditional cognitive tasks. As such, we show the feasibility and potential benefits of cognitive games that facilitate players' sense of control.
6.1 Explanation
6.1.1 Player experience. Tunnel Runner showed unusually large improvements across experiential measures for a cognitive game, despite taking more time and effort to complete than traditional tasks. Most of the game's features, such as its visuals, point system, and narrative, are known to provide limited experiential benefits [63,85,105], making these features insufficient to explain our results. Tunnel Runner's experiential benefits are better explained by its increased player control, which is atypical for a cognitive game. Previous work established that a sense of control is crucial for motivation [21,23] and engagement [31,73,74], and work on avatar customization [7,8] showed that small improvements to players' control are enough to enhance their experiences. Thus, the increase in players' sense of control, i.e., autonomy, in Tunnel Runner likely resulted in large improvements to players' experiences [47]. These experiential benefits are particularly relevant considering that Tunnel Runner was the more demanding activity.
6.1.2 Cognitive measurement. Tunnel Runner provided reliable cognitive measurements that, except for its ice challenge, validly measured inhibitory control and decision-making under uncertainty.
Ice challenge: The game's ice trials did not evoke Simon-like behavioral patterns, as reflected in the trials' unusually high error rates alongside their positively sloped delta, suggesting that the challenge was not resolved by the cognitive mechanisms measured by the Simon task. This difference is likely due to the importance of spatial features in Tunnel Runner's gameplay, which are irrelevant in traditional Simon tasks. The ice trials' reversal of control likely required participants to switch response rules, relating responses to spatial features instead of ignoring them, such that the ice challenge may have required cognitive flexibility [26].
Flanker challenge: Tunnel Runner's flanker challenge validated as a measurement of interference control, both internally and against the flanker task. However, the game's RTs and error rates were considerably higher than those of the flanker task, and the game's flanker effects on RT and error rates significantly differed from the task's. These differences, alongside the weak correlation observed between task-based and game-based RT effects, suggest that the game provides a similar yet alternative measure of interference control.
While the game's flanker effect on accuracy and its BIS were more reliable than the task's, the task's flanker effect on RT was unusually high and more reliable than the game's [19,50,82]. This is likely because the game's flanker measurement used about 60% of the trials the task used, and because the game may have trained players for the task, enhancing their response consistency and thereby reducing measurement error [107]. Furthermore, the use of an online sample, rather than a lab-based one, likely improved the reliability of both task-based and game-based measurements by drawing on a more diverse population placed in a greater variety of real-life environments [107].
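As a reference for the BIS metric discussed here, the commonly used definition of the balanced integration score standardizes accuracy and mean RT across participants and takes their difference. The sketch below is our own illustration of that standard definition, not the paper's analysis code; the function name and sample values are hypothetical.

```python
import numpy as np

def balanced_integration_score(mean_rts, prop_correct):
    """BIS = z(accuracy) - z(mean RT), standardized across participants.
    Higher scores indicate faster and/or more accurate performance."""
    rts = np.asarray(mean_rts, dtype=float)
    acc = np.asarray(prop_correct, dtype=float)
    z = lambda v: (v - v.mean()) / v.std()
    return z(acc) - z(rts)

# Three hypothetical participants: faster and more accurate -> higher BIS.
bis = balanced_integration_score([400, 500, 600], [0.95, 0.90, 0.85])
```

Because BIS weighs speed and accuracy equally in standardized units, it guards against speed-accuracy trade-offs contaminating a single-outcome comparison.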
Lava challenge: The game's response inhibition measure validated both internally and against the stop-signal task, yet game-based SSRTs were higher than task-based SSRTs, suggesting that lava trials enable a similar yet alternative measure of response inhibition. Furthermore, the game- and task-based measures showed good and almost identical reliability, although the game used about 84% of the trials used by the task, while the task may have benefited from a training effect due to the game.
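SSRT (stop-signal reaction time) is commonly estimated with the integration method: rank-order the go RTs, take the RT at the quantile equal to the probability of responding despite a stop signal, and subtract the mean stop-signal delay (SSD). The helper below is an illustrative sketch under that standard definition, not the paper's exact estimator; the function name and data are ours.

```python
import numpy as np

def ssrt_integration(go_rts, p_respond_on_stop, mean_ssd):
    """Integration-method SSRT: the go RT at the quantile matching
    p(respond | stop signal), minus the mean stop-signal delay."""
    go_rts = np.sort(np.asarray(go_rts, dtype=float))
    n = int(np.ceil(p_respond_on_stop * len(go_rts)))
    nth_go_rt = go_rts[max(n - 1, 0)]
    return nth_go_rt - mean_ssd

# Hypothetical data: 100 go RTs from 300-696 ms, 50% failed stops, 250 ms SSD.
ssrt = ssrt_integration(range(300, 700, 4), 0.5, 250.0)
```

A higher game-based SSRT, as reported above, would under this estimator correspond to a slower latent stopping process in the game than in the task.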
Snacking challenge: The game's decision-making challenge evoked the expected behavioral patterns, alongside correlations with cognitive reflection but not with lack of premeditation. This pattern is not unexpected for a measure inspired by the Iowa gambling task, as associations between the Iowa gambling task and cognitive reflection replicate better [89,90] than associations with lack of premeditation, which were found in one study [106] but not in a larger study [4].
Decision-making under uncertainty is theoretically related to aspects of impulsivity such as lack of premeditation [5,106], making an empirical association desirable for the game's measurement. Unfortunately, changes to the decision-making challenge between studies 1 and 2 halved the intended sample sizes, leaving the correlation analysis underpowered to detect associations with both measures. Nevertheless, we believe that the decision patterns observed in the snacking trials, alongside their correlation with cognitive reflection, justify treating the game's snacking challenge as a measure of decision-making under uncertainty.

6.2 Implications
6.2.1 Implications for game design for non-gaming purposes. We found that Tunnel Runner efficiently provides valid and reliable cognitive measurements alongside substantial experiential benefits over traditional cognitive tasks. Thus, we establish that high-quality game-based cognitive measurement can be achieved with less restrictive experimental control than is typical in cognitive tasks, showing that cognitive measurements can be robustly taken in various contexts. Furthermore, we showed that enhanced player control can help cognitive games achieve greater experiential benefits than the modest benefits previously observed. As such, our work provides a justification and a foundation for the next generation of game-based cognitive assessment, i.e., games that do not imitate cognitive tasks and instead emphasize gameplay mechanics that facilitate players' sense of control.
With growing efforts to use video games as digital markers of mental health and well-being in daily life [45,66,67], there is a need to develop cognitive measurements that can compete with real-life distractions for participants' attention [108]. Many cognitive tasks and games are not suitable for this purpose, making them susceptible to low data quality [19,69,108] and high attrition rates [62] in repeated evaluations. Previous work showed that increasing player control through avatar customization can reduce attrition rates in digital self-improvement programs [8]. These findings, alongside Tunnel Runner's substantial experiential benefits, make the facilitation of players' sense of control a promising way to make cognitive games a viable source of digital markers of daily mental health and well-being.
The implications of our results extend beyond cognitive games and can inform the design of games for other non-gaming purposes [48], such as education, work, and self-improvement. While the challenges of enhancing players' control differ by context, our work gives reason for optimism, as it shows that it is feasible and beneficial to facilitate players' sense of control even in highly behaviorally restrictive and sensitive application areas such as cognitive measurement. Thus, it may also be feasible and beneficial to emphasize players' sense of control for non-gaming purposes other than cognitive measurement [8].

6.2.2 The challenges of facilitating players' sense of control. Our work shows some of the challenges of reducing experimental control in game-based cognitive assessment. Players' ability to correct errors in Tunnel Runner is a consequence of giving them more control, which can enrich their behavioral data [34,35]. However, error correction enables players to use response strategies involving quasi-random first responses that they later correct. Instead of using aggressive experimental control, we tried to nudge players away from this strategy while maintaining their sense of control by making players' first responses rotate the rats slightly further than later responses. This was paired with difficulty adaptation and streak mechanics designed to promote more efficient response strategies. Nevertheless, several players relied on frequent error corrections rather than accurate first responses, making their behaviors incomparable to those of other players. However, players' relatively high rates of incorrect first responses followed by error correction, if paired with conceptual and statistical advances that allow analyzing more than players' first responses, could enable cognitive games that facilitate players' control to more accurately assess the dynamic nature of players' cognition [35].
The defining feature and challenge of Tunnel Runner's design and development process, which distinguished it from other cognitive games, was the search for compromises that serve both player control and measurement validity whenever the two clashed. For example, the game's earliest versions gave players complete, continuous rotational control during running trials; however, players would sometimes keep rotating the rats while and after passing through the obstacle, which invalidates response data from the subsequent trial. Consequently, we disallowed rotation only before color assignment, which minimally affected players' sense of control in playtesting. In another compromise, we tried to encourage non-random first responses by making the rats' rotation slower the more they had been rotated in a trial; however, this feature was removed following playtesting, as it made the game feel less immersive, responsive, and player-driven. Instead, we used a less intrusive approach where the rats' rotation was faster right as players started rotating, akin to an initial sprint, which made success more difficult following an incorrect first response. In summary, facilitating players' sense of control while taking valid cognitive measurements required us to constantly balance the two and find compromises, as opposed to implementing easier solutions that restricted players' control whenever it threatened cognitive measurement. We found this design mindset, coupled with playtesting to assess any restriction to player control, to be crucial for creating cognitive games that provide considerably improved player experiences alongside high-quality cognitive measurement.

6.3 Limitations
Our studies and the design of Tunnel Runner have several limitations: (1) The number of players whose in-game responses could not be analyzed was about 2-3 times greater than for the tasks, although the frequency was in line with other online studies of cognitive games [50,105]. (2) Giving players greater control made them more likely to adopt response strategies (a quick quasi-random response followed by correction) that cannot be analyzed with current modeling practices. (3) Order effects may have affected our results, because Tunnel Runner always preceded the cognitive tasks to prevent exhaustion, demotivation, and training effects due to the tasks. (4) We did not assess the test-retest reliability of Tunnel Runner's measurements. (5) Using established questionnaires, we measured autonomy as a reflection of players' sense of control instead of using a more direct measure. (6) Gamers dominated our self-selected sample, and thus our results may not generalize to non-gamers.

6.4 Future Work
The ability of cognitive games that facilitate players' sense of control to provide high-quality cognitive data alongside large experiential benefits creates a wide range of opportunities. Such cognitive games can take different cognitive measures and use different gameplay mechanics, such as shooter mechanics, alongside other control-enabling design features, such as avatar customization. Cognitive challenges could also be embedded in games that provide a greater sense of control and autonomy than Tunnel Runner, and whose gameplay mechanics are closer to those of commercial games. This line of work should help produce powerful cognitive assessment systems that can effectively study cognition, mental health, and well-being in daily-life settings.
Facilitating players' control creates new challenges for the design of cognitive games. Instead of tight experimental control, games that enable player control need to rely on more subtle gameplay features that nudge players to provide high-quality behavioral data while maintaining a sense of control. Creating theoretically grounded and empirically justified strategies for game-based nudging is a major challenge and a crucial step toward making better cognitive games that facilitate players' sense of control. Furthermore, allowing and encouraging players to correct mistaken first responses calls for conceptual and statistical advances that consider more than players' first responses, which could help cognitive games provide a more holistic understanding of players' cognitive functions [34,35].

Conclusion
Tunnel Runner effectively delivers valid and reliable cognitive measurements while substantially improving players' experience. We demonstrate that it is feasible to obtain high-quality cognitive measurements by employing cognitive games that prioritize players' sense of control, showcasing Tunnel Runner as a successful proof-of-concept and an effective cognitive measurement tool. Our approach offers substantial experiential benefits and paves the way for engaging and motivating cognitive games that achieve high measurement validity and reliability with reduced experimental control.

Figure 1: In Tunnel Runner's regular trials (top left), all rats are assigned the same color, whereas in mismatching flanker trials (top right), the flanker rats have a different color than the central rat. In ice trials (bottom left), the tunnel is covered by ice, reversing how the rats react to players' input. A stop signal appears in lava trials (bottom right), penalizing players if they move the rats.

Figure 2: In Tunnel Runner's snacking trials, the player is asked to choose between 4 boxes, two of which provide smaller immediate rewards yet are beneficial in the long run, while the other two provide larger immediate rewards yet are detrimental in the long run.

Figure 3: Flow of a single trial in the flanker task. A single type of stimulus is presented per trial.

Figure 4: Flow of a single trial in the stop-signal task. A single type of stimulus is presented per trial.

Figure 5: Density plot of average responses to the Player Experience Inventory's 3 autonomy-related questions, as responded to in reference to Tunnel Runner, the flanker task, and the stop-signal task.

Figure 6: Delta plots of Tunnel Runner's ice and flanker effects, separated per study. Reaction times (RT) are on the X-axis, while the magnitude of conflict effects on RT is on the Y-axis. Scale is in milliseconds.

Figure 7: Density plot of players' frequency of good responses (smaller immediate rewards with better results in the long run) to the game's snacking challenge after 40 trials, per study.

Table 1: Sample description of the two studies. *IQR stands for interquartile range.

Table 3: Reliability of the game's cognitive measurements. RT stands for reaction time, BIS stands for balanced integration score, and SSRT stands for stop-signal reaction time.

Table 4: Correlations between game-based cognitive measurements and external validation measurements: a flanker task (first 3 correlations), a stop-signal task (4th correlation), and questionnaires (last two correlations). RT stands for reaction time, BIS stands for balanced integration score, and SSRT stands for stop-signal reaction time. Due to clear expectations, p-values are based on one-sided tests with an alpha of .05. *Significant association.

Table 5: Internal consistency of user engagement and player experience subscales, assessed with McDonald's ω.