Independent Validation of the Player Experience Inventory: Findings from a Large Set of Video Game Players

Measuring the subjective experience of digital game players is essential to player experience research. Recently, the Player Experience Inventory (PXI) was developed, which assesses both functional and psychosocial consequences of digital gameplay. We present a pre-registered independent online study with a large sample to provide additional evidence of psychometric quality for the PXI. Responses from 1518 participants were collected, rating a recent or memorable experience playing a digital game using the PXI and related measures. While our results from standard psychometric reliability and validity analyses generally favored the PXI, we also identified challenges with the immersion construct. Further, we find that a ten-factor model, or alternatively an 11-factor model should enjoyment be measured, fits our collected data best. In sum, the PXI is a valuable tool to measure a variety of constructs central to player experience.


INTRODUCTION
In Games User Research (GUR), employing self-reports to measure players' subjective experiences is a popular method [9]. However, few properly validated survey scales exist for measuring player experience (PX), especially with a strong focus on providing actionable insights for practical game design. To fill this gap, Vanden Abeele et al. [66] developed the Player Experience Inventory (PXI), a 30-item survey scale based on means-end theory [24,53]. Since its initial development, the PXI has been translated into German [23], and both a short version [26] and a benchmark [25] for the scale have been created. Beyond this work on the PXI by people associated with its original authors, no independent evaluation of the PXI has occurred [64]. Furthermore, the samples used to develop and validate the PXI were below commonly recommended sample sizes for essential scale evaluation and validation methods, such as confirmatory factor analysis (CFA) [39]. In addition, past validation work on the PXI has primarily sampled students or volunteers. However, other samples, such as crowd-sourced participants, have seen increased use in research [e.g., 10]. Similarly, participants were predominantly young men, something acknowledged as a limitation by the original authors, and thus not representative of general player demographics, where an equal share of men and women report playing digital games [19]. Thus, it remains to be seen whether the PXI also applies to more diverse populations.
Given that the quality of research findings depends on the reliability and validity of the methods used [2], it is of utmost importance for researchers studying digital games to have an adequately validated scale for measuring players' experiences. The PXI is a promising scale, with its focus on both the psychosocial consequences of gameplay and the functional consequences of game mechanics. However, an independent investigation into the quality of the PXI has yet to be conducted because previous studies on the PXI were always run by people directly associated with the original team of researchers behind the scale. Motivated by this research gap, we assessed the psychometric quality of the PXI in a pre-registered online survey following current best practices for scale quality investigation. Thus, this work's contribution is a large-sample independent validation of the PXI. Data from 1518 crowd-sourced participants was collected, who were asked to rate a recent or memorable experience playing a digital game using the PXI and related scales. Results generally demonstrated good psychometric quality for the PXI and supported the proposed ten-factor theoretical model behind the scale. Standard reliability and validity measures mainly favored the PXI but indicated room for improvement regarding certain constructs. In particular, the construct of immersion was negatively salient in several respects. Overall, the findings of this study demonstrated that the PXI is a reliable and valid tool for measuring players' experience with digital games, contributing to a more accurate measurement of the gaming experience.

RELATED WORK
2.1 The PXI
The PXI is a 30-item survey scale developed and validated based on input from 64 GUR experts in two iterations and data collected from 529 players across five studies. The PXI was designed to measure digital games' psychosocial and functional consequences. Functional consequences are defined as "the immediate and tangible consequences that are experienced directly by consumers, during the use of the product" [66, p. 3]. Psychosocial consequences, in contrast, "exceed the immediate usage level and reach into the social or psychological level" [66, p. 3-4]. While there are numerous scales to gauge psychosocial consequences in PX research, the measurement of functional consequences is unique to the PXI [66].
The PXI measures ten different constructs, five each for the functional and psychosocial consequences of playing digital games. The constructs and their respective definitions are presented in Table 1. Per construct, three items are used, and participants' answers are recorded on a seven-point Likert-type response scale, ranging from -3 (Strongly disagree) to +3 (Strongly agree). For functional consequences, the scale measures the following constructs: ease of control, challenge, progress feedback, goals and rules, and audiovisual appeal. The specific constructs of the psychosocial consequences measured with the PXI are meaning, immersion, mastery, curiosity, and autonomy. Beyond the ten constructs of the PXI, there is an additional construct, namely enjoyment, suggested by the authors to be measured alongside the PXI with three dedicated items but not considered part of the actual scale. In the context of media entertainment, enjoyment has been described as "an individual's positive response towards media technology and its content" [[68] in 47, p. 927].
Graf et al. [23] translated the PXI into German and validated the translated version in an online study (N = 506). Results showed that the translated version had good psychometric properties, although there was room for improvement concerning the scale's discriminant validity. Besides the original PXI, a short version exists, which consists of 11 items. With one item per construct, including enjoyment, the miniPXI [26] was developed across three studies based on data and insights from 15 experts and 628 digital game players. In addition to these scale versions, Haider et al. [25] developed the PXI bench, an online tool for the analysis and comparison of PXI response data, which can be accessed on the PXI's official website. However, no additional investigation into the psychometric quality of the PXI has occurred since the original validation. Thus, there is a need for additional validation of the PXI to see if the initial evidence for the scale's quality can be reproduced.
Beyond the general psychometric quality, the theoretical model behind the PXI also calls for further inquiry. In their paper, the PXI's authors employed factor analyses to test a ten-factor model for the PXI, with a dedicated factor per construct [66]. In this regard, it was left open what the psychometric quality of the enjoyment items was and how an enjoyment factor might fit into this model, if at all. Furthermore, the PXI's authors performed a mediation analysis to investigate the theoretical model and, thus, the relationship between the functional and psychosocial consequences, as well as game enjoyment. However, they did not report detailed results on how a model considering the ten constructs, the consequences, and game enjoyment would perform. Given the scale's name, one might further expect the PXI to measure an overall factor of PX, although the authors have never suggested this. Nevertheless, all of this leaves the question of whether a simple ten-factor model best fits the scale or if alternative models, including higher-order factors, such as for the consequences or overall PX, are better suited.

Usage of the PXI
As part of our initial investigation into the PXI, we first conducted a literature review of the peer-reviewed articles that cited the original papers on the English and German PXI [i.e., 23,65,66] as of March 2023 (N = 45). Details on the literature review can be found in the supplementary materials on OSF. We aimed to understand how the PXI has been used within academia since its publication. This aided us in gaining a bottom-up perspective of the measurement models to validate for the PXI. We found that quality investigations of the scale, even when used in a novel context or with selected dimensions, were infrequent. Further, the quantification of the scale was sometimes unclear, meaning it was uncertain how researchers averaged the response items for further calculations. However, in some instances where quantification could be assessed, we found that the authors computed a general PX score. In other words, researchers would average all the item scores from the ten, or sometimes eleven, dimensions of the PXI into a single overall score per participant. Indeed, some researchers also described using the PXI to measure PX generally without referring to the actual factors of the scale. As such, it is important to investigate a model of the PXI that includes a general PX factor to understand whether such usage is psychometrically justified. Additionally, we found many researchers employing the suggested enjoyment items, which, as described above, have not been validated alongside the other ten factors of the scale. Another frequently measured dimension was immersion. This dimension was often used in conjunction with other scales for specific contexts, particularly virtual reality applications.
We also found that researchers administered the scale differently from the originally proposed version despite the PXI's authors stressing the importance of using the original scale and response options on their website. Instead of the -3 to +3 range of the Likert-type response scale, often a range from 1 to 7 or even 1 to 5 was employed.
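For data that were collected with a 1-to-7 coding, a simple linear shift makes the scores numerically comparable to the recommended -3 to +3 anchors. The following minimal sketch illustrates this; the data frame and item names are hypothetical placeholders, not official PXI materials.

# Minimal sketch: shift hypothetical 1-7 coded PXI responses to the
# recommended -3..+3 coding (a pure linear shift; item wording and
# response anchors must still match the original scale).
pxi_raw <- data.frame(
  immersion_1 = c(5, 7, 2),   # hypothetical example responses coded 1-7
  immersion_2 = c(6, 6, 3),
  immersion_3 = c(4, 7, 1)
)
pxi_recoded <- pxi_raw - 4    # 1..7 becomes -3..+3
print(pxi_recoded)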

Table 1: Constructs of the PXI and their definitions.

Functional consequences
Ease of control: "The extent to which a player finds the actions to control the game clear and intuitive"
Challenge: "The extent to which the specific challenges in the game match the players skill level"
Progress feedback: "The extent to which it is clear to the player how well he or she is doing in the game"
Goals and rules: "The extent to which the overall objective and rules are clear to the player"
Audiovisual appeal: "The extent to which a player appreciates the audiovisual styling of the game"

Psychosocial consequences
Meaning: "A sense of connecting with the game, resonating with what is important"
Immersion: "A sense of immersion and cognitive absorption, experienced by the player"
Mastery: "A sense of competence and mastery derived from playing the game"
Curiosity: "A sense of interest and curiosity roused by the game"
Autonomy: "A sense of freedom and autonomy to play the game as desired"

The importance of independent validation
As the authors of the PXI themselves emphasized, "scale development and validation is an ongoing process" [66, p. 10]. Regarding the quality of a survey scale, researchers are typically interested in three criteria: objectivity, reliability, and validity [17]. Objectivity signifies that "any statement of fact made by one scientist should be independently verifiable by other scientists" [50, p. 6]. Reliability considers "how accurately a test measures the thing which it does measure" [38, p. 14]. Finally, validity is concerned with "whether a test really measures what it purports to measure" [38, p. 14]. Only if all three quality criteria are met can researchers have confidence in the data gathered using survey scales and, consequently, in the conclusions derived from them. While the original work on the PXI provided extensive results that speak to the quality of the scale, these results are limited to the sample and setting of the original paper. Ideally, a scale's psychometric quality should be assessed whenever it is used [21]. Because this is not always realistic, independent validation in other settings can provide valuable additional insight into the quality of a scale. Furthermore, the PXI was developed and validated based on data from predominantly young men, contrasting with the general demographics of digital game players [19]. This limitation was also acknowledged by its authors, who called for further studies "to assess how the PXI performs across different game audiences" [66, p. 10-11]. Thus, additional studies are needed to determine how the PXI performs in other populations, as the psychometric properties of a scale can vary considerably between different groups of people [21]. In addition, concerning the popular recruitment approach of crowd-sourcing [10], an evaluation of the PXI has yet to be conducted, given that past samples mainly consisted of students and volunteers. Several previous studies [e.g., 35,41,48,67] have shown that other scales proposed to measure players' experiences do not hold up under second inspection or at least require certain modifications to achieve satisfactory psychometric quality. For the PXI, such independent validation is still pending [64]. In summary, although the initial results on the psychometric quality of the PXI are promising, additional evidence is needed in order for researchers, both in industry and academia, to use the scale with confidence.

METHODS
A pre-registered online study was conducted to evaluate the psychometric quality of the PXI. During the online study, participants were asked to think about a digital game they recently played or know well before responding to several standardized survey scales, including the PXI. The study was reviewed and approved by the ethics committee of the authors' university and pre-registered on OSF (https://doi.org/10.17605/OSF.IO/BUQ5T).

Measures
After choosing a game to think about, participants responded to all PXI items, in addition to several further scales related to the PXI, namely the Player Experience of Need Satisfaction scale [PENS, 58], the AttrakDiff [29], and the interest/enjoyment subscale from the Intrinsic Motivation Inventory [IMI, 55,57]. The selection of the additional scales was mostly based on the original work on the PXI. All items were presented in a randomized order with one individual page per scale unless stated otherwise. The exact wording of all items is provided in the supplementary materials, except for the PENS [58] due to copyright reasons. Reliability for all scales was investigated using the internal consistency coefficients α [12] and ω [45], which delivered satisfactory results (≥ .70) for all scales, except for the PXI's immersion construct and the AttrakDiff's pragmatic quality [29], which fell just below the desired threshold (see subsection 4.3 for results on the PXI, and the supplementary materials on OSF for the other scales).
3.1.1 PXI. All 30 items of the English PXI were used alongside the three suggested items for enjoyment. Items were distributed across three pages, analogous to the survey on the PXI website, and responses were collected using the recommended seven-point Likert-type response scale ranging from -3 ("Strongly disagree") to +3 ("Strongly agree").
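As an illustration of how responses with this item structure are typically aggregated, the following sketch computes one mean score per PXI construct from its three items; the data frame and column names are hypothetical placeholders rather than the official item labels.

# Minimal sketch, assuming a data frame `pxi` with three columns per
# construct, coded from -3 to +3 (hypothetical column names).
set.seed(1)
pxi <- as.data.frame(matrix(sample(-3:3, 30 * 5, replace = TRUE), nrow = 5))
names(pxi) <- paste0(rep(c("autonomy", "curiosity", "immersion", "mastery",
                           "meaning", "ease_of_control", "challenge",
                           "progress_feedback", "goals_rules",
                           "audiovisual_appeal"), each = 3), "_", 1:3)

constructs <- unique(sub("_[0-9]+$", "", names(pxi)))
# One mean score per participant and construct (no reverse-coded items
# are assumed here).
scores <- sapply(constructs, function(cn) {
  rowMeans(pxi[, grep(paste0("^", cn, "_"), names(pxi))])
})
round(scores, 2)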

PENS.
Participants responded to all 21 items of the PENS [58]. We chose the PENS because it was already used in the original validation of the PXI and contains several constructs related to those of the PXI: autonomy, competence, relatedness, and intuitive controls, with three items each, and the construct presence with nine items. In the context of the PENS and self-determination theory, autonomy "concerns a sense of volition or willingness when doing a task" [[15,16] in 58, p. 349]. Competence refers to "a need for challenge and feelings of effectance" [[14,70] in 58, p. 349], while "[r]elatedness is experienced when a person feels connected with others" [[40,56] in 58, p. 350]. The intuitive controls construct considers "whether [the game controls] make sense, are easily mastered, and do not interfere with once sense of being in the game" [sic, 58, p. 350]. Finally, presence describes "the sense that one is within the game world, as opposed to experiencing oneself as a person outside the game, manipulating controls or characters" [58, p. 350]. Responses to the PENS were collected on a seven-point Likert-type response scale from 1 ("Do not agree") to 7 ("Strongly agree").

AttrakDiff.
As in the original PXI paper, participants responded to the AttrakDiff semantic differential scale [29]. We used the most recent 28-item version of the scale, available on the official website, which measures four constructs with seven items each: pragmatic quality (PQ), hedonic quality - identification (HQ-I), hedonic quality - stimulation (HQ-S), and attractiveness (ATT). Pragmatic quality concerns attributes of a system "connected to the users' need to achieve behavioral goals" while "hedonic attributes are primarily related to the users' self" [29, p. 322]. Stimulation, alongside novelty and challenge, is considered "a prerequisite of personal development [...] which in turn is a basic human need" [29, p. 322]. Identification, on the other hand, "addresses the human need to express one's self through objects" [29, p. 322]. Finally, attractiveness "is a global assessment based on the perceived [product] qualities" [30, p. 3, translated from German]. Responses were collected on a seven-point semantic differential response scale, with, for example, the words "ugly" and "attractive" at two opposing poles.
3.1.4 IMI - interest/enjoyment. In addition, we had participants fill out the subscale for interest/enjoyment from the IMI [55,57]. Responses to the seven items were collected on the seven-point Likert-type response scale from 1 ("Not at all true") to 7 ("Very true") recommended by the authors, and items were slightly adapted to fit the gaming context, which is commonly done [55]. We chose the IMI because it is the most frequently used scale to measure game enjoyment [47] and because enjoyment was also measured in the original PXI paper.

Procedure
Participants provided informed consent on the first page of the online survey. Next, they were given instructions for the task to be completed. Following the original PXI paper, we asked participants to recall an experience with a game they recently played or know well. For this, we used the critical incident technique, commonly used in HCI research [e.g., 5, 61], asking participants to describe the game in at least 50 words. The exact wording of the critical incident question was as follows: "Please describe the digital game you recently played or that you remember well. Try to describe this particular game as accurately and detailed as you remember in at least 50 words, and try to be as concrete as possible. You can use as many sentences as you like." Participants were further instructed to provide the name of the chosen game, which was then used in subsequent questions to personalize the survey. This ensured that participants would think about the described game while filling out the survey (e.g., "Please fill out the following questions for the digital game you recently played or that you remember well ([name of game])."). After the critical incident question, participants responded to several closed questions concerning the chosen game, which we adopted from the survey on the PXI website (e.g., controls used, the platform played on), before filling out the PXI and the three items for game enjoyment [66]. On the following survey pages, participants filled out the other scales in a randomized order. Next, participants provided demographic information (age, gender, country of residence, game experience, playtime). Lastly, participants could give feedback before receiving their compensation. To ensure sufficient response quality, the survey included two instructed response items [13] embedded among the survey scales and a single item for self-reported data quality [46] at the end of the survey. Completing the survey took participants an average of 11.71 minutes (SD = 5.54, min = 3.48, max = 52.92).

Pre-study
Before pre-registering the study, we tested the procedure and task in a small-sample pre-study (N = 50) to examine whether participants could complete the task and whether there were any major issues with the study procedure. The recruitment criteria for the pre-study were the same as for the main study (see below). Participants encountered no issues, and all responses, including those to the critical incident question, were satisfactory. Thus, no changes to the study procedure or the recruitment criteria were necessary. For this reason, the data from the pre-study was combined with the main sample for the analysis.

Participants
Prolific, a crowd-sourcing platform recently shown to have high data quality [18,51], was used for recruitment. A total of 1501 participants from the United Kingdom (UK) were recruited and reimbursed £1.50 for completing the study. Participants were screened on Prolific on whether they play digital games at least occasionally. A target sample size of at least 1050 responses after data cleaning was set based on rules of thumb for structural equation modeling, recommending at least ten observations per estimated model parameter [39]. We followed recommendations by Brühlmann et al. [10] for data cleaning, filtering out participants using two instructed response items [13], a seriousness check [46], and responses to open answers. Responses from nine participants were removed based on the seriousness check, and another three responses were removed due to an incomplete or interrupted survey. Five additional participants were removed for indicating a current country of residence outside the UK, and 16 were removed based on low-quality critical incident game descriptions (e.g., repeating words to reach the word minimum, not describing a digital game, indicating that they could not accurately remember the game). The final sample, including participants from the pre-study, consisted of 1518 responses. Thus, no additional recruitment was needed to achieve the target sample size. Of the participants, 639 were women, 864 were men, nine were non-binary people, one person preferred to self-describe, and five people chose not to provide information on their gender. The average age of participants was 37.47 years (SD = 12.18, min = 18, max = 79). The most frequently used game platform was consoles (574 participants), followed by PC (497) and smartphones (435). Participants most frequently stated that they played the rated game alone (1086), followed by playing online with other players (481) and playing locally with others (131). Most participants used controllers to play the game (620), followed by touch controls (507) and keyboards (438). The most frequently rated digital game was FIFA (mentioned 60 times), followed by Candy Crush (57), The Sims (37), Mario Kart (33), Call of Duty (30), Grand Theft Auto (30), Fortnite (27), and Minecraft (27). The most popular genres, self-reported by the participants, were puzzle games (282), action-adventure (270), and action role-playing (155). On average, players rated their game expertise at 4.84 on a seven-point response scale (SD = 1.38, min = 1.00, max = 7.00), and most often indicated a playtime of 5 to 10 hours per week (421 participants), 2 to 5 hours (396), and 10 to 20 hours (291).

RESULTS
The following section describes different forms of psychometric quality investigation for the PXI. The complete analysis can be found in the supplementary materials on OSF. The analyses mostly followed the methods used in the original work on the PXI. All results were obtained using the statistical software R [52, version 4.3.0]. Overall descriptive statistics for the collected data are presented in Table 2.

Item analysis
We began the psychometric investigation into the PXI with an item analysis. We considered descriptive statistics, item difficulty and variance, discriminatory power (i.e., item-total correlation), and inter-item correlations for all 30 PXI items and the three enjoyment items. In summary, the item analysis showed no problematic values for most PXI items. However, a few items exhibited conspicuous results. Namely, descriptive statistics deviated from the other items for item immersion_1, which exhibited a lower mean and a different distribution of responses compared to other items. Further, the item variances were below 1 for multiple items (see supplementary materials for details). We thus continued with the analysis while keeping those items in mind for the interpretation of further results.
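The sketch below illustrates the kind of item-level statistics considered here (descriptive statistics, item variances, corrected item-total correlations, and inter-item correlations) using the psych package on simulated, hypothetical item responses; it is an illustration, not the original analysis code.

# Minimal sketch of an item analysis on simulated responses to three
# hypothetical items of one subscale, coded from -3 to +3.
library(psych)

set.seed(42)
n <- 200
latent <- rnorm(n)                  # simulated "true" construct score
sim_item <- function() round(pmin(3, pmax(-3, latent + rnorm(n, sd = 0.8))))
pxi_items <- data.frame(immersion_1 = sim_item(),
                        immersion_2 = sim_item(),
                        immersion_3 = sim_item())

describe(pxi_items)                 # means, SDs, skew, etc. per item
apply(pxi_items, 2, var)            # item variances
alpha(pxi_items)$item.stats         # corrected item-total correlations (r.drop)
cor(pxi_items)                      # inter-item correlations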

Confirmatory factor analyses
As pre-registered, we next performed multiple CFAs to investigate the model fit of the PXI. We tested multiple models for the following reasons. First, we encountered different conceptualizations regarding the application of the PXI in our literature review on the current usage of the PXI. Second, the original work on the scale likewise offers multiple conceptualizations, specifically concerning the distinction of the PXI's constructs into functional and psychosocial consequences. Given that these applied or proposed models differed regarding the inclusion of certain higher-order factors for the functional and psychosocial consequences and/or for PX, we conducted multiple CFAs corresponding to these conceptualizations, as recommended by Brown [8]. The multivariate normality assumption was not met, tested using the Henze-Zirkler test [31] and Mardia's test [43]. We thus chose to use a robust maximum likelihood estimator with a Yuan-Bentler scaling correction for all CFAs, which is recommended for non-normal data and reduces the risk of Type I error [8]. In all analyses, the factor loading for the first indicator of each latent variable was constrained to one, as is standard procedure when defining a metric for each factor [8,39,54,62]. For the judgment of model fit, we opted for the same criteria used during the original PXI validation (see Table 3), combining multiple criteria to improve the acceptability of Type I and Type II error rates [[33] in 8].
Based on the information provided by the original authors of the PXI and the findings of our literature review on how the PXI is currently used in research, we investigated the fit of five different models to the collected data. Following the factors proposed in the original work on the PXI, we started with a ten-factor model for the 30 PXI items, with one factor each for the PXI's subscales. All items were specified to load on their designated factor. In addition, we investigated a model with two higher-order factors, one each for the functional and psychosocial consequences, upon which the five respective factors of the PXI constructs loaded. This model was based on the originally proposed theoretical structure behind the PXI, which suggests further separating the ten factors of the scale into functional and psychosocial consequences. Two further models included an overall general factor for PX, once with and once without the higher-order factors for the functional and psychosocial consequences. We tested these models with an overall PX factor based on our finding from the literature review that some authors form an overall score using the items of the PXI. Finally, an additional 11-factor model was tested, including a factor for the three enjoyment items. Items were specified to load on their designated factor, following the theoretical structure proposed by the PXI's authors. All results from the CFAs are presented in Table 4.
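To make the tested specifications concrete, the following sketch shows how the ten-factor model (model one) and the higher-order model for the functional and psychosocial consequences (model two) could be set up in lavaan with a robust maximum likelihood estimator. The item and factor names, as well as the simulated data, are hypothetical placeholders for illustration; this is our sketch, not the original authors' analysis code.

library(lavaan)

# Simulate hypothetical item responses: ten correlated construct scores,
# three items each (column names are placeholders, not official item labels).
set.seed(1)
n <- 300
g <- rnorm(n)  # shared component so that the simulated constructs correlate
constructs <- c("easeofcontrol", "challenge", "progress", "goals", "audiovisual",
                "meaning", "immersion", "mastery", "curiosity", "autonomy")
d <- as.data.frame(do.call(cbind, lapply(constructs, function(cn) {
  latent <- 0.6 * g + rnorm(n)
  items  <- replicate(3, latent + rnorm(n, sd = 0.7))
  colnames(items) <- paste0(cn, "_", 1:3)
  items
})))

# Model one: one factor per PXI construct (first loadings are fixed to 1 by default).
ten_factor <- '
  ease_of_control   =~ easeofcontrol_1 + easeofcontrol_2 + easeofcontrol_3
  challenge         =~ challenge_1 + challenge_2 + challenge_3
  progress_feedback =~ progress_1 + progress_2 + progress_3
  goals_rules       =~ goals_1 + goals_2 + goals_3
  audiovisual       =~ audiovisual_1 + audiovisual_2 + audiovisual_3
  meaning           =~ meaning_1 + meaning_2 + meaning_3
  immersion         =~ immersion_1 + immersion_2 + immersion_3
  mastery           =~ mastery_1 + mastery_2 + mastery_3
  curiosity         =~ curiosity_1 + curiosity_2 + curiosity_3
  autonomy          =~ autonomy_1 + autonomy_2 + autonomy_3
'

# Model two: higher-order factors for the functional and psychosocial consequences.
higher_order <- paste(ten_factor, '
  functional   =~ ease_of_control + challenge + progress_feedback +
                  goals_rules + audiovisual
  psychosocial =~ meaning + immersion + mastery + curiosity + autonomy
')

# estimator = "MLR" requests robust maximum likelihood with a
# Yuan-Bentler scaled test statistic.
fit_m1 <- cfa(ten_factor,   data = d, estimator = "MLR")
fit_m2 <- cfa(higher_order, data = d, estimator = "MLR")

fitMeasures(fit_m1, c("chisq.scaled", "cfi.robust", "tli.robust",
                      "rmsea.robust", "srmr"))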
Table 4: Fit indices for CFA models of the PXI. Models one and five were assessed without higher-order factors, and models two, three, and four included varying higher-order factors.

Results from the CFAs indicated an acceptable to excellent fit of the models without higher-order factors to the data, both with and without the enjoyment items (three criteria excellent, two acceptable), judging by the cut-off criteria for model fit used in the original work on the PXI (see Table 3). The model including higher-order factors for the functional and psychosocial consequences exhibited slightly worse but still mostly acceptable to excellent model fit statistics (two criteria excellent, two acceptable, one not acceptable). Regarding the model including a higher-order factor for PX in addition to the consequences, the fit was also slightly worse compared to the models without higher-order factors (two criteria excellent, two acceptable, one not acceptable), and a warning suggested that the model might not be identified. For the model including just a higher-order factor for PX, without the consequences, the fit indices mostly fell just outside of the desired thresholds (one criterion excellent, one acceptable, three not acceptable).
In addition to comparing multiple fit indices to judge model fit, we used χ² difference tests to see if the model fit would differ significantly among the three nested models one through three. Given the warning for model four (not identified), we did not include it in this analysis. Results are reported in Table 5. In general, the results were in line with the findings thus far. Given that the χ² difference test was significant, the "larger" model with more freely estimated parameters (model one) fit the data better than the "smaller" models (two and three) in which the parameters in question were fixed [8,69]. Thus, the ten-factor model without higher-order factors fit the data best (model one), followed by the model including two factors for the consequences (model two). In contrast, the third model with an overall factor for PX fit the data worst. Finally, both the Akaike information criterion [AIC, 3] and the Bayesian information criterion [BIC, 60], also reported in Table 5, favored the ten-factor model over all other models [8].
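Continuing the CFA sketch above, nested models fitted with MLR can be compared through a scaled χ² difference test and through information criteria; fit_m1 and fit_m2 are the hypothetical model fits from that sketch.

# Scaled chi-square difference test for the nested models from the CFA
# sketch above (for MLR fits, lavTestLRT applies a Satorra-Bentler-style
# correction by default).
lavTestLRT(fit_m1, fit_m2)

# Information criteria for model comparison (lower values indicate better fit).
AIC(fit_m1); AIC(fit_m2)
BIC(fit_m1); BIC(fit_m2)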

Reliability
We calculated both coefficients α [12] and ω [45], including 95% confidence intervals, as indicators of reliability, based on recommendations by Dunn et al. [20]. Table 6 contains all values for both coefficients, separated by PXI subscale and for the overall scale. All values were above .70, indicating adequate internal consistency [22], except for the immersion subscale, which was just below the desired threshold.
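As an illustration of this type of reliability analysis, the sketch below estimates α and ω with bootstrapped 95% confidence intervals using the MBESS package for a single, simulated three-item subscale; the data and item names are hypothetical placeholders.

# Minimal sketch: internal consistency with confidence intervals for one
# hypothetical three-item subscale.
library(MBESS)

set.seed(7)
latent <- rnorm(300)                       # simulated construct score
pxi_items <- data.frame(item_1 = latent + rnorm(300, sd = 0.8),
                        item_2 = latent + rnorm(300, sd = 0.8),
                        item_3 = latent + rnorm(300, sd = 0.8))

# Coefficient alpha with a bootstrapped 95% confidence interval.
ci.reliability(data = pxi_items, type = "alpha", conf.level = 0.95,
               interval.type = "perc", B = 500)

# Coefficient omega (equivalent to composite reliability) with a 95% CI.
ci.reliability(data = pxi_items, type = "omega", conf.level = 0.95,
               interval.type = "perc", B = 500)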

Convergent and discriminant validity
In the original PXI paper, the convergent and discriminant validity of the constructs was assessed through composite reliability (CR), average variance extracted (AVE), and maximum shared variance (MSV). We followed this procedure. Values for CR, AVE, and MSV were calculated based on the 11-factor CFA (model five). All results are presented in Table 6. We also calculated CR, AVE, and MSV based on a ten-factor CFA without enjoyment (model one), which yielded comparable results (see supplementary materials). Results were interpreted as follows [28]: CR should be ≥ .70 as evidence for reliability. Concerning a construct's convergent validity, the AVE should be ≥ .50. For discriminant validity, a construct's AVE should be larger than its MSV, and the square root of the AVE of a construct should be greater than any inter-construct correlation (reported in Table 7).
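One way to obtain these statistics from a fitted lavaan model is sketched below: CR and AVE are computed from the standardized loadings, and MSV is taken as the largest squared latent correlation of a construct with any other construct. The object fit_m1 refers to the hypothetical ten-factor fit from the earlier CFA sketch, so this is an illustration rather than the exact analysis code.

library(lavaan)

# Assumes `fit_m1`, the hypothetical ten-factor fit from the CFA sketch above.
std    <- lavInspect(fit_m1, what = "std")   # standardized solution matrices
lambda <- std$lambda                         # standardized factor loadings

# Composite reliability (CR) and AVE per factor from standardized loadings.
cr  <- apply(lambda, 2, function(l) sum(l)^2 / (sum(l)^2 + sum(1 - l[l != 0]^2)))
ave <- apply(lambda, 2, function(l) mean(l[l != 0]^2))

# MSV: largest squared latent correlation of each factor with any other factor.
phi <- lavInspect(fit_m1, what = "cor.lv")
msv <- sapply(seq_len(ncol(phi)), function(i) max(phi[i, -i]^2))
names(msv) <- colnames(phi)

round(cbind(CR = cr, AVE = ave, MSV = msv), 2)
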
Regarding CR, all subscales met the desired value of ≥ .70. AVE was good for more than half of the PXI's constructs but slightly below the desired value of ≥ .50 for mastery, immersion, challenge, and ease of control. MSV values were smaller than the AVE for all constructs except immersion, indicating predominantly good discriminant validity. Further evidence for the PXI's discriminant validity was also provided by the results in Table 7, as the square root of most constructs' AVE was greater than the inter-construct correlations, although at times just barely. Only the inter-construct correlations between immersion and three other constructs, meaning, audiovisual appeal, and enjoyment, were greater than the square root of immersion's AVE.

Criterion validity
To assess the criterion validity of the PXI constructs, we considered bivariate correlations (Pearson's r) between the PXI and selected constructs of the other scales as indicators of criterion validity (see Table 8). The mapping of the PXI's constructs to those measured with the other scales was taken from the original PXI paper. Furthermore, we considered the correlation between the PXI enjoyment items and the IMI. Results mainly showed strong positive correlations between the PXI constructs and their mapped counterparts, as expected based on the original PXI paper. However, the PXI construct challenge correlated only moderately with the construct of competence from the PENS, while meaning showed a moderate positive correlation with attractiveness from the AttrakDiff. At the same time, progress feedback and goals and rules showed only weak correlations with pragmatic quality from the AttrakDiff. Thus, correlation results mostly favored the PXI constructs' criterion validity.
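The reported criterion-validity coefficients are plain Pearson correlations with 95% confidence intervals, which can be obtained as sketched below; the two score vectors are simulated, hypothetical subscale means rather than our data.

# Minimal sketch: Pearson correlation with a 95% confidence interval
# between two simulated, hypothetical subscale mean scores.
set.seed(3)
pxi_challenge   <- rnorm(300)                        # hypothetical PXI challenge scores
pens_competence <- 0.5 * pxi_challenge + rnorm(300)  # hypothetical PENS competence scores

cor.test(pxi_challenge, pens_competence, method = "pearson", conf.level = 0.95)
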
Finally, we calculated a hybrid structural equation model to investigate the theoretical relationship between the psychosocial consequences, the functional consequences, and the enjoyment items of the PXI. Based on the original work on the PXI, we expected the following relationships:
• The functional consequences positively predict enjoyment.
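Below is a minimal, hypothetical lavaan sketch of how a hybrid model containing the listed path (functional consequences predicting enjoyment) could be specified on top of the higher-order measurement model from the earlier CFA sketch; the further structural paths of the original theoretical model are not reproduced in this sketch, and all names are placeholders.

library(lavaan)

# Reuses `d`, `g`, `n`, and `higher_order` from the CFA sketch above and adds
# three hypothetical enjoyment items to the simulated data.
enjoy_latent <- 0.6 * g + rnorm(n)
for (i in 1:3) d[[paste0("enjoyment_", i)]] <- enjoy_latent + rnorm(n, sd = 0.7)

# Hybrid model: higher-order measurement model plus an enjoyment factor and
# the single structural path listed above (functional -> enjoyment). Other
# paths from the original theoretical model are omitted here.
hybrid_model <- paste(higher_order, '
  enjoyment =~ enjoyment_1 + enjoyment_2 + enjoyment_3
  enjoyment ~ functional
')

fit_sem <- sem(hybrid_model, data = d, estimator = "MLR")
summary(fit_sem, fit.measures = TRUE, standardized = TRUE)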

DISCUSSION
We have presented results from an independent psychometric evaluation of the PXI using a large sample. The PXI is a promising scale for measuring the functional and psychosocial consequences of playing digital games [66]. However, an independent validation of the scale had yet to be conducted [64]. For this reason, we implemented a pre-registered online study and collected data from 1518 participants. With the collected data, we conducted various forms of psychometric quality analysis to evaluate the PXI. Results, in general, show that the PXI performs well regarding commonly used scale reliability and validity indicators, with good CFA model fits and satisfactory internal consistency values for all constructs except immersion. In addition, results mostly favored the convergent and discriminant validity of the PXI while further supporting the scale's criterion validity.
Results from the present study are comparable to those reported in the original work on the PXI [66] and for the German version of the scale [23]. Most of the sample initially used to develop the PXI consisted of students. In contrast, the German PXI was developed using volunteer participants recruited over mailing lists, online groups, and social media. Also, samples from previous work on the PXI consisted predominantly of young men, not reflecting the demographics of digital game players [19], and were below common sample size recommendations [39]. We recruited a large sample of participants using a crowd-sourcing platform, resulting in a more balanced sample regarding gender and age distribution. This is also important considering that about an equal share of men and women and a substantial number of older adults play digital games [19]. Our findings thus highlight that the PXI retains consistent psychometric quality across various populations and performs just as well with a large, crowd-sourced sample that is more balanced concerning gender and age as with previously used samples.
However, the results also revealed particular challenges with the PXI, some comparable to previous work, which must be addressed. Concerning the item analysis, multiple items exhibited low item variances. However, given the scale's performance in the other psychometric analyses, we do not see these results as strong evidence against the PXI's quality. Potentially, the item variance was low for some items, such as enjoyment, because of the wording of the critical incident technique, which likely caused participants to pick games they generally enjoyed, resulting in primarily high ratings and thus low variance. Furthermore, four constructs had AVE values below the desired threshold, indicating room for improvement concerning convergent validity: mastery, immersion, challenge, and ease of control. This suggests that the items for these constructs are not as closely related as would be expected if they formed a common factor. In the original validation study of the PXI, ease of control was the only construct that exhibited a suboptimal AVE value. Further, it barely met the desired threshold in the German PXI paper. Consequently, there appears to be room for improvement regarding the convergent validity of ease of control. However, for the constructs of mastery and challenge, past work reported no problematic results. It thus remains to be determined whether similar challenges with these constructs will arise in future work or whether they are unique to the present study. Given that the construct of immersion was conspicuous not only regarding the AVE but also in other analyses, we return to it in the following subsection 5.1. Finally, regarding criterion validity, the constructs of progress feedback and goals and rules exhibited only weak correlations with the related construct of pragmatic quality. Given that these constructs correlated as expected in the original PXI validation, it is unclear whether our experimental setup caused the low correlations or whether these results were due not to the PXI but rather to the construct of pragmatic quality as measured with the AttrakDiff. The criterion validity of progress feedback and goals and rules, as well as the question of whether the AttrakDiff holds up in the context of digital games, could thus be further explored in future work.

Challenges of the immersion construct
Based on the present work's results, the PXI's immersion construct was negatively salient in several respects. This is especially concerning given the frequent measurement of the immersion construct in research employing the PXI. The challenges of immersion in our independent validation are varied. Concerning reliability, immersion was the only construct of the PXI with internal consistency values below the desired threshold, although just barely. Regarding convergent validity, immersion's AVE fell below the desired value of ≥ .50. Furthermore, immersion was the sole construct with an MSV value larger than its AVE, while also exhibiting a lower square root of AVE than its inter-construct correlations with multiple other constructs, which speaks against the construct's discriminant validity and suggests that the immersion items are too closely related to the other factors. While the original PXI paper reported no challenges with immersion, the German PXI also showed issues with the discriminant validity of immersion. Finally, the item immersion_1 showed conspicuous descriptive statistics deviating from the other items during the item evaluation.
Immersion being a difficult construct to both define and measure is not a novel problem within PX research. Previous measurements, such as the Game Engagement Questionnaire [GEQ, 7], have already encountered difficulties when externally validated [6,49]. Further, we can see the presence of circular definitions of immersion in the GEQ's development paper: the researchers explain immersion as being engaged in an activity, but immersion is also a construct in their engagement questionnaire.
"Immersion is typically used to describe the experience of becoming engaged in the game-playing experience while retaining some awareness of one's surroundings." [7, p. 624] The original development paper of the PXI ofers a defnition of immersion, which also appears to be tautological in nature; "A sense of immersion and cognitive absorption, experienced by the player" [66, p. 5] By comparing the GEQ's and the PXI's defnition of immersion, we fnd the GEQ treats absorption as its own construct.Following, the two operationalizations disagree on whether an individual would be aware of their surroundings or not when experiencing immersion.See the PXI's item immersion_1 "I was no longer aware of my surroundings while I was playing." in comparison to the GEQ's defnition.This item was also answered with the greatest variance by our sample, with answers tending towards either end of the scale.This warrants further theoretical and methodological investigation of whether a lack of awareness of one's surroundings is an aspect of immersion or should constitute an alternative construct, such as absorption.Although recent literature has attempted to increase the clarity for the immersion construct [1] and other similar constructs concerning psychological absorption in a task [34], there remains more work to be done in regards to a more consistent defnition of what immersion entails.
Additionally, according to MacKenzie [42], a good definition needs to be unambiguously distinguishable from related constructs. However, it has remained difficult to differentiate immersion. In the case of the PXI, we are unable to clearly differentiate immersion from the constructs of meaning and enjoyment, which is evidenced by the results on discriminant validity. As such, more theoretical work is necessary to distinguish immersion from other constructs and subsequently achieve a more robust operationalization.

Model behind the PXI
One question we wanted to answer with the present study was which theoretical model should be employed when working with the PXI. As stated in the related work, the information provided by the authors, both in the original paper and on the PXI's website, offered no definitive suggestion on what exact theoretical model should be used for the PXI, whether a model should include higher-order factors, such as for the functional and psychosocial consequences, and how the suggested items for enjoyment relate to the items of the PXI. Furthermore, our literature review of the usage of the PXI also showed different measurement models being used for the scale.
Based on the present results, especially from the CFAs, we found the most robust evidence for a ten-factor model with one factor per construct of the PXI. A simple ten-factor model exhibited a better model fit compared to more complex models that also considered higher-order factors for the psychosocial and functional consequences or an overall factor for PX. Consequently, we recommend that authors working with the PXI stick to a model with one factor per construct and do not form higher-order scores for the psychosocial and functional consequences or an overall PX factor. Such a ten-factor model is also in line with most of what the original authors themselves suggest, both in their paper and on the official website of the PXI. Finally, we also considered the enjoyment items the original authors suggested to be used alongside the PXI, but whose psychometric quality had yet to be investigated. Our results showed that the enjoyment items perform comparably well to the PXI items. At the same time, their inclusion in the scale and the resulting consideration of an enjoyment factor and an 11-factor model did not negatively affect the quality of the overall scale. Thus, these items can be used alongside the PXI in good conscience.

Weak evidence for a general player experience score
One common discrepancy between the PXI's theoretical foundation and its application in practice is the averaging of all items into one overall score of PX. As seen above, we find that the statistical model does not lend itself to this interpretation of the scale, with a better model fit exhibited by those models that do not contain a higher-order factor of PX. However, some researchers average all responses to items of the PXI into one singular score. For example, one paper employed the PXI to score their game on reaching a certain number of points out of the 90-point total score achievable in the PXI. While such a general PX score was neither validated in the original paper nor theoretically proposed by the original authors, we investigated such a model to see whether an average score would be appropriate.
The results showed that the introduction of such a general PX factor into the model worsened the model fit compared to a ten-factor model without a higher-order factor. Given this finding, we caution both researchers and practitioners against using the PXI to measure an overall construct of PX and recommend instead interpreting the responses to the individual constructs with intention and care. Furthermore, we see two additional reasons that speak against a general PX score. First, digital games and other interactive media relevant to the PXI come in a wide variety of forms with many different goals.
A certain digital game might not have been designed to provide ease of control and instead was designed for a particularly high level of difficulty. A low score in ease of control would thus not mean that this is a design problem to be fixed. Second, there are no guidelines or cut-offs from the original authors as to what would constitute a satisfactory score, e.g., for enjoyment. Therefore, we cannot recommend applying the PXI to determine whether a game has good or bad PX in a simplistic manner.

5.3.1 Applicability of the PXI. Consequently, we find the strength of the PXI in comparing different games and, specifically, different versions of the same game in terms of their experiential quality. Indeed, the PXI provides a variety of relevant constructs along which player experiences can be compared and contrasted. For practitioners, the PXI can aid in the incremental development of games and in testing for version improvements. This interpretation of the PXI's strengths is also in line with recommendations by the original authors. The applicability regarding comparison is further enabled through the use of the PXI bench, which has collected PXI data across different games and genres [25]. For researchers, we find the PXI useful as a dependent measure for comparing player experiences along its constructs, similar to its applicability for practitioners. However, theoretical engagement with the constructs prior to measurement is still required, especially regarding constructs such as meaning or immersion, as they are not fully differentiated in theory or construct validity.

LIMITATIONS
As a first limitation, we did not have participants interact with a digital game but rather recall a memorable game using the critical incident technique well-established in HCI [e.g., 5,61]. While this task is comparable to those used in other work on the PXI, we cannot exclude that the chosen experimental task influenced certain results, such as the low item variance for some items. Because participants were instructed to think of a game they recently played or remembered well, they presumably mainly chose games they liked, which might have caused the low variance for some items, such as enjoyment. This is also reflected in the skewed distribution of responses to most items in the present study. While this is likely also due to the wording of the critical incident technique, which probably caused participants to pick games they generally enjoyed, it does affect the generalizability of the present results to other contexts. More research is needed in this regard to investigate whether this is an issue generated by the research methodology used or a general challenge affecting the applicability of the PXI. Furthermore, it is possible that participants could not remember the games very well, thus influencing their responses compared to actual interaction with a digital game and, consequently, the external validity of the reported findings. Hence, the psychometric quality of the PXI still needs to be further investigated after actual interactions with digital games. This approach would likely be closer to the scale's intended use compared to the critical incident technique, increasing the external validity of such results. Initial findings in this regard were already reported in the original work on the PXI [66], and just recently, Haider et al. [27] reported on a preliminary investigation of the miniPXI's potential to evaluate prototypes during game development.
Second, we collected data using an online study setting. While this is comparable to the procedure used in past work on the PXI and allowed us to collect the sufficiently large sample needed to conduct certain analyses (e.g., CFA), results from data collected in a lab study might differ from the ones reported here.
One general limitation of the statistical analysis of construct validity is the influence of the chosen wording per item on the consistency of the subjects' responses. Maul [44] found that items consisting of nonsensical expressions, but with consistent wording, would still show acceptable fit in factor analysis. Indeed, the PXI is constructed of items that display consistent wording within their respective sub-factor. For example, all items relating to autonomy begin with the wording "I felt [...]", and two of the three end with "[...] I wanted to play this game". These choices regarding wording have an influence on the statistical validation process. However, we cannot account for the magnitude of this influence. Furthermore, such consistent wordings can also lead to complex sub-factors, such as immersion, being limited in the breadth with which they capture the intended experience. Statistical validation cannot account for content validation, and therefore, we fundamentally cannot determine whether the items presented in the PXI genuinely reflect the experiences they aim to measure [2]. These challenges to construct validity are as old as the method itself [4]. As such, we aim to provide a fair and balanced interpretation of our work and of the general findings on the evidence of validity for the PXI rather than a definitive endorsement of the measurement.

FUTURE WORK
As mentioned above, the present study worked with self-reported experiences collected using an online survey. While this procedure closely matches past work on the PXI, it comes with certain limitations. Consequently, it would be interesting to see how the PXI performs in a lab study setting more comparable to a GUR evaluation in the industry. Initial evidence for the PXI's performance in such a setting was reported in the original paper, indicating that the PXI had configural but not metric invariance between an online study collecting recalled experience data and ratings from experimental investigations or play tests relying on immediate recall after playing [66]. While out of scope for the present work, gathering additional data on the scale's performance in such a setting would be intriguing for future independent validations of the PXI to see how the quality of the scale compares across settings. We further see an opportunity for future research to investigate whether the PXI can differentiate between different versions of the same game, for example, after improvements and changes have been made, and whether changes made to particular aspects of the game are also reflected in corresponding ratings on the PXI (e.g., audiovisual design changes resulting in a different score for the PXI's audiovisual appeal rating). Such efforts could also be used to examine the PXI's criterion validity, for example, by showing that the experimental manipulation of certain game design elements leads to changes in the respective constructs of the scale. Furthermore, our results were strongly positively skewed. Therefore, we see potential for future work to investigate whether the PXI can differentiate between different player experiences, such as comparing particularly positive experiences to mainly negative experiences with the same or other digital games. In addition, the present work did not investigate the psychometric quality of the 11-item short version of the PXI, the miniPXI [26]. While beyond the scope of the current study, re-investigating the quality of this scale version poses an additional opportunity for future work. Finally, while immersion was conspicuous in our sample, previous research on the PXI has not reported comparable problems for this specific construct but rather for others (e.g., a low AVE for ease of control in the original PXI paper). To deepen the understanding of the PXI's psychometric quality and the stability of its constructs across various settings and populations, researchers who use the PXI should, if the sample size permits it, investigate the psychometric quality again or otherwise provide their data so that future validation studies can conduct such analyses.

CONCLUSION
The present paper reported on a large-sample independent validation of the PXI, a scale measuring the psychosocial and functional consequences of playing digital games. In a pre-registered online study, 1518 participants rated a recent or memorable digital game using all items of the PXI and a selection of related scales. Results showed that the PXI performs well, with common indicators of psychometric quality delivering acceptable to excellent results. Furthermore, the results showed that the enjoyment items proposed to be used alongside the PXI are also of good quality and can thus be employed alongside the scale. However, immersion was identified and discussed as a challenging construct, as it could not be clearly distinguished from meaning or enjoyment. Finally, the results demonstrated that the theoretical model behind the PXI is best understood as consisting of one individual factor per construct of the PXI, without any higher-order factors. Overall, the results demonstrated that researchers can confidently use the PXI in their studies.

FUNDING AND DECLARATION OF CONFLICTING INTERESTS
This work was financed entirely by our research group, as we received no additional funding. The authors declare that the research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTION
SACP, FB, and LFA conceived the initial idea and designed the online study. SACP implemented the online study with support from LFA. SACP and NvF pre-registered the study and conducted the data analysis. SACP and LFA wrote the initial draft. SACP, NS, FB, NvF, KO, and LFA contributed to the final version.

Table 2:
Descriptive statistics for all collected measures. Responses could range from -3 to +3 for the PXI and AttrakDiff, and from 1 to 7 for the other scales.

Table 5:
AIC, BIC, and results from χ² difference tests for the comparison of the nested models one, two, and three of the PXI. The χ² column contains standard (non-scaled) test statistics. The χ² difference test used a Satorra and Bentler [59] correction.

Table 6:
Coefficients α and ω for the PXI, including 95% confidence intervals, as well as composite reliability (CR), average variance extracted (AVE), and maximum shared variance (MSV) for the subscales. Note: Both coefficient ω and CR are reported, although these terms refer to the same statistic, given that different methods were used to calculate these values (i.e., CR based on the 11-factor CFA and ω with the MBESS R package [36,37]).

Table 7:
Square root of AVE (in bold) and inter-construct correlations for discriminant validity.

Table 8:
Correlations between constructs of the PXI and conceptually related constructs from the PENS, AttrakDiff, and IMI, including 95% confidence intervals.