The Perception of Agency

The perception of agency in human-robot interaction has become increasingly important as robots become more capable and more social. There are, however, no accepted or consistent methods of measuring perceived agency; researchers currently use a wide range of techniques and surveys. We provide a definition of perceived agency, and from that definition we create and psychometrically validate a scale to measure perceived agency (PA). We then perform a scale evaluation by comparing the PA scale constructed in experiment 1 to two other existing scales. We find that our PA and PA-R (Perceived Agency–Rasch) scales provide a better fit to empirical data than existing measures. We also perform scale validation by showing that our scale shows the hypothesized relationship between perceived agency and morality.


INTRODUCTION
Does a person who observes or interacts with a robot think it is making its own decisions? Or has it been programmed for that exact situation? These questions concern perceived agency and have implications for design [46], law [4], interaction studies [77], philosophy [10], morality [5], and social psychology [32]. While agency has been well studied, there are also many overlapping concepts: animacy, mind perception, anthropomorphism, intentionality, and others. These concepts are similar to but distinct from perceived agency. Animacy in robotics focuses on making the robot lifelike, frequently focusing on how the robot moves [6]. Anthropomorphism concerns attributing human characteristics or behavior to a robot or other non-human entity [42]. Mind perception is concerned with how people conceptualize others' minds: the number of dimensions and what those dimensions are [31,55,90]. Intentionality is how deliberate and goal-directed a robot's actions are and is frequently associated with perceived agency [75]. In this report, we focus on perceived agency.
In their seminal work from 2007, Gray et al. [31] explored how people think about other people's minds. Specifically, they were interested in the number of dimensions that people thought others' minds consisted of. They found, contrary to then-current beliefs, that people conceptualized others' minds along two dimensions: experience (the extent to which an entity is capable of being hungry, feeling rage, desire, pleasure, pain, etc.) and agency (the extent to which an entity is capable of recognizing emotions, having self-control, planning, communication, morality, and thought). Gray et al. [31] used Principal Component Analysis (PCA) to analyze their data. PCA and factor analysis are statistical approaches that reduce the dimensionality of large datasets. For example, Gray et al. found that when individuals answered questions like "Which one do you think is more capable of feeling hungry, a robot or a 5-year-old girl?" on a 5-point Likert scale, some capabilities were answered similarly (e.g., feeling hungry and feeling pain had comparable scores across a range of entities). Some capabilities were associated with each other more frequently than with another set. Each set of questions that were strongly correlated with each other but less correlated with other questions could be considered a dimension or factor. These dimensions are considered latent: not directly observed but inferred from the combination of associated questions.
A decade later in 2017, Weisman et al. [90] changed the original methodology and suggested that instead of two dimensions of mind perception, there were three. Weisman et al. argued that the three dimensions of mind perception are body (e.g., getting hungry, experiencing pain, feeling tired), heart (e.g., feeling love, having a personality), and mind (e.g., remembering things, detecting sounds). Both the experience and agency concepts from Gray et al. [31] were scattered across all body, heart, and mind dimensions of Weisman et al. [90].
In 2019, Malle [55] used a methodology similar to that of Weisman et al. [90], but used different initial items. Interestingly, Malle also found three dimensions, but they were slightly different from those of Weisman et al. [90] and in some cases showed five dimensions. Malle found affect (positive and negative emotions and feelings), moral (e.g., telling right from wrong)/social cognition (planning and theory of mind), and reality interaction (verbal communication and moving through the environment). Again, both the experience and agency concepts loaded on different dimensions. All three of these studies used a strong bottom-up approach to search for items that were associated with mind perception.
Because neither Weisman et al. [90] in 2017 nor Malle [55] in 2019 found evidence that agency was one of the core dimensions of mind perception, other researchers have been understandably uncertain about the status of perceived agency and how to measure it. We believe that while agency is not a core dimension of mind perception, it can be measured as a component of how people perceive other entities, much like the Robotic Social Attribute Scale (RoSAS) measures warmth, competence, and discomfort of robots [11] or how the Multi-Dimensional Measure of Trust (MDMT) measures different dimensions of trust [87].
Previous experimental work in perceived agency has focused on determining whether non-organic entities (i.e., robots, Artificial Intelligence (AI) characters) can be perceived as having agency and what cues lead people to judge whether an entity has agency. Multiple researchers have shown that people do, in fact, ascribe agency to non-organic entities. For example, in 1944, Heider and Simmel [36] constructed an animation of geometric shapes and noticed that people frequently ascribed goals, emotions, and perceived agency to the shapes [75]. This work launched an entire subfield investigating how adults and children perceive intentional motion and the relationship to goal-directed cognition and perceived agency [26,75,76].
Researchers originally hypothesized that when an entity looks like or acts "like a person," the entity is more likely to be perceived as having agency [42,57]. Later researchers, however, have attempted to find better and more specific cues over the general "like a person" hypothesis. For example, in 2010, Short et al. [77] used a very clever experimental paradigm with a robot playing a game of rock paper scissors to examine perceived agency. In one condition, the robot played in a standard way throughout multiple rounds with a participant. In another condition, the robot seemed to make a mistake when calling out who won or lost. In a final condition, the robot actively cheated by changing its throw after both the robot and the participant had completed the round. The cheating robot was perceived as having more agency than the other two robots.
Another group of researchers has shown that social norms may be a cue that leads people to believe that a robot has agency. For example, in 2019, Korman et al. [45] found that robots that follow social norms are perceived as having more agency than robots that disregard social norms or that seem to make a mistake. In 2020, Yasuda et al. [94] refined this hypothesis and found that a robot that cheated was perceived as more agentic than robots that committed other types of social norm violations (cursing or insulting), suggesting that cheating itself may be one of the features that encourages people to think of robots as having agency.

Measuring Perceived Agency
It is clear from this brief review that a great deal of research has already been done on perceived agency, and a large number of claims have been made about perceived agency. However, this review also masks a serious problem: we do not have a reliable, robust, theoretically meaningful method of measuring perceived agency. This problem can be shown by examining how a number of influential papers from the past few years have measured perceived agency. Some researchers have measured perceived agency through qualitative coding of written comments [77,94]. Another group of researchers have used overlapping concepts of animacy or anthropomorphism to make claims about perceived agency [6,91]. A different group of researchers have used idiosyncratic measures of perceived agency where they created measures of perceived agency for their specific study or used incomplete scales from other sources (e.g., [33,48,68]); unfortunately, all of the idiosyncratic measures were different from each other, and each paper made strong claims about perceived agency. Finally, some of these measures show inconsistent results across experiments [54,77,94].
This lack of a good measurement tool inhibits our theoretical understanding not only of what perceived agency is but also of how it impacts other constructs (or vice versa). Because the measurement of perceived agency is so different across studies, the conclusions and opportunities for replication are limited. All of these reasons suggest that a reliable method of measuring perceived agency is needed to advance the theory and practice of Human-Robot Interaction (HRI). Our goal in this article will be to construct a method for measuring perceived agency in entities of all types.

PERCEIVED AGENCY: DEFINITION
The first step in most survey development research is to create a strong conceptual definition [8,59]; this definition can then be used to construct or select items that are consistent with the definition. We take the 1978 work of Dennett [19] as inspiration and suggest the following: People perceive agency in another entity when the entity's actions may be assumed by an outside observer to be driven primarily by its internal thoughts and feelings and less by the external environment.
The importance of another's thoughts and feelings in the perception of agency has been highlighted before, both in Section 1 and by others [31,40,75]. We felt that it was important to include a locus of control component in our definition as well. One of the traditional strengths of robots and AI agents is that they excel at performing repetitive tasks, but usually only in a specific environment. Locus of control is an individual's perception about the causes of their actions, whether self-generated or caused by external forces [64,72,81]. Thus, our definition allows the possibility that the external world could be a possible cause of an entity's actions.

GENERATING MEASUREMENT INSTRUMENTS
The most common method of generating and validating a scale is to use factor analysis [25,69]. The general approach to construct a validated instrument from factor analysis is described in detail by others (e.g., [8]), and the factor analytic approach has had success in HRI [11,63,87]. The factor analytic approach to survey construction typically consists of generating a large number of possible items that relate to the dimension of interest. Participants then use those items to rate an entity, an interaction, or themselves. Factor analysis provides loadings that describe how related each item is to different dimensions (factors). Items that highly load on a specific factor can be considered consistent with that factor. Different factors are usually considered different aspects of the primary area of interest.
In the factor analytic approach, items are selected to maximize reliability, which leads to items that are similar in terms of endorsability [22,78]. This is an excellent approach when the researcher is attempting to understand the many dimensions and nuances of the construct (e.g., the mind perception work described earlier). Indeed, in 2002, Smith [80] suggested using factor analysis when the data have multiple uncorrelated factors.
The factor analytic approach has at least two disadvantages when attempting to create a measurement scale. First, because factor analysis identifies how close an item is to the underlying latent variable, it can be more difficult to select items that cover a wide range of the latent variable. This can make it more difficult to differentiate levels or amounts of the specific dimension the scale is measuring.
Second, the ideal of scale development is to measure a single dimension, or latent factor [17,60,71]. A single dimension is desirable because it makes interpretation and understandability easier and more straightforward. When a construct does have multiple dimensions, difficulties in interpretation can arise because the analyst needs to show how the multiple factors create a general factor [71]: this occurs more commonly for understanding the factor's structure and less when the focus is scale creation. Factor analysis can measure and show unidimensional constructs, but the number of dimensions in factor analysis is still a hotly debated topic [69].
A final feature of factor analysis is that it measures the latent construct of a person and does not account for an external target. Usually this is not a concern: an individual's attitudes or opinions can be well measured. However, when the target is an external event or entity, factor analysis has no way to account for differences in those external targets.
Because our intention is to construct a unidimensional scale of perceived agency about external entities (e.g., robots or AIs), we will be using a Rasch analysis.
The Rasch model is a mathematical formulation that describes the relationship between raters, items, and entities and is part of the item response theory framework. All three components are measured on the same latent scale, which uses the logit as the unit of measurement. Rasch analysis models the fact that raters, entities, and items can all vary along the latent variable: in our case, some raters will have a (pre-)disposition to believe an entity has more or less perceived agency; an item may be easier or more difficult to agree with; and an entity may have more or less perceived agency. Because Rasch puts all three components on the same measurement scale, it is straightforward to determine the location of each rater, item, or entity.
Rasch models have measurement invariance [9,21,23]: when a set of observations fits a Rasch model, entity measures are invariant across different sets of items or raters, and items and raters are invariant across different entities. Measurement invariance suggests that test scores are sufficient statistics for estimating rater measures. Measurement invariance is tested by fit statistics [79]; unidimensionality and reliability can be measured as well.

Rasch Analysis
Our approach will be to have raters (participants) answer items (survey questions) on different entities that will be judged. Each of these "facets" is a separate source of information and bias, and each can be measured along the same dimension. Rasch analysis can construct measurements for each element in each facet.
Items in a Rasch analysis perform best if there is a range where some items are easier to agree with and some are more difficult to agree with. Each item will be expected to measure some aspect of the latent trait (perceived agency) that we are interested in.
Entities in our case will consist of a variety of videos that show a robot, AI character, or person performing some task. Like items, entities will be expected to have a range of perceived agency.
Raters are people who watch a video and answer items about the entity. A rater who may be more likely to agree that many entities have some amount of perceived agency would be considered to have more of the latent value. Similarly, a rater who felt that very few entities could have much perceived agency would be scored as having relatively little of the perceived agency latent value. These latent values for each rater can be considered a (pre-)disposition for whether an entity may have perceived agency.
We can operationalize these intuitions using a Rasch rating scale, which can be defined as
\[ \ln\left(\frac{P_{eirc}}{P_{eir(c-1)}}\right) = \theta_e - \beta_i - \alpha_r - \tau_c \qquad (1) \]
where
- c is the category of the rating scale or the Likert value (in our case, it will be 1-5),
- P_eirc is the probability of entity e receiving a rating of category c of item i from rater r,
- P_eir(c−1) is the probability of entity e receiving a rating of category c − 1 of item i from rater r,
- θ_e is the amount of perceived agency of entity e,
- β_i is how difficult the item i is to agree with,
- α_r is the severity or (pre-)disposition of rater r, and
- τ_c is the difficulty of receiving a rating of category c relative to a rating of category c − 1.
The category value τ_c is the location where adjacent categories c and c − 1 are equally probable to be observed; these locations are also known as Rasch-Andrich thresholds [51].
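The threshold property can be checked numerically. The following is a minimal sketch, assuming the adjacent-category form ln(P_c / P_(c−1)) = θ − β − α − τ_c with α entering as rater severity; if α is instead reported as a sign-reversed leniency, its sign should be flipped before use. The function name and the threshold values are hypothetical and are not estimates from this study.

```python
import math


def category_probabilities(theta, beta, alpha, taus):
    """Return P(rating = c) for c = 1 .. len(taus) + 1.

    Assumed model (adjacent-category logit):
        ln(P_c / P_(c-1)) = theta - beta - alpha - tau_c
    taus[k] is the threshold between category k + 1 and category k + 2.
    alpha is treated as rater severity (an assumption of this sketch).
    """
    step = theta - beta - alpha
    # Cumulative log-numerators; category 1 is the reference with value 0.
    log_num = [0.0]
    for tau in taus:
        log_num.append(log_num[-1] + step - tau)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_num)
    weights = [math.exp(value - peak) for value in log_num]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical thresholds for a 5-point scale (illustrative values only).
taus = [-1.5, -0.5, 0.5, 1.5]

# An average entity, item, and rater: the middle category is most likely.
print(category_probabilities(0.0, 0.0, 0.0, taus))

# When theta - beta - alpha equals a threshold (here 0.5), the two
# categories on either side of that threshold are equally probable.
print(category_probabilities(0.5, 0.0, 0.0, taus))
```

Building the probabilities from cumulative sums of the adjacent-category logits keeps the sketch close to the model's definition; a production analysis would rely on dedicated Rasch software.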
The Rasch model is an additive linear model based on a logistic transformation of ratings to a logit scale. Critically, all facets (entities e, raters r, items i) are on the same logit scale, and all can influence the final rating. Conceptually, this means that the logit scale represents the latent value or dimension: the amount of perceived agency.
The Rasch model makes some basic assumptions about measurement [9,16,74,93]. For example, if a rater with a high (pre-)disposition of perceived agency (α) gives an entity with a high perceived agency (θ) an especially low score on an item (β), that item may have a measurement problem or misfit. Rasch models allow us to find and inspect these misfits; entities, items, or raters that have consistently large misfits suggest a concern: the video may be misleading; an item may be confusing; or a rater may be answering randomly. Rasch models have several strengths: they generalize across entities and raters (e.g., different robots or AIs can be measured accurately by different raters), they perform measurements on an interval scale (not an ordinal one), allowing parametric statistical analysis, they can identify items or entities that do not behave as expected, and they produce an ordered set of items and entities.

EXPERIMENT 1: SCALE CONSTRUCTION
The goal of experiment 1 was to generate a set of items that could accurately measure perceived agency across a wide range of entities. The focus here will be on robots, but we will also include AI characters and humans.

Method
All studies, including this one, were approved by the NRL IRB. All participants consented to participate.

Participants.
The suggested number of participants for a Rasch analysis to achieve a 95%+ confidence of measures within .5 logits is 150 [49]. A total of 195 participants were recruited through Cloud Research and paid $12 for participation in the study; 9 participants were removed because they missed an attention check ("has a face"), leaving 186 participants. The average age of participants was 35 (SD = 12) years. A total of 108 participants were women, 77 participants were men, and 1 participant was unreported. The study took 29 minutes on average.

Materials (Videos).
A total of 14 videos were selected and collected from a wide range of sources, including YouTube, academic conference proceedings, and personal communication with leaders of the field in robotics. The majority of our stimuli were robots (10), but also included AI agents (2) and humans (2). The entities portrayed a range of engagement with people (from none to speaking and interacting), and the non-human entities had different morphologies, differed in their sensing, and had different perceptual, navigation, mobility, cognitive, and social capabilities. The videos were divided into two groups of seven that, according to pilot testing, had a comparable range of perceived agency.
Table 1 provides a label, the group the entity was placed in, a brief description, the morphology of the robot, and a citation of the source of the robot. The citation of each video is either a YouTube location or a paper or website describing the video.
Our goal was to keep the videos between 30 seconds and 3 minutes. In some cases, the video was trimmed or cut. In all cases, we attempted to show the core aspects of the entity and their activity while making sure that participants would not become bored while watching the video.

Materials (Survey Items).
Data collections A and B in the appendix show initial attempts at item development for perceived agency. Based on those data collection efforts as well as difficulties that participants in those studies mentioned, we created a set of items that captured core aspects of thoughts and feelings, and the impact of the external environment on an entity.
In addition to items that covered thoughts, emotions, and environmental impacts on behavior, we also included two integrative items. The actor scenario was "Imagine the robot/character/person was asked to be an actor in a local theater production. How well do you think they would do?" The dinner scenario was "Imagine the robot/character/person was asked to host a dinner party for your friends next weekend. This includes coming up with a menu, cooking, and hosting. How well do you think they will do?" These integrative items were included to examine whether only a combination of thoughts, features, and environmental impacts on behavior would be able to predict perceived agency.
As mentioned previously, Rasch analysis benefits from having items that range in their difficulty to agree. In this set, items ranged from those that were on average easy to agree with (e.g., "acts with purpose") to those that were on average more difficult to agree with (e.g., "can show emotions to other people").
A complete list of the survey items used is shown in Table 2. All items were on a 5-point Likert scale.

Procedure.
Participants were randomly placed in either group 1 or group 2. After answering a series of demographic questions, participants were given a brief description of the task and told they would answer a series of questions about seven different videos. For each of the seven videos, each participant was randomly shown one of the videos from the group they were assigned. At the end of the video, they were taken to a single page with the same video that they could watch again if desired. They were first asked to describe the video in at least one sentence. Next they were asked to answer the survey questions in Table 2 about the entity in the video. A thumbnail of the entity was provided as well to reduce confusion. The words "The robot/character/human" were at the top of the column for the survey questions.
After participants completed all items, they could advance to the next video, and the entire process repeated for each of the seven videos. All Likert questions had to be answered to progress to the next video. After the fourth video, the participant was offered a break before continuing. Additionally, to provide pacing for the participant, the number of videos that had been seen and the number remaining were provided (e.g., four out of seven).
Finally, at the end of the session, participants were invited to provide experimental feedback.

Results
We performed a Rasch analysis using Facets version 3.83.6 [52]. Because there were two groups and we wanted to create an integrated scale for both groups and all items, we linked the two groups by assuming the two groups had equal amounts of the latent value. All raters came from the same population and data was collected concurrently for both groups (alternating). Entities are typically ordered from highest to lowest, but other facets are conventionally reversed; here, items and raters have their sign reversed so that items that are more difficult to agree with and raters that are the most lenient have the highest latent value.
The Rasch model will be evaluated by (1) examining the extent of item unidimensionality; (2) examining reliability and separation; and (3) examining fit statistics for entities, raters, and items.

Unidimensional.
Using a Rasch analysis for scale construction works best when one latent variable is sufficient to explain most of the variation in the responses. One common way to examine whether the items are measuring a single latent dimension is to perform PCA of the standardized residuals [14]; if the eigenvalue of the first contrast is 2 or more, there is evidence that another dimension exists in the data [66]. PCA of the standardized residuals showed that the eigenvalue for the first contrast was 1.0, suggesting that the residual was noise, not another latent factor. This result suggests that the resulting logit scale was unidimensional.
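To make the first-contrast check concrete, the sketch below computes the largest eigenvalue of the correlation matrix of standardized residuals. It assumes the residuals are already available as one list per item; the data, the function names, and the use of power iteration (chosen only to keep the example dependency-free) are illustrative assumptions and do not reproduce the Facets computation.

```python
import math


def correlation_matrix(columns):
    """Pearson correlations between item columns of standardized residuals."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    k = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]


def largest_eigenvalue(matrix, iterations=500):
    """Power iteration; adequate for a small symmetric correlation matrix."""
    k = len(matrix)
    # Slightly uneven start so it is unlikely to be orthogonal to the answer.
    vec = [1.0 + 0.01 * i for i in range(k)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient of the converged unit vector.
    return sum(
        vec[i] * sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)
    )


def first_contrast_eigenvalue(residual_columns):
    """Eigenvalue (in item units) of the first residual contrast."""
    return largest_eigenvalue(correlation_matrix(residual_columns))


# Hypothetical residuals: items A and B share leftover structure, C does not.
item_a = [1.0, -1.0, 1.0, -1.0, 0.5, -0.5]
item_b = [1.0, -1.0, 1.0, -1.0, 0.5, -0.5]
item_c = [1.0, 1.0, -1.0, -1.0, 0.0, 0.0]
print(first_contrast_eigenvalue([item_a, item_b, item_c]))
```

In this made-up example two items share residual structure, so the first contrast reaches the strength of two items and would flag a possible second dimension.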

Reliability.
Rasch analysis provides two different measures for reliability. The first, separation, indicates how many different levels can be distinguished. A small separation value suggests that different levels cannot be distinguished, whereas a larger value is more desired for measurement. The second reliability measure, separation reliability, is equivalent to Cronbach alpha reliability and is a measure of internal consistency. Separation reliability ranges from 0 to 1; over .8 is considered acceptable for scale creation.
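The two measures are linked. In the standard Rasch formulation (stated here as background, not taken from the text above), separation G is the ratio of the "true" standard deviation of the measures to their root-mean-square error, and separation reliability is R = G² / (1 + G²). A minimal sketch of the conversion, with hypothetical function names:

```python
import math


def reliability_from_separation(separation):
    """Separation reliability R = G^2 / (1 + G^2)."""
    return separation * separation / (1.0 + separation * separation)


def separation_from_reliability(reliability):
    """Inverse relation: G = sqrt(R / (1 - R)), defined for 0 <= R < 1."""
    return math.sqrt(reliability / (1.0 - reliability))


# A separation of 4.3 corresponds to a reliability of roughly .95.
print(round(reliability_from_separation(4.3), 3))

# The minimum acceptable reliability of .8 corresponds to a separation of 2.
print(round(separation_from_reliability(0.8), 3))
```

Because reliability saturates near 1 as separation grows, the separation index is the more informative of the two when measures are very well distinguished.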
The separation reliability value for entities was > 0.95, and the separation index was 31.2. These values indicate that more than 30 levels of entities can be distinguished with this scale and that there was very high reliability.
The separation reliability value for raters was .95, and the separation index was 4.3. These values indicate that approximately four levels of raters can be distinguished with this scale and that there was very high reliability. Some raters were much more predisposed to attributing agency to an entity than others.
The separation reliability value for items was > 0.95, and the separation index was 23.9. These values indicate that the ordering of the items is reliable. In aggregate, reliability and separation are quite high for this experiment.

Fit Statistics.
While traditional factor analysis has a set of methods to determine how well items relate to latent constructs (loadings, item scale correlations, etc.), Rasch uses different methods. Rasch identifies departures in the data for persons, items, and even data points from the ideal of unidimensionality. These are reported with fit statistics that guide the improvement of the instrument and point out possible flaws in the data. There are two common fit statistics used for Rasch analysis: infit and outfit. Outfit is an unweighted fit statistic that measures how well the data fit the model; it is the most common method for evaluating Rasch fit and is what we will use here. The outfit statistic is sensitive to large departures from model expectations: if an otherwise high perceived agency rater gives a high perceived agency entity a low score, this would show a high outfit and highlight a potential concern. Low outfits signal that there is very little additional information provided. A low outfit is considered < .5, whereas a high outfit is considered > 1.5 [53]. Rasch analysis provides outfits for each facet: in our case, items, entities, and raters.
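As an illustration of why outfit reacts so strongly to a single surprising rating, the sketch below uses the standard mean-square definitions: outfit is the average squared standardized residual, and infit is the information-weighted version. The observed ratings, model expectations, and model variances are made-up values, and the function names are hypothetical.

```python
def outfit_mean_square(observed, expected, variance):
    """Unweighted mean of squared standardized residuals."""
    squared_z = [
        (x - e) ** 2 / v for x, e, v in zip(observed, expected, variance)
    ]
    return sum(squared_z) / len(squared_z)


def infit_mean_square(observed, expected, variance):
    """Information-weighted fit: squared residuals over summed variance."""
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    return sum(squared_residuals) / sum(variance)


# Hypothetical ratings for one rater: the last rating (1 where about 4 was
# expected) is the kind of large departure that outfit is sensitive to.
observed = [4, 3, 5, 1]
expected = [3.8, 3.1, 4.2, 4.0]
variance = [0.8, 0.9, 0.7, 0.8]

outfit = outfit_mean_square(observed, expected, variance)
infit = infit_mean_square(observed, expected, variance)
print(outfit, infit)
print("high outfit" if outfit > 1.5 else "acceptable or low outfit")
```

Values near 1 indicate that the ratings vary about as much as the model predicts; here the one unexpected rating alone pushes the outfit well above the 1.5 cutoff.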
Recall that all latent values are on a logit scale with a mean of 0 and a standard deviation of 1. Logit scores have an infinite range but typically fall within ±5. Table 3 shows the modeled latent value for items, β (see Equation (1)). The higher the value of β, the more difficult it is to agree that an entity has perceived agency. Thus, it is relatively easy to agree that most tested entities act with purpose (β = −1.51), but it is much more difficult to agree that the tested entities would do well as an actor (β = .90) or can show emotions to other people (β = .64). Table 3 also shows the outfit for items. All items are within the acceptable outfit range of 0.5 to 1.5 except for "acts with purpose," which has an outfit of exactly 1.5.
Table 4 shows the modeled latent value for entities, θ (see Equation (1)). The higher the value of θ, the more perceived agency the entity was measured to have. Not unexpectedly, the humans (teacher and musician with θs of 2.6 and 1.7, respectively) have the highest rated perceived agency, whereas the most repetitive robots have the least (palletizer with θ of −1.0). Table 4 also shows each entity's calculated outfit. All entities are within the acceptable outfit range of 0.5 to 1.5 except for the video of the "musician."
We can also examine rater latent values α and outfits. Space considerations prevent us from showing a complete table, but the raters' modeled latent values, α, ranged from 1.93 to −1.80. Of the 186 raters, 26 (14%) had an outfit that was > 1.5. While this number is a bit higher than recommended, it is not too unexpected with online participants. The reliability of the rater metrics was .95, which is excellent. Overall, most of the participants were well modeled by the Rasch analysis.
There are several options to deal with high outfitting entities, items, or raters. Raters that have high outfits can be removed (trimmed), but because Rasch enforces a normal distribution, other raters are likely to become tails. It is also possible to remove some selected scores for raters that had high outfits, but that approach seemed unnecessary given the overall reliability of the data. Since we collected a large number of raters and the impact of individuals is relatively minor (verified by removing the highest 10% outfitting raters), we kept all raters who passed the attention check.
After observing that "musician" had a high outfit, we examined the comments that raters made on that specific video. Musician was a video of a young woman playing a short concert using a trumpet and a flugelhorn (four-way splitscreen). A bit to our surprise, many raters thought that the video showed a robot playing a musical instrument, indicating a fundamental misreading of the video. Re-running the Rasch analysis without the musician showed that all entities and all items were within acceptable ranges (.5-1.5); revised results are shown in Tables 3 and 4 under the "Revised" headings.

Discussion
Experiment 1 successfully developed a scale to measure perceived agency. Items were based on our definition, and a wide range of entities were rated by more than 180 people. We found that the scale was unidimensional and had excellent reliability and separation for each of the facets (entities, items, raters).
There was one item and one entity that showed a moderate outfit; examination of rater comments suggested that one of the entity videos was misinterpreted, so it was removed from the analysis. After removing the high outfitting video, a Rasch analysis confirmed that all items and entities were within acceptable ranges.

Scale Usage.
After a scale is created, most researchers will apply it by averaging all items together for a single value, which is then analyzed using traditional statistics (t-test, ANOVAs, linear regression, etc.). Of course, running parametric statistics on nominal or ordinal data is not recommended because the data lose their inherent meaning (e.g., the difference between 1 and 2 is not necessarily the same as the difference between 4 and 5). Averaging is used because it is easy and because there is usually a monotonic relationship between the raw items and the latent dimension the researchers are trying to measure. Rasch analysis converts the individual scale items into an interval measurement scale, which then can be used in parametric statistics. In our case, we are interested in the amount of perceived agency an entity is judged to have, which is θ in Equation (1) and shown in Table 4.
As Equation (1) shows, calculating the Rasch measure requires knowing the values for each item's β and each person's α. We can calculate an individual's α by asking them to rate known entities, which are used as calibration videos. These calibration videos will allow us to determine an individual's α using the algorithm described in the work of Linacre [50]. Then we can combine each rater's α (predisposition, estimated from the calibration videos), each item's β (item difficulty from Table 3), and the rater's item scores for each entity to determine the amount of perceived agency that a novel entity has according to that rater [50].
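The last step can be sketched as a one-parameter maximum-likelihood problem: with α, β, and the thresholds held fixed, the estimate of θ is the value at which the model's expected total rating equals the observed total. The code below is a generic illustration under that assumption, not necessarily the procedure of Linacre [50]. It assumes the adjacent-category form ln(P_c / P_(c−1)) = θ − β − α − τ_c with α as severity (negate α if it is reported as a sign-reversed leniency); all numeric values and function names are hypothetical.

```python
import math


def category_probabilities(theta, beta, alpha, taus):
    """P(rating = c) for c = 1 .. len(taus) + 1 under the assumed model."""
    step = theta - beta - alpha
    log_num = [0.0]
    for tau in taus:
        log_num.append(log_num[-1] + step - tau)
    peak = max(log_num)
    weights = [math.exp(value - peak) for value in log_num]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_rating(theta, beta, alpha, taus):
    """Model-expected rating on the 1 .. len(taus) + 1 scale."""
    probs = category_probabilities(theta, beta, alpha, taus)
    return sum((k + 1) * p for k, p in enumerate(probs))


def estimate_theta(ratings, betas, alpha, taus, lo=-10.0, hi=10.0, tol=1e-6):
    """Bisection on theta until expected total rating matches the observed.

    The expected total increases monotonically with theta, so bisection
    is sufficient. Extreme totals (all lowest or all highest category)
    have no finite estimate and are rejected.
    """
    n_categories = len(taus) + 1
    total = sum(ratings)
    if total <= len(ratings) or total >= n_categories * len(ratings):
        raise ValueError("extreme score: no finite estimate of theta")

    def gap(theta):
        model_total = sum(expected_rating(theta, b, alpha, taus) for b in betas)
        return model_total - total

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical calibrated values: three items, one rater, 5-point scale.
taus = [-1.5, -0.5, 0.5, 1.5]
betas = [-1.0, 0.0, 1.0]
print(estimate_theta([4, 4, 5], betas, alpha=0.0, taus=taus))
```

Under this convention the same ratings from a more severe rater (larger α) yield a correspondingly higher θ, which is how the calibration step removes rater predisposition from the entity measure.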
Experiment 2 will examine how well the scale constructed in experiment 1 can predict the perceived agency of novel entities.

EXPERIMENT 2: SCALE EVALUATION
The goal of experiment 2 was to evaluate the scale constructed in experiment 1 and compare it to other survey methods of measuring perceived agency. As suggested in Section 1, there are two other existing survey approaches that researchers have used to measure perceived agency. The most common is the work by Gray et al. [31]; these items or a subset of these items have been used by others to measure or explore perceived agency [33,56,84,90]. A greatly reduced set of items to measure perceived agency was created by Korman et al. [45]. The items generated by Korman were not psychometrically validated, but they represent one of the typical [33,48,68] approaches used to measure a latent construct: scour the literature and create a "reasonable subset" of items.
Finally, both the averaged raw items and the calibrated items from experiment 1 will be used. There will be four measures of perceived agency that experiment 2 will evaluate on novel entities: (1) the agency dimension from Gray et al. [31], (2) the agency items from Korman et al. [45], (3) the average of all 13 items from experiment 1, and (4) the logit scale from experiment 1.

Method
All studies, including this one, were approved by the NRL IRB. All participants consented to participate.

Participants.
A Monte Carlo simulation based on experiment 1 effect sizes showed that 70 participants were needed to have an 80% chance of showing a significant ordinal relationship between different entities. A total of 75 participants were recruited through Cloud Research and paid $12 for participation in the study; 10 participants were removed because they missed an attention check ("has a face"), leaving 65 participants. The average age of participants was 39 (SD = 9) years. A total of 32 participants were women, 32 participants were men, and 1 participant was unreported. The study took 47 minutes on average. No participants took part in experiment 1.

Materials (Videos).
Seven new videos were selected and collected using methods similar to those in experiment 1. None of the videos in experiment 2 were used in experiment 1. Table 5 provides a label, a brief description, the morphology of the entity, and a citation of the source. The citation of each video is either a YouTube location or a paper or website describing the video.
Table 6 (partial rows): "Were they aware of engaging in their behavior?" (Korman); "Did they want to perform their behavior?" (Korman).

Materials (Survey Items).
There were three sets of items. One set was developed in experiment 1 and shown in Table 3; these are the PA items. Another set of items came directly from the agency dimension of Gray et al. [31]; these are the GGW items and are shown in Table 6. Finally, the items from Korman et al. [45] and Frazier et al. [24] (under review) are the Korman items and are shown in Table 6. The PA and GGW items used a Likert scale range from 1 to 5, whereas the Korman items used a Likert scale range from 1 to 7.

5.1.4 Procedure.
The procedure for experiment 2 was identical to the procedure from experiment 1 except for three differences. First, because there were three different scales, we kept each set of survey items together, but the order of each block was randomly determined with a Latin square design.
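The block-order counterbalancing in the first difference can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: it builds a cyclic 3 × 3 Latin square over the three item blocks and assigns shuffled participants to its rows in turn. The function names, the seed, and the row-assignment rule are hypothetical.

```python
import random

def latin_square(conditions):
    """Cyclic Latin square: each condition appears once per row and column."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

def assign_orders(participant_ids, conditions, seed=0):
    """Shuffle participants, then cycle through the rows of the square."""
    rng = random.Random(seed)
    square = latin_square(conditions)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: square[i % len(square)] for i, pid in enumerate(ids)}

# Three item blocks from experiment 2; participant IDs are made up.
orders = assign_orders(range(6), ["PA", "GGW", "Korman"])
```

With this assignment each block appears equally often in the first, second, and third position whenever the number of participants is a multiple of three.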
The second difference from experiment 1 was that after all videos had been watched and all items were answered for each video, a ranking screen was displayed.Participants were provided the preceding definition of perceived agency and asked to rank all videos from least to most by dragging a thumbnail of each video to the desired rank.They were able to watch any video again if they desired.When this task was completed, they pushed a submit button.
The third difference from experiment 1 was that participants performed a calibration task for three of the entity videos from experiment 1 by answering the PA items from Table 2. The calibration videos selected were "service" (θ = .85), "cheating" (θ = .26), and "feeder" (θ = −.73); these were selected because they covered a range of perceived agency while not being at the extremes, although in theory any number of the original videos could be used. The data from the calibration videos was used only for the PA-R (Perceived Agency-Rasch) scale and did not impact any of the other scales since it was at the end of the experiment.
Note. Korman [45] and GGW [31] were developed in their respective sources. PA is the average of the perceived agency scale developed in experiment 1, whereas PA-R uses the weights from the Rasch scale developed in experiment 1.

Calculating Scale Values.
For the GGW, Korman, and PA scales, the respective items were averaged to give a single score for each rater for each entity. Because of the way Rasch calculates the logit score, only total measures are calculated; reliability cannot be calculated for the Rasch measure. However, it is possible to calculate reliability for the raw PA scales; reliability was calculated using α and ω total from the psych package [70].
For the PA-R measure, the calibration videos were used to calculate each rater's α, or (pre-)disposition to perceived agency. Each rater's α was then used with their item ratings to calculate a logit value on an interval scale for each entity [50].
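For readers without R, the reliability coefficient for the raw scales can be illustrated with the textbook formula for Cronbach's α. This sketch assumes that the reliability α reported here is Cronbach's α (it is unrelated to the rater α of the Rasch model), omits ω total, and uses made-up ratings; it is not a reimplementation of the psych package.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per observation."""
    n_items = len(rows[0])
    # Sample variance of each item across observations.
    item_vars = [variance([row[i] for row in rows]) for i in range(n_items)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in rows])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)

# Made-up ratings: three observations on a three-item scale.
example = [[1, 2, 2], [3, 3, 4], [5, 4, 5]]
reliability = cronbach_alpha(example)
```

The coefficient approaches 1 as the items move together across observations, which is why a unidimensional scale such as PA is expected to score highly.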

Comparing Scales to Each Other.
It is traditional when generating and comparing scales to show the correlations between the different scales. We expect the scales to have moderate to high correlations to each other, since they all attempt to measure the same underlying construct. Indeed, as Table 7 suggests, all correlations are moderate to high.

Comparing Each Scale to Empirical Ranking.
Our overall goal was to determine which scale best predicts the rank ordering of the entities that were ranked from least to most perceived agency by participants. An ordinal regression is the most appropriate analysis for ordered data. An ordinal regression uses an ordinal outcome variable (e.g., rank orderings), whereas the predictors can be of any type (categorical, ordinal, interval, etc.). Four different ordinal regression models were created, one for each scale.
Figure 1 shows a graphical representation of the ordinal model fits and the empirical data. Model fits were calculated using each rater's scores for each scale to predict a model rank for each entity using the respective ordinal regression model.
There are several aspects of Figure 1 that should be highlighted. First, notice that all four models fit the empirical data quite well. Even the Korman model [45], which was not constructed or validated in a psychometrically strong manner, fits the overall pattern well. The GGW model [31], which was based on a PCA of an agency dimension and is currently the most common method of measuring perceived agency, does a very good job of capturing the trends in ordering, although it seems to have the most difficulty in the middle range (i.e., RSR1 and Bargaining are nearly identical in model scores, but quite different empirically). The PA model that consists of the average of items from experiment 1 does an excellent job of predicting the empirical ordering. The PA-R model, which was calculated using the items and three calibration videos from experiment 1 to convert ratings into a logit score using Equation (1), also shows an excellent fit to the empirical data. The difference between the PA and PA-R models is relatively small. We can empirically compare each of the models shown in Figure 1 to determine which is the best predictor of rater rank orderings.
Most importantly, all four models are significantly better than chance (p < 0.05). We can evaluate how well each model fits the data by using the Akaike Information Criterion (AIC); a lower relative score is better. The AIC statistics derived from the ordinal regression model fits to the data are shown in Table 8. These AIC scores show that the GGW model is significantly preferred over the Korman model, the PA model is significantly preferred over the GGW model, and the PA-R model is significantly preferred over the PA model [89].
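To make the AIC comparison concrete, the sketch below computes AIC from a log-likelihood and converts a set of AIC scores into Akaike weights, a common way of expressing relative preference among models. The log-likelihoods and parameter counts are hypothetical placeholders, not the values in Table 8, and the use of Akaike weights is an assumption about how such a comparison might be summarized rather than a description of the analysis reported here.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def akaike_weights(aic_by_model):
    """Relative support for each model, normalised to sum to 1."""
    best = min(aic_by_model.values())
    relative = {name: math.exp(-(value - best) / 2)
                for name, value in aic_by_model.items()}
    total = sum(relative.values())
    return {name: r / total for name, r in relative.items()}

# Hypothetical (log-likelihood, parameter count) pairs, NOT Table 8.
fits = {"Korman": (-820.0, 7), "GGW": (-805.0, 7),
        "PA": (-790.0, 7), "PA-R": (-780.0, 7)}
scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
weights = akaike_weights(scores)
```

Because all four models here have the same number of parameters, differences in AIC reduce to differences in log-likelihood; the penalty term matters only when model complexity differs.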

Discussion: Experiment 2
Experiment 2 collected data from a new set of raters on a novel group of seven entities that consisted of a large range of perceived agency. These raters answered items from three different surveys on perceived agency, and the results of each of those scales were compared to raters' ranking of the entities.
The ordering of the different entities by the PA and PA-R scale presents a nuanced result of perceived agency. In 2022, Li et al. [48] found that humans were rated as having more perceived agency than robots, but our results suggest that not all robots have the same amount of perceived agency. It is not impossible that some robots could have more perceived agency than some humans, and our PA-R scale has the potential to show these more nuanced differences.
Experiment 2 found that all evaluated surveys were acceptable measures of perceived agency and all better than chance. However, the two best surveys were those developed in experiment 1: PA and PA-R. Both PA and PA-R were substantially and significantly better than the other two methods of measuring perceived agency.

Rasch Analysis
Rasch analysis allowed us to construct a measure of perceived agency where all three important facets (entities, items, and raters) were on the same logit scale. Critically, this allows us to examine the hierarchical order of the items: the item βs are estimates of how difficult it is for a rater to agree to each item. This hierarchy allows us to make some important inferences about how people conceptualize perceived agency. First, the items that raters are most likely to rate highly focus on goals ("acts with purpose" and "has goals"); this is not surprising since an entity without goals can hardly have any perceived agency. In contrast, the items that are the most difficult for raters to rate highly are integrative items (the two scenarios) and emotional items ("can show emotions to other people" and "can change their behavior based on how people treat them"). The integrative items highlight that raters will think an entity has a high degree of perceived agency when that entity apparently behaves according to its thoughts and feelings rather than purely responding to the environment. In addition, when an entity responds based on its apparent internal feelings, it is more likely to be rated highly on perceived agency. This analysis suggests that robots and other entities that seem to behave according to their internal feelings will be likely to be perceived as having agency.
For the remainder of this article, we will use PA-R.

EXPERIMENT 3: SCALE VALIDATION
Previous researchers have suggested that there is a link between morality and agency [7,31,32,83]. One of the implications of this hypothesis is that entities that have more agency should be protected from harm [7,31]. We adapted a study from Strait and Scheutz [83] to explore the relationship between perceived agency and harm as well as to validate our measure of perceived agency.
Our hypothesis was that participants should want entities with higher perceived agency to be kept from harm.

Method
All studies, including this one, were approved by the NRL IRB. All participants consented to participate.
6.1.1 Participants.
We used the pwr package [13] in R to conduct a power analysis for a correlational study. Our goal was to obtain .8 power to detect a medium-sized effect (r = .3) at 0.05 α error probability, so 84 participants were required for this study.
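The order of magnitude of this sample size can be checked with the textbook Fisher z approximation for a two-sided test of a correlation. The sketch below is an approximation under that assumption and is not the computation performed by pwr, so its result can differ slightly from the reported value; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, power=0.80, alpha=0.05):
    """Approximate sample size to detect correlation r (two-sided test),
    using the Fisher z transformation: n = ((z_a + z_b) / atanh(r))^2 + 3."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

required = n_for_correlation(0.3)  # 85 under this approximation
```

For r = .3 the approximation yields 85, within one participant of the 84 reported above.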
A total of 92 participants were recruited through Cloud Research and paid $3.75 for participation in the study; 1 participant was removed because they missed an attention check, leaving 91 participants.
The average age of participants was 41 (SD = 12) years. A total of 43 participants were women, 47 participants were men, and 1 participant was unreported. The study took 16 minutes on average. No participants took part in experiment 1 or 2.
6.1.2 Materials.
Eight different images, two instances of four classes, were selected. The four classes were human (images of a man and a woman), dog (images of two dogs), robot (images of an NAO robot with a high human likeness and a homemate with a low human likeness [65]), and artwork (pictures of a painting and blown glass).
The scenario was "You are walking down the street and you see an office building across the street from you catch on fire. You call the fire department but rush in to see what you can do to help. You enter and see a single room with [randomly presented] a dog, a robot, a person, and artwork. None of them can move on their own; you will need to carry them outside. Unfortunately, you can only move one at a time. Please drag each item in the order you would move them to safety. You realize that the fire is getting worse."
6.1.3 Procedure.
Participants viewed one image of each category (person, dog, robot, artwork) in a random order and answered the perceived agency scale for each entity.
Next, participants were given the preceding scenario and dragged each image in the order they would save it. After they completed dragging each image, they received a message saying, "Congratulations! You managed to save everyone!" Finally, as in experiment 2, participants saw three calibration videos and answered the perceived agency scale for each calibration video.

Calculating Scale Values.
Logit scale values for each image were calculated the same way as in experiment 2. The reliability of the perceived agency scale was quite high as well; ω total = .98; α = .97.

Comparing Each Scale to Empirical Ranking.
Our overall goal was to determine whether there was a relationship between perceived agency and willingness to save an entity or object. Specifically, the higher an entity's perceived agency is, the more likely the entity is to be saved. Statistically, this will be expressed as a negative correlation: higher perceived agency should go with a lower save-order number (i.e., 1 is the first entity/object to save). A simple uncorrected correlation between the saved order of entities and the logit of perceived agency shows there was a strong negative correlation, r(362) = −.67, p < 0.001. While this analysis is not technically correct (i.e., it does not take the ordinal variable into account, nor does it take possible inter-dependencies between participants or entities into account), it does give an understandable metric of the relationship.
We can perform a more sophisticated statistical analysis using an ordinal mixed model. Ordinal regression allows us to use an ordinal dependent variable (ordinal data is assumed to violate normality assumptions because the distance between numbers is not metric). A mixed model allows us to take into account correlations between participant or entity ratings.
We can also examine whether an entity being alive accounts for the preceding negative correlation; it could be that participants will simply want to save entities that are alive (people, dogs) over inorganic objects (robots, artwork).
We used an ordinal mixed model to analyze the effects of the perceived agency scale and whether the entity was alive on the order in which the entities were ranked to be saved, considering random variation across participants and images. The analysis was performed using the R package ordinal [15]. We included the entity-saved order (ordinal, 1-4) as the dependent variable and two independent variables: the logit scores calculated from the perceived agency scale and whether the object was alive or inorganic (binary). We included random intercepts for participants and stimuli.
Unsurprisingly, we found that participants wanted to save entities that were alive sooner than entities that were not, β = −8.00, z = −6.7, p < 0.001. In addition, and consistent with our hypothesis, the greater an entity's perceived agency, the higher the likelihood to save the entity, β = −.26, z = −2.8, p = 0.005.
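The direction of these coefficients can be illustrated with a fixed-effects-only sketch of a cumulative logit model. It assumes the common parameterization logit P(order ≤ j) = τ_j − η, under which a negative coefficient shifts probability toward earlier save positions. The two slopes are the reported estimates; the thresholds τ_j are hypothetical because they are not reported in the text, and the random intercepts for participants and stimuli are omitted.

```python
import math

B_AGENCY = -0.26                 # reported slope for the PA-R logit score
B_ALIVE = -8.00                  # reported slope for alive (1) vs. inorganic (0)
THRESHOLDS = [-6.0, -3.0, 0.0]   # hypothetical cut points (not reported)

def linear_predictor(pa_logit, alive):
    """Fixed-effects part of the model; random intercepts are omitted."""
    return B_AGENCY * pa_logit + B_ALIVE * alive

def rank_probabilities(eta, thresholds=THRESHOLDS):
    """P(order = 1..4) given logit P(order <= j) = thresholds[j] - eta."""
    cumulative = [1 / (1 + math.exp(-(tau - eta))) for tau in thresholds] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs

# Two inorganic entities that differ only in perceived agency (made-up logits).
low_agency = rank_probabilities(linear_predictor(-1.0, alive=0))
high_agency = rank_probabilities(linear_predictor(1.0, alive=0))
```

Under these assumptions, the entity with the higher PA-R score has a larger probability of being saved first, matching the sign of the reported perceived-agency coefficient.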

Discussion: Experiment 3
Experiment 3 examined the hypothesis that when an entity has more perceived agency, people will want it to come to harm less, a component of moral reasoning. We found that an entity's perceived agency did impact the order that it would be saved from destruction. This result goes beyond a simple "save entities that are alive first" heuristic; perceived agency had an effect even after statistically removing the "alive" component.
Experiment 3 also provided construct validity for the PA-R scale: it examined a link between perceived agency and morality that had been suggested in the literature and successfully supported that relationship.
Participants in experiment 3 also used images (not videos) to rate the perceived agency scale. The fact that reliability was excellent suggests that the scale can be used on a variety of different entities and stimulus types.

GENERAL DISCUSSION
The goal of this research report was to generate a scale to reliably measure perceived agency and use that scale in a predictive, productive manner. To accomplish this goal, we began with a definition of perceived agency. From our definition, we constructed a set of items that were based on each aspect of the definition of perceived agency; this is in contrast to some of the more bottom-up approaches (e.g., [31]). Experiment 1 used a Rasch analysis and showed that the scale items were well fitting and that the overall scale had high reliability across all three facets (entities, items, and raters). Experiment 2 used the scale developed in experiment 1 along with two other scales that have been used to measure perceived agency; experiment 2 showed that the scale developed in experiment 1 better captured empirical data than two other current measures of perceived agency.
The PA scale has been developed and tested on a wide variety of entities: videos of humans (3), videos of robots of dramatically different morphologies (15), and videos of AI characters (3). There were also static images of people (2), animals (2), robots (2), and artwork (2). Note that while the majority of entities were humanoid, we also included non-humanoid animals (dogs), robotic arms, industrial robots, and robots with wildly different humanoid features (wheels, no ears, large eyes, etc.). The successful usage of our scale across this range of entities is encouraging for other entity types.
We should note that these experiments have at least two possible concerns. First, to capture a wide range of morphologies, we used videos instead of in-person interactions or observations. Second, the videos were relatively short (less than 3 minutes), and longer interactions may impact the results. We believe, however, that the strength of our approach will overcome these possible weaknesses.
One benefit of measuring how people perceive agency is that we can examine previous work in HRI with our new understanding. We want to emphasize that the success of the created measures supports our definition of perceived agency. We can also provide insight into one of the most influential pieces of work on perceived agency in HRI: the research of Short et al. [77], who showed that a robot that cheated had more perceived agency than a robot that did not cheat. First, all conditions had "easy" cues of perceived agency: they had goals and could communicate with others, and because they could communicate and move around, they could perform many different types of tasks and could do well in other environments. However, the cheating robot, in order to cheat, needed to treat others as if they had a mind (thoughts), needed to create novel goals (thoughts), needed to want to perform the cheating action (feelings), and could adapt to different situations (losing; environment). These differences are subtle, but note that they also came from each of the three definitional components.
It is our hope that this scale of perceived agency will enable other researchers to accurately measure perceived agency, improve our understanding of how people conceptualize robots' minds, and build robots that have different levels of perceived agency.

APPENDIX A EARLIER DATA COLLECTION
Before experiment 1 occurred, we ran two previous data collections. The method was similar to experiment 1, although there were some small methodological differences (i.e., we had each participant rate 14 videos instead of splitting them up; participants were encouraged to provide feedback on the clarity of the items); here we show the items that were generated and why they were changed or removed and how we came to the items in experiment 1. These two data collections allowed us to go into experiment 1 with high expectations that they would be good items for measuring perceived agency.

A.1 Data Collection A
Data collection A collected responses from 109 online participants and had 23 items (Table 9). To generate a broad item pool, we started with previous definitions and previous studies of perceived agency. The items were based on physical similarity and associated affordances [3,12,27,36,42,57,61,76], emotional expression and recognition [28,29,31,40], self-directed goals [20,35,36,39,75], interaction with others [28,36,40,47,73,77], and having general mental capabilities [31,40,90]. There were two primary concerns with the items after running a Rasch analysis. First, many of the items and raters had especially high outfits, suggesting that the raters had difficulty rating the videos in a consistent manner. Second, Rasch analysis estimates the thresholds at which a response becomes more likely to fall in one category than in the adjacent one; for this analysis, three of the four category thresholds were within .05 logits, suggesting that either the number of categories was too large or raters could not use the items to consistently differentiate the entities in the videos. We interpreted these results as showing that this set of items was too broad to be measured unidimensionally with consistency. We therefore narrowed and crystallized our definition (see Section 2) while removing items that participants found difficult.

A.2 Data Collection B
Data collection B collected responses from 130 online participants and had 14 items (Table 10). The 14 items came from our definition (see Section 2). Data collection B mixed Likert responses with semantic scales, which some participants found confusing. Semantic scales were removed for future experiments and/or converted to Likert responses. Experiment 1 was the result of the updated items.

Fig. 1. Results of all models and empirical ranking. (R) is a robot entity, (C) is a character entity, and (H) is a human entity. Black circles are empirical data with a 95% CI; model fits are ordinal fits based on each individual model.

Table 1. Description of Videos Used

Table 2. Survey Items Used

Table 3. Item βs and Outfits. All β SEs are ≤ 0.04. Original are results with Musician; Revised are results after Musician was removed from the analysis.

Table 4. Entity θs and Outfits. Original are results with Musician; Revised are results after Musician was removed from the analysis.

Table 5. Description of Videos Used in Experiment 2

Table 6. Items and Sources Used in Experiment 2

Table 8. Ordinal Regression Model Summaries