Self-Efficacy and Security Behavior: Results from a Systematic Review of Research Methods

Amidst growing IT security challenges, the psychological underpinnings of security behaviors have received considerable interest, e.g., cybersecurity Self-Efficacy (SE), the belief in one's own ability to enact cybersecurity-related skills. Due to diverging definitions and proposed mechanisms, research methods in this field vary considerably, potentially impeding replicable evidence and meaningful research synthesis. We report a preregistered systematic literature review investigating (a) cybersecurity SE measures, (b) SE's proposed roles, and (c) intervention approaches. We minimized selection bias through detailed exclusion criteria, an interdisciplinary search strategy, and double coding. Among 174 cybersecurity SE studies (2010-2021) from 18 databases with 55,758 subjects, we identified 173 different SE measures with considerable differences in psychometric quality and validity evidence. We found 276 variables assumed to be causes or outcomes of cybersecurity SE and identified 13 intervention designs. This review demonstrates the extent of methodological and conceptual fragmentation in cybersecurity SE research. We offer recommendations to inspire our research community toward standardization.


INTRODUCTION

Cybersecurity Self-Efficacy
Data privacy and usable IT security are increasingly becoming concerns of public interest as a result of (a) the ever-growing presence of IT products in average consumer households, and (b) the amounts and immense value of personal data they process. However, protection of personal data is as much a technical challenge as it is a psychological one [20, 120], which is why usable security research has increasingly focused on improving security-related behaviors of individual users [3, 72, 152, 160]. A key starting point from motivational psychology for influencing IT security behaviors is a person's cybersecurity Self-Efficacy (SE). SE is defined as the belief about one's own ability to enact certain skills [23]. Its high relevance for behavior through motivational, cognitive, emotional, and choice-related processes [23, 26] has put it on the map of Human-Computer Interaction (HCI) researchers [e.g., 35, 129, 141]. As indicated by an evidence review featured in the [59] report, self-efficacy stands as the sole factor on the human side that consistently predicts cybersecurity intention and behavior.
The interdisciplinary nature of cybersecurity SE research has led from a "general scarcity of theoretical models to guide IT researchers" [77, p. 526] to broad construct definitions across scientific disciplines and communities, and in consequence, a blurred terminology of self-efficacy has emerged: privacy self-efficacy [168], coping self-efficacy [149], computer self-efficacy [111], internet self-efficacy [55], self-efficacy in information security [129], security self-efficacy [113], cybersecurity SE [18], self-efficacy to comply with information security policy [34], and more. Hence, there is a risk that the cybersecurity SE literature suffers from the jingle-jangle fallacy [65]: The jingle fallacy refers to the belief that two instruments measure the same construct because they share a similar name, whereas a jangle fallacy would be the similarly incorrect belief that differently named instruments indeed measure distinct constructs [65]. Both fallacies affect the validity of interpretations of an empirical literature [103, 161] as they obfuscate the true coherence (or lack thereof) of (seemingly) related concepts. We argue that before attempting to collate empirical evidence for substantive research questions in this field (e.g., "Does cybersecurity SE predict security behaviors?"), it is essential to understand and evaluate methodological characteristics of that evidence that could increase risk of bias [cf. 123]. This work systematically assesses the methodological heterogeneity in primary studies in the field of cybersecurity SE. Specifically, we examine reported measures of cybersecurity SE, the underlying theoretical assumptions about the role of cybersecurity SE, as well as designed interventions to support cybersecurity SE. This research could be used in future work, for example, to inform subsequent systematic reviews or meta-analyses on empirical evidence. Despite this study being about methodology rather than research outcomes, we use process elements that are common in substantive reviews, such as structured systematic search and study screening [cf. 153].

RELATED WORK

Background and Contribution
Efforts have been made to gain clarity about the concept of cybersecurity SE and to consolidate the rather fragmented literature. He et al. [76], for example, conducted a literature review and found 13 different cybersecurity SE measures with inconsistent terminology, item wording, and construct facet coverage (i.e., instruments were sometimes omitting aspects of cybersecurity SE relevant to measure the construct holistically). Besides the recommendation to provide a clear definition of cybersecurity SE, they advise considering all dimensions of cybersecurity SE when constructing a scale to ultimately achieve consistent operationalization across studies.
There is reason to believe that research in this field has not heeded He et al.'s [76] call toward a more consistent methodology. Publications continue to draw on incoherent cybersecurity SE definitions and measures, both within and across research disciplines, especially since widely discussed security incidents seem to have motivated an inflated use of self-efficacy (and derivatives of it) without prior standardization or coherent theory development. We strive to address the current expansion of the field's knowledge base by assessing the heterogeneity in cybersecurity SE research methods in HCI almost a decade after He et al.'s [76] review was published.
More recent HCI reviews on cybersecurity research usually do not thoroughly consider methodological standardization of SE measurement. Some reviews focus on specific populations, applications, contexts, or venues: Akinrotimi [8] reviewed educational tools for students with n = 2 studies involving SE; Sari et al. [140] investigated healthcare staff or patients and found SE to be the most frequently studied human factor (n = 12); Quayyum et al. [124] conducted a review on children and identified n = 2 studies resorting to SE approaches; AL-Nuaimi [9] reviewed security behavior in organizations and found n = 3 studies investigating the influence of self-efficacy; Chowdhury et al. [42] considered the influence of time pressure and counted n = 4 papers regarding SE; and Rohan et al. [133] conducted a review limiting their search strategy to one human factors conference outlet and located n = 2 studies on SE. Our review covers a variety of interdisciplinary databases while transparently recording framework and sample details to assess research methodology across the entire field of cybersecurity SE studies in HCI.
There are also substantial review contributions that survey the relationship between particular constructs (e.g., Alshammari et al. [12] report the prevalence of studies on SE, n = 5, within protection behavior research on emotions; Reddy and Dietrich [128] focus on the role of SE for security compliance, n = 14, and include a methodological comparison between self-reports and non-self-reports attempting to clarify inconsistent findings). Other reviews in the field of HCI investigate practical implications of SE theories by assessing sources of cybersecurity SE. For example, Zhang-Kennedy and Chiasson [170] looked at intervention tools and found n = 1 study that examined narrative learning materials and their benefit to SE, Jones et al. [90] reviewed design recommendations for warning messages to increase SE with n = 1 study, Coenraad et al. [45] surveyed games to promote SE and focused on the content of the learning materials, and Jeong et al. [88] highlight with n = 1 study the importance of tailored interventions, especially with regard to different SE levels. Further, there are also reviews on SE theories themselves. Sulaiman et al. [154] present the most referred-to theories in cybersecurity compliance research in organisations, some of which involve SE as a determinant, and Maalem Lahcen et al. [109] consider relevant theories specifically for cybersecurity behavior, where SE is mentioned regarding Social Cognitive Theory.
Broad reviews on cybersecurity behavior demonstrate the current research interest in SE's influence: Almansoori et al. [11] find SE to be the most frequent external factor for security behavior, i.e., viewing SE not as part of an original theory, with n = 16 studies; Alsharida et al. [13] identify SE as the most common determinant of security behavior with n = 37 studies, but do not consider methodological aspects of this body of work. Prior to our review, such methodological assessment was accomplished either within meta-reviews (e.g., Khan et al. [93] take note of SE solely in organizational settings (n = 4) but assess review methodology) or in other research fields [e.g., 66, 96, 164].

Review Goals
In this preregistered literature review, we aim to systematically assess the extent of heterogeneity in cybersecurity SE research methods, with a particular focus on: Goal 1, reported self-efficacy measures and their psychometric quality criteria; Goal 2, the role of self-efficacy within its theoretical assumptions; and Goal 3, implemented interventions designed to support cybersecurity SE.
Each goal aligns with a specific research question, detailed in the following sections. In achieving our goals, we hope to raise awareness regarding heterogeneous research practices and aspire to encourage a shift towards greater consistency in measures, theories, and interventions within the realm of cybersecurity SE.

Measuring Cybersecurity Self-Efficacy
A prerequisite for meaningful research synthesis is standardization of empirical procedures and measures, which ensures comparability of studies and adequate inferences from systematic reviews of the evidence [57].
Bandura [25, 27] proposed guidelines for self-efficacy measures. First and foremost, self-efficacy is a domain-specific characteristic, and consequently, items in self-report instruments need to reflect behaviors and experiences specific to an activity domain. For content validity, self-efficacy items need to be formulated with "can do" statements and be distinguishable from other closely related constructs, such as competence [130], hope and optimism [126], locus of control [150], or outcome expectation [77]. The domain specification of self-efficacy beliefs demands prior assessment of the controllable and multicausal behavioral factors required to succeed in the activity of interest. The granularity of self-efficacy, as well as challenges and impediments, such as self-regulatory task demands, differentiate between negligible and highly efficacious beliefs. The item analysis (pretesting, factor analysis, and reliability computation) as well as the validation process (face, discriminant, and predictive validity) are also outlined to establish easy access to standard quality criteria.
Ambiguities of measures jeopardize study comparability and valid inferences from a research literature [65], whereas insights about methodologies notably facilitate the synthesis of related studies and the reconciliation of conflicting outcomes [57]. In achieving goal 1 (assessing measurement heterogeneity), we hope to encourage a shift towards greater consistency in cybersecurity SE measures [cf. 162]. Thus, this review examines information on scale development, scale structure, item wording, reliability as well as validity (content, construct, criterion, and incremental): RQ1: What measures are used to assess cybersecurity self-efficacy? What are their scale characteristics and reported psychometric properties?

The Role of Cybersecurity Self-Efficacy
Inconsistencies of measures can be rooted in the differences between theoretical approaches and understandings of self-efficacy. Research involving cybersecurity SE may attribute different effect pathways to SE based on different theoretical assumptions. In Social Cognitive Theory [22], SE functions as a key construct that predicts human motivation, emotion, and actions more accurately than their actual abilities, knowledge, or skills [23]. Ajzen's [7] Theory of Planned Behavior understands SE as part of perceived behavioral control, which affects behavioral performance jointly with an individual's intentions to perform that action. Self-Determination Theory [50], on the other hand, postulates self-efficacy to be an even more distal factor that influences actions indirectly through its effects on self-determined motivation, and assigns autonomy the more important role in determining behavior [155].
Because SE is a relevant construct in IT security and privacy research, it has been studied from various angles. Reflecting different conceptualizations of self-efficacy itself, other theories frequently used in IT security research also differ in their understanding of SE as a determinant of human behavior. The Health Belief Model [134, 135] sees SE as a direct influence on the probability of preventive behavior, whereas a current revision of the Protection Motivation Theory [118] highlights the role of biases, norms, and a sense of responsibility in decision-making processes that, together with SE, have an effect on protective behavior mediated by intentions, i.e., one's protection motivation [132].
This entails a wide range of factors as causes or outcomes of SE that are of high interest for effectively motivating and predicting behavioral change in users' cybersecurity. In this review, we aim to assess the heterogeneity of cybersecurity SE's role within its theoretical assumptions in empirical research (see goal 2). Understanding on this level is most critical to identify prevailing theoretical assumptions and to assess the current stage of theoretical assertiveness of the nomological network cybersecurity SE is integrated within. Therefore, our second research question is as follows: RQ2: What role does cybersecurity self-efficacy play in the theoretical or research models of empirical research?

Cybersecurity Self-Efficacy Interventions
An inconsistent theoretical understanding of self-efficacy in research models would lead to a multitude of potential directions for the design of interventions to support cybersecurity SE. Bandura [21, 22] outlines SE as a belief which can be changed, and strengthened, by (a) mastery experience, (b) vicarious experience, (c) persuasion, and (d) emotional arousal. Mastery experience revolves around one's own performance accomplishments, induced either by self-modeling, performance desensitization, performance exposure, or self-instructed performance. However, observing other people perform a behavior of interest, e.g., via live or symbolic modeling, can also build another individual's SE vicariously. A rather weaker source of information is persuasion, which can be achieved through social or verbal suggestion, exhortation, interpretive treatments, or self-instruction. And finally, emotional arousal is in part judged as physiological feedback on one's level of stress and anxiety. If highly aroused, people do not expect successful coping and, accordingly, adjust their self-efficacy belief. This source of expectation can be influenced by attribution, relaxation and biofeedback, symbolic desensitization, or symbolic exposure, but it is not always highly reliable. These sources point to potential key aspects of SE interventions, but importantly, it is behavioral information that fosters the necessary belief in one's own abilities [21].
The general difficulty of influencing security behaviors might be rooted in the nature of IT security tasks. Those tasks usually compete with other, more salient and relevant goals for which the utilized systems were implemented, e.g., convenience functions such as email communication or sharing files. This means that for most users, handling security settings is secondary and, above all, challenging [29]. Given these challenges of cybersecurity interventions, we need particularly reliable methods that have consistently proven to support usable IT security.
In this review, our third goal is to assess the heterogeneity of intervention designs that support cybersecurity SE. Even though measures and theories might not be uniform and consistent, this review's intention is not to dismiss promising scientific contributions of interventions; on the contrary, in achieving goal 3, we aspire to encourage a shift towards greater consistency in cybersecurity SE interventions. A more standardized methodological approach to cybersecurity SE interventions is crucial for the construct's validation and instrumental for practitioners. This motivated the third research question of this review: RQ3: Do the studies report interventions that have been carried out to manipulate cybersecurity self-efficacy? If so, what was their approach?
In summary, this review examines measures of cybersecurity SE and reported psychometric quality criteria (see goal 1), cybersecurity SE's role within its theoretical assumptions (see goal 2), and interventions that are designed to support cybersecurity SE (see goal 3). We conducted a systematic literature search that aims to assess the extent of heterogeneity of these methodological aspects.

Preregistration
This review paper is part of an overarching project preregistered on the OSF (OSF registration link: anonymous preregistration) prior to data collection. In addition, we follow the international PROSPERO scheme (OSF file link: PROSPERO scheme) for documentation standards of systematic review protocols for research with human subjects, and report our findings in compliance with the PRISMA guidelines (OSF file link: PRISMA guidelines) for transparent reporting of systematic reviews and meta-analyses.

Selection Criteria
We included any empirical research published in the English language (regardless of where the studies were conducted).

Since the review focuses on cybersecurity SE, we only included studies on the relationship between self-efficacy and IT security or privacy. Studies on self-efficacy in other contexts, or on IT security/privacy without a link to self-efficacy, were excluded. To this end, studies had to specify the (hypothesized) importance of self-efficacy to IT security or privacy to be included. We incorporated both qualitative and quantitative research as long as they included some measure of self-efficacy. Studies with experimental manipulations designed to affect cybersecurity SE were also included.

We included studies published between January 1, 2010 and March 18, 2021. We chose this timeframe to capture studies with modern IT devices and therefore modern cybersecurity.

Literature Search
To account for the interdisciplinarity of the literature, our search strategy covered a total of 18 electronic databases: ACM Digital Library, arXiv, dimensions.ai, EBSCOhost (incl. Academic Search Premier, APA PsycArticles, APA PsycInfo, Historical Abstracts, OpenDissertations, PSYNDEX Literature with PSYNDEX Tests), IEEE Xplore, Science Direct, Scopus, Web of Science (incl. WOS, KJD, MEDLINE, RSCI, SCIELO), and Wiley Online Library. These databases were selected to (a) cover relevant material of the diverse disciplines, (b) include grey literature, i.e., records released outside publishing houses including non-peer-reviewed sources, as well as late-breaking work, and (c) exclusively rely on reproducible engines.
The syntax of our search string used an AND-connector to combine self-efficacy with IT security and privacy terms, whereas OR-connectors separated those somewhat synonymous terms. Our aim was to compile an exhaustive list of search terms covering the omnifaceted spectrum of relevant content, including technical, social, and psychological aspects of cybersecurity SE. Hence, keywords for IT security and privacy were generated in a two-step process. First, field experts were asked to list search terms as relevant as possible for this concept group without generating overly ambiguous keywords. Second, to combat a potentially biased search string, we relied on a quasi-automated method that uses text mining and keyword co-occurrence networks to suggest further IT security and privacy terms. This method is implemented in the R package litsearchr [70] (R version: 4.0.3; package version: 1.0.0). Our two-step approach ensured a thorough and reproducible search strategy. The general keyword string was: "self-efficacy" AND ("cybersecurity" OR "cyber security" OR "information security" OR "IT security" OR "information technology security" OR "IS security" OR "information system security" OR "wireless security" OR "home wireless security" OR "usable security" OR "computer security" OR "data protection" OR "data security" OR "personal data" OR "privacy" OR "security threat" OR "wireless network" OR "device security").
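To make the AND/OR structure of this syntax concrete, the following minimal Python sketch assembles a query in the same pattern. The term lists are excerpts of the full keyword string above, and the function is ours for illustration, not part of the review's actual tooling.

```python
# Minimal sketch: OR-joined synonym groups combined with an AND-connector.
SE_TERMS = ['"self-efficacy"']
SECURITY_PRIVACY_TERMS = [
    '"cybersecurity"', '"cyber security"', '"information security"',
    '"IT security"', '"data protection"', '"privacy"',  # excerpt only
]

def build_query(left_terms, right_terms):
    """Join each synonym group with OR, then combine the groups with AND."""
    return f'({" OR ".join(left_terms)}) AND ({" OR ".join(right_terms)})'

print(build_query(SE_TERMS, SECURITY_PRIVACY_TERMS))
# ("self-efficacy") AND ("cybersecurity" OR "cyber security" OR ...)
```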
For each specific search string that was applied in the respective database, see the preregistration files in our OSF project (OSF file link: search terms). Hits discovered only by matches of our search string in the full text, but not in the combined abstract, title, and keywords, were excluded, as cybersecurity SE was unlikely to be an important variable in the paper. EBSCOhost was the only database that could not be specifically restricted to an abstract, title, and keyword search, but instead matched the search string with a full text search. Hence, we used a custom Python script (OSF file link: search improvement script) which allowed identifying hits that were failing the search string in their combined abstracts, titles, and keywords.
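The actual script is available via the OSF link above. As a rough illustration of the screening logic, a minimal Python sketch could look as follows; the CSV export format and its column names are hypothetical assumptions, and the exact-substring match is a simplification of whatever term handling the original script applies.

```python
import csv

FIELDS = ("title", "abstract", "keywords")  # hypothetical export columns
QUERY_TERM = "self-efficacy"                # simplification: one exact term

def passes_screen(record: dict) -> bool:
    """True if the term occurs in title, abstract, or keywords,
    i.e., the hit is not a full-text-only match."""
    combined = " ".join(record.get(field, "") for field in FIELDS).lower()
    return QUERY_TERM in combined

with open("ebscohost_export.csv", newline="", encoding="utf-8") as fh:
    records = list(csv.DictReader(fh))

false_positives = [r for r in records if not passes_screen(r)]
print(f"{len(false_positives)} of {len(records)} hits fail the "
      "title/abstract/keyword screen")
```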

Study Selection
Data collection occurred March 18, 2021. In total, all database searches identified 1769 records of interest. The chronological sequence of import into Citavi (Build Number: 6.4.0.35) is presented in Table 1. Of 376 EBSCOhost hits, the Python script identified 201 false positives. Removing duplicates yielded a total of 696 records to be screened in the title sift.
Figure 1 illustrates a flow diagram summarizing the study selection process. We conducted three separate sifts to iteratively implement the pre-determined selection criteria: (1) a title, (2) an abstract, and (3) a full text sift. The reasons for exclusion were recorded for each record. During the title sift, 134 records were excluded, leaving 562 records to be screened in the abstract sift. Here, another 276 documents were excluded, leaving 286 records for the full text sift. The third sift excluded an additional 117 documents, and one record was excluded during the coding phase.
Consequently, 168 remaining records were eligible and included in the synthesis. Regarding publication information, the mode of publication year was 2020. The final sample split into 107 journal articles, 39 conference papers, 17 dissertations, 3 conference proceedings, 1 book chapter, and 1 report. Of these publications, 131 (77.98%) were peer-reviewed, 19 (11.31%) were not peer-reviewed, and another 18 (10.71%) publications did not provide information about the review process. See Appendix A for split analyses of differences between peer-reviewed and non-peer-reviewed records. In the context of the sample description, these publications stem from various cultural backgrounds and represent 30 countries (at least 44.81% from the USA, followed by 7.65% from Malaysia). Their combined sample size totals 55,758 study participants (53,586 when accounting for assumed sample duplicates) with a mean sample size of M = 313.25 (SD = 250.61, Mdn = 249), ranging between 4 and 1663 study participants. The median age weighted by sample size was Mdn = 32.98. For n = 47,109 participants, we calculated a gender distribution weighted by sample size of 50.05% male, 49.49% female, 0.02% non-binary, and 0.21% no response. 30.05% of the studies reportedly used student samples. Samples were primarily recruited within organizations (51.37%), 22.95% via online panels, 13.66% were recruited ad hoc, 6.01% had mixed recruitment strategies, and 6.01% did not describe recruitment. Concerning study information, the publications include 174 studies (142 surveys, 16 experiments, 6 quasi-experiments, and 10 studies of other types), 55.17% of which were conducted online, 25.29% were physical studies, 7.47% had mixed settings, under one percent were conducted via phone, and 11.49% did not report the study setting.

[Table 1 note: [1] including 72 exact duplicates removed by EBSCOhost; [2] prior to custom Python script implementation.]
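For orientation, the sample-size weighting used in these aggregates can be sketched as follows; the study entries are made up, and the original analyses were conducted in R per the OSF materials, so this Python fragment is purely illustrative.

```python
# Hypothetical per-study summaries; "n" is the study's sample size.
studies = [
    {"n": 250, "median_age": 29.0, "prop_female": 0.48},
    {"n": 800, "median_age": 35.0, "prop_female": 0.51},
    {"n": 120, "median_age": 24.0, "prop_female": 0.45},
]
total_n = sum(s["n"] for s in studies)

def weighted_mean(key: str) -> float:
    """Sample-size-weighted mean of a per-study statistic."""
    return sum(s["n"] * s[key] for s in studies) / total_n

print(f"weighted age estimate: {weighted_mean('median_age'):.2f}")
print(f"weighted share female: {weighted_mean('prop_female'):.2%}")
```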
During the full paper sift and coding, 11 cases were excluded after group discussion: (a) multi-item self-efficacy scales containing only a single item or few items related to cybersecurity [71, 82, 83, 163] were not considered further because they are more indicative of confounding cybersecurity elements within a different construct in focus rather than fulfilling the requirements to be classified as a measure of cybersecurity SE, and (b) we excluded studies investigating the cybersecurity SE of practitioners [16, 28, 79, 107, 144, 146] or hackers [108], rather than end users aiming to accomplish IT security or privacy. The group discussion was initiated by coders identifying some of these borderline cases, where the primary diagnostic aim and target population deviated from our specific latent variable and review scope. After identification, each record was marked as such and discussed with all three coders to determine whether the inclusion criteria were met. As a consequence of the first instances, the inclusion criteria were formulated more exclusively concerning the diagnostic aim (operationalization as the latent variable cybersecurity SE) and target population (users). Coders were required to reach unanimous agreement on each subsequent borderline exclusion.

Coding Process
Three coders performed the study selection and data extraction. The reviewers were trained with n = 10 studies from 2009 (which were excluded a priori) each round until inter-rater agreement [87] reached a satisfactory level (ι > 0.6). This was accomplished after the first round. Two iota coefficient indices were calculated using the R package irr [68] (R version: 4.0.3; package version: 0.84.1), one for nominal and one for continuous variables, whereby one key variable for each main query was assessed to determine the level of agreement for the training data sets: sample size, reliability alpha, outcome variable, and intervention. The training yielded excellent agreement coefficients; for nominal data ι = 1 and for interval data ι = 1.
Prior to the sifts, study IDs were randomized using randomizer.org and split into three blocks of about 232 publications each. Each reviewer was randomly assigned two of the three blocks, such that each study was coded twice. Moreover, we re-randomized all remaining studies after the full text sift, i.e., before data was extracted, and followed the same procedure as in the first randomization (each reviewer coded a random two-thirds of included records). This ensured equal contributions and thus acted as a countermeasure to potential biases caused by varying quantities of coding results by one of the reviewers. Reviewers were blinded to each other's decisions, and in cases of disagreement, decisions were discussed and resolved by all three coders using a majority vote.
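A minimal Python sketch of this assignment scheme (with a fixed seed purely for illustration; the review used randomizer.org) shows how assigning each coder two of three blocks guarantees that every record is coded by exactly two reviewers.

```python
import random

random.seed(42)  # illustration only; the review relied on randomizer.org

study_ids = list(range(1, 697))  # the 696 records entering the title sift
random.shuffle(study_ids)

# Split into three blocks of roughly equal size.
k = len(study_ids) // 3
blocks = [study_ids[:k], study_ids[k:2 * k], study_ids[2 * k:]]

# Each reviewer takes two blocks; every block occurs in exactly two
# assignments, so every study is double-coded.
assignment = {
    "coder_A": blocks[0] + blocks[1],
    "coder_B": blocks[1] + blocks[2],
    "coder_C": blocks[2] + blocks[0],
}
for coder, ids in assignment.items():
    print(coder, len(ids))  # about two-thirds of all records each
```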

Data Evaluation
All coded study characteristics were defined in codebooks. Coders considered all sections of publications to evaluate data. For our coding scheme and a detailed description of each variable, see the preregistration (OSF file link: codebook). The codebook included a-priori defined variables and value attributions to qualitative data or the specific format of extraction. Only a few variables were subject to open coding or paraphrasing (e.g., interventions or scale changes). Any additional steps for codes can be accessed in the coded data file (OSF file link: data; please see worksheet "additional steps" for further coding schemes). The subsections of the results section were a-priori determined based on the preregistered codes. Our reporting approach of the results is predominantly inductive (e.g., number of measures, referenced publications, practices regarding statistical tests, number of SE related variables, and intervention characteristics) with some deductive elements that were more theoretically informed (e.g., mapping of cause variables according to Bandura [21, 22]). We used a mixed-methods form of synthesis. For a qualitative approach to qualitative data, we relied on strategies from thematic synthesis (e.g., intervention types and methods) and framework synthesis (e.g., categorization of cause and outcome variables according to a-priori theories). For a quantitative approach to qualitative data, we incorporated content analyses (e.g., to quantify the occurrence of cause and outcome variables) and a network analysis (to compute the centrality of original scale authors). For a quantitative approach to quantitative data, we utilized numerical presentations (e.g., descriptive statistics of sample size or reliability coefficients). Variables were grouped into six categories: (a) publication information, (b) study information, (c) sample description, (d) scale characteristics, (e) scale psychometrics, and (f) SE research model and intervention.
The first category concerning publication information surveyed the type, authors, year, title of the publication, and whether it was peer-reviewed. Additional title fields as well as the number of citations were also recorded. For study information, exclusion from synthesis, sample of multi-study papers, the type and setting of the study, and which technology the scale refers to were coded. Furthermore, we assessed the sample size, age and gender distribution, specific professions, country of origin, and recruitment strategy.
Scale characteristics involved variables for the origin of the scale, including its development, original authors, and changes made to the scale. We noted the scale's name and language, its number of items, factors, and facets, and whether a definition for the construct was provided. If reported, we also extracted the wording of items. For all validation studies, we coded whether the definition fits the items and the type of item generation strategy. The category of scale psychometrics gathered information on different reliability coefficients and validity types. Both are central quality criteria for the eligibility of test instruments. While reliability is determined by the level of measurement errors, validity is assumed when a scale captures what it is intended to measure with sufficient accuracy [56]. Reliability variables were split to capture reported internal consistency (Cronbach's alpha and composite reliability), as well as test-retest and split-half reliability. For validity evidence, we looked at (a) representative inference, which refers to the content validity of an instrument and the consultation of experts, (b) theory-based interpretation, which refers to construct validity (including factor, discriminant, and convergent validity), and (c) criterion-related inference, which involves both criterion validity with its three types (retrospective, concurrent, and predictive validity) and incremental validity. Variables that reflected the research models were coded for measured cause and outcome constructs. We also recorded whether interventions were used that were designed to explicitly influence self-efficacy, and, if applicable, described the intervention and noted its replicability. In any cases of unreported data, authors were not contacted. Iota coefficient indices were calculated (as with the training data) to assess the overall inter-rater agreement of our coding for multivariate observations using the R package irr [68] (R version: 4.1.3; package version: 0.84.1). We reached satisfactory values for both agreement coefficients; for nominal data ι = .722 and for interval data ι = .982.
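The review reports an iota coefficient computed with the irr R package. As a simpler stand-in that conveys the same chance-corrected agreement idea for nominal codes, the following Python sketch computes Cohen's kappa for two raters; it is not the coefficient used in the review.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on nominal codes."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)

# Toy example: two coders classifying the same five study designs.
print(cohens_kappa(
    ["survey", "experiment", "survey", "survey", "other"],
    ["survey", "experiment", "survey", "other", "other"],
))  # ~0.69
```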

Risk of Bias
We minimized selection bias by defining clear selection criteria before data collection occurred; we covered a multitude of databases from different research disciplines, made sure to have a holistic and replicable search strategy, randomized studies for the selection process, had two blinded coders for each record, and catalogued the reasons for attrition. Still, inaccessible papers were dropped during the selection sifts. Language bias is plausible as we only included studies published in English. To limit publication bias effects, we incorporated grey literature search results. We used no study quality bias criteria because this review addresses methodological quality (e.g., reliability and validity) rather than substantive questions about the outcomes of the included studies.

RESULTS
All coded data and R scripts can be accessed and downloaded from the OSF (OSF links: data and code) or a GitHub repository (GitHub link removed for anonymization). The subsequent sections first highlight results regarding scale characteristics, followed by psychometric data, and finally, SE models and interventions.

Current Measures of Cybersecurity Self-Efficacy

On average, scales consisted of M(n = 150) = 5.41 items (SD = 6.03, Mdn = 4, range = 1 to 54). We also assessed the latent variable structure via reported factors, i.e., structure based on mathematical information similarity between items, and facets, i.e., structure based on qualitative content similarity between items. Of the scales that featured a factor structure (n = 13), ten reported they consisted of one factor, two of two factors, and one of four. Where facets were reported (n = 8), they generally concerned specific security behaviors, types of threats, emotions, knowledge, or self-efficacy sources. Hence, authors often did not disclose the structure they assumed for SE, possibly treating SE as a single construct without explicit acknowledgment. We found a total of 8 measures that reported multiple sub-constructs, whether mathematical or qualitative. 133 measures offered specific definitions or explanations of the construct, which were not necessarily foundational to the scale development process. A comprehensive summary of the criteria reported with respect to the methodological rigor of the scales can be found in Tables B1 and B2 in Appendix B.
The measures were often used without consistent specification of the technological context. We found, e.g., 136 publications involving non-specified technological contexts, 54 publications comprising computers, and 48 publications that involved general IT at workplaces. In contrast, only one publication studied the smart home as a technological context (see supplementary materials for smart home results, file link: smart home results). The measures were published in 11 different languages, with English being the most common language at 57.23% and Chinese the second most common at 5.20%.

Citation network of cybersecurity SE measures.
To identify potentially underlying similarities between scales, we investigated stated references (original authors) for item compositions with the help of a network analysis. Figure 3 shows the directed network graph of who cites whom. Our author network comprised the authors that were citing or were being cited. The most central nodes according to node strength (the sum of inward edge weights of a node) were both Bulgurcu et al. [34] and Ng et al. [117], which were equally central with an InDegree centrality of IDC = 10. This centrality can be interpreted as the most referenced (10 citations each within this review) author groups as sources for item composition. An additional centrality measure for networks is betweenness (the frequency of a node in shortest paths between other nodes), indicating nodes that bridge information between fields and thus determining interdisciplinarily used publications. With a betweenness centrality of BC = 18, Crossler [48] was the most central node.

To explore network clustering according to the small world principle (high clustering and low average path length), which is repeatedly found in natural graphs, we calculated the small world index. Our author network did not exhibit a small world structure, the index being SW = −2254.43. These results of the three centrality measures reveal very little underlying similarity, as we found neither an established common literature source for scale developments nor a network of references in which most scales are linked by only a few contiguous publications.

First use cybersecurity SE measures.
First use measures, i.e., measures newly created or adapted for studies without being used before and without previous validity evidence, account for the overwhelming majority of cybersecurity SE assessments, making up 161 out of 173 measures. Readers seeking to further differentiate between newly created and adapted measures are encouraged to review our split analysis of key variables in Appendix C for more details. In the split analysis, we classified measures as newly created if they lacked reported original authors for their item compositions. Among the 161 measures, n = 118 (73.29%) were described as a modified version of a previous scale, n = 32 (19.88%) were developed as part of the empirical work, n = 2 (1.24%) were translations, and only n = 2 (1.24%) were equivalent to validating test developments. Another n = 7 (4.35%) did not report information on their development at all. Ad-hoc modifications, where reported, include changes to the wording (58 papers), number of items (11), translation (11), or general modifications (5). The two validation studies developed their items via deductive approaches and provided conceptual definitions of cybersecurity SE that were, in our understanding, not in full accordance with the respective operationalization.

Reliability of cybersecurity SE measures.
Proceeding to the scales' psychometrics, Table 2 provides an overview of reported reliability information. Reliability estimates of first use scales indicated good coefficients when reported; weighted by sample size, mean coefficient alpha was α(n = 102) = .868 (SD = 0.062) and composite reliability was CR(n = 76) = .903 (SD = 0.055) (unweighted α(n = 102) = .862 (SD = 0.073) and CR(n = 76) = .897 (SD = 0.058)). Analyses of split-half or test-retest reliability were not reported. Scales with repeated use (recurring from prior publications n = 12 and recurring within the review n = 5), which we refer to as recurring scales, were similar to first use scales; mean coefficient alpha weighted by sample size was α(n = 10) = .871 (SD = 0.054) (unweighted α(n = 10) = .877, SD = 0.077) and mean composite reliability weighted by sample size was CR(n = 5) = .895 (SD = 0.055) (unweighted CR(n = 5) = .902, SD = 0.072).

[Table 2 note: [1] Measures that were used more than once (n = 5) also count towards the first use measures column. The total number of measures is n = 173; [2] one study may report multiple reliability estimates.]

Validity of cybersecurity SE measures.
To provide evidence for discriminant and convergent construct validity, studies drew on an immense variety of constructs. In total, 330 different constructs were used with the intention to validate cybersecurity SE scales. 13 constructs were used exclusively to discriminate from, and another 19 exclusively to converge with, cybersecurity SE. However, 298 were conceptualized as both discriminant and concurrently convergent across studies. The most frequent validation constructs in total counts were: perceived severity (88 models: 45 discriminant, 43 convergent), response efficacy (70 models: 36 discriminant, 34 convergent), perceived vulnerability (66 models: 33 discriminant, 33 convergent), response cost (32 models: 16 discriminant, 16 convergent), subjective norms (28 models: 14 discriminant, 14 convergent), and perceived susceptibility (22 models: 12 discriminant, 10 convergent).

Cybersecurity SE as Cause and Outcome
Evaluating research models, frames, and hypotheses, we found that 157 studies (90.23%) treated cybersecurity SE as a cause of another variable (such as security behavior), whereas 67 studies (38.51%) treated it as an outcome of other processes (e.g., awareness). Given that some research models conceptualized cybersecurity SE as a moderator, with an unclear causal positioning of self-efficacy, outcome variables of moderations were coded as outcomes of cybersecurity SE, even though the path diagram might have been more complex. We identified 173 unique outcome constructs (influenced by SE) and 103 cause constructs (influencing SE). Of these variables, 12 constructs were reported as both cause and outcome of SE across studies. We consolidated strongly related or nearly identical constructs with different spellings, e.g., (a) information computer security behavior and desktop security behavior were both synthesized as security behavior, (b) intention to comply with privacy policy and security compliance intention were both synthesized as compliance intention, or (c) awareness of information security policies and information security awareness were both synthesized as awareness. Two coders were tasked with identifying similarities in these variables, and when uncertain, they independently evaluated the underlying theoretical conceptualizations in the original publications. This process enhanced consistency across models and yielded 55 distinct outcome constructs, 51 cause constructs, and 19 outcome-and-cause constructs. Appendix D includes a list of these constructs and the frequency with which they were examined in studies. The most frequent outcome constructs were security behavior (25 studies), compliance intention (19 studies), and security intention (17 studies). For causes of cybersecurity SE, the most frequent variables were awareness (10 studies), expertise (7 studies), and gender (7 studies). The most frequent outcome-and-cause constructs were awareness (13 studies), concerns (13 studies), and expertise (10 studies).
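Conceptually, this consolidation step amounts to a synonym mapping, as the Python sketch below illustrates using the examples named above. The mapping table is a hypothetical excerpt; the actual consolidation was a discussion-based coding process, not a simple lookup.

```python
# Hypothetical excerpt of the construct-label consolidation map.
SYNONYMS = {
    "information computer security behavior": "security behavior",
    "desktop security behavior": "security behavior",
    "intention to comply with privacy policy": "compliance intention",
    "security compliance intention": "compliance intention",
    "awareness of information security policies": "awareness",
    "information security awareness": "awareness",
}

def consolidate(label: str) -> str:
    """Map a reported construct label to its consolidated form."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)

raw = ["Desktop security behavior", "Information security awareness", "response cost"]
print([consolidate(label) for label in raw])
# ['security behavior', 'awareness', 'response cost']
```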
We further recoded these constructs to reflect originating theories or meta-levels of interest (see Figure 4). Since the reported theoretical perspective may be inconsistent within a publication (varying originating theories for definition, frameworks, empirical claims, and measures), we based our coding on measured variables. Coders inspected all sections of a publication to extract this information (including introduction, related work, or hypothesis sections). It is important to point out that our conclusions were not grounded in the reported results or the evidence level of variables; instead, our focus was on capturing the a-priori adopted theoretical assumptions.
Theories most prominently differ in the assumed proximity of cybersecurity SE's impact on behavior, i.e., whether its effect is direct or indirect through intention or motivation, and hence can serve to group outcome variables respectively. Here, behavior comprises both observed and self-reported behaviors. Frequencies for outcome variables shown in Figure 4 reveal that no single theory seems to dominate the current literature on cybersecurity SE. However, motivation was not a process that was frequently hypothesized as an outcome, cf. Self-Determination Theory. Other outcome processes of cybersecurity SE included non-behavioral cognitive variables, such as concerns, awareness, or coping appraisal. The broad range of non-behavioral cognitive outcomes underlines the diffuse role of cybersecurity SE in its nomological network, as it is similarly posited by Social Cognitive Theory.
Identified causal factors of cybersecurity SE were categorized to fit the four theoretically established sources of self-efficacy [cf. 21, 22]: mastery experience, (verbal) persuasion, emotional arousal, and vicarious experience (see Figure 4). Much research on cybersecurity SE did not conform to this foundational taxonomy, and due to unclear theoretical rationales within included publications, the categorization was often unclear (50 out of 67 studies). Still, vicarious experience seems to be rather understudied in comparison to mastery experience and persuasion. The potential impact it is believed to have on cybersecurity SE could be utilized through approaches such as group interventions; investigating this aspect with a focus on its scalability and effectiveness would be intriguing. Research interest in emotional arousal as a source of SE seems also limited, which could be attributed to its posited unreliable nature [21, 22]. Among the studied cause variables were additional sources of self-efficacy of a cognitive (33 incidences) and socio-demographic (14 incidences) nature (e.g., knowledge, awareness, or age). These additional cause variables can be taken as an opportunity to systematically study an expansion of the theoretical assumptions of Social Cognitive Theory. This also applies to reciprocal variables identified in this review. Bidirectional cause-and-effect pathways emerge as a relatively frequent phenomenon.

Current Interventions to Manipulate Cybersecurity SE
Only 13 out of 174 studies (7.47%) included a manipulation of cybersecurity SE (see Table 4). Generally, implemented interventions were designed to increase rather than decrease cybersecurity SE. These interventions included instructional components, learning materials, cybersecurity activities, and salience or awareness strategies. In this review, we consider interventions involving activities to draw on mastery experience as a major source of self-efficacy. However, this is our deduction, as even in publications that included interventions, adherence to the foundational taxonomy provided by Bandura [21, 22] regarding the four SE sources was infrequently observed. Explanations of the underlying mechanisms by which specific intervention designs are expected to influence cybersecurity SE were also rarely provided. The interventions were evaluated by experimental or quasi-experimental designs with sample sizes ranging between n = 19 and 442 participants. Close to half of the interventions (n = 6) targeted students, and interventions were primarily conducted with US American samples (n = 11). We found no replications of any intervention study. Regarding replicability, 2 out of 13 interventions [33, 169] provided the complete stimulus materials. Given that our focus is on reviewing research practices and methods, we did not further examine the findings or outcomes of the interventions, and due to the scarcity of replication studies, we argue that the effectiveness of the identified interventions remains speculative.

Table 4. Interventions to manipulate cybersecurity SE (intervention type; method; design; sample size):
… [2]: training; instructional control elements; experiment; n = 197
Amo [15]: training; cyber security related activity; quasi-experiment; n = 34
Arachchilage [19]: game; cyber security related activity; quasi-experiment; n = 20
Booth [33]: exposure to messages; cyber privacy risk awareness; experiment; n = 201
Chen et al. [37]: game; cyber security related activity; experiment; n = 178
Clark [43]: awareness campaign; compliance communication; quasi-experiment; n = 246
He et al. [75]: training; text and video; experiment; n = 119
Mamonov and Koufaris [110]: exposure to messages; government surveillance news; experiment; n = 442
McGill et al. [112]: course; cyber security related activity and career awareness; quasi-experiment; n = 19
Mwagwabi et al. [116]: training; fear appeals; experiment; n = 210
Smith et al. [151]: training; in-house and third-party video; quasi-experiment; n = 204
Zarouali et al. [169]: exposure to messages; privacy control salience; experiment; n = 178

Implications for Cybersecurity SE Research
By assessing methodological practices of cybersecurity SE research conducted during the last decade, this systematic literature review provides meta-scientific evidence on heterogeneity, based on 168 publications concerning: (1) reported self-efficacy measures and their psychometric quality criteria, (2) the role of self-efficacy within its theoretical assumptions, and (3) implemented interventions designed to support cybersecurity self-efficacy. Regarding RQ1, we found 173 different cybersecurity SE measures, mostly used in just a single study. This implies that the issues of measurement inconsistency identified by He et al. [76] remain relevant, and have even intensified, given the increasing number of measures being published (see Figure 2). He et al. [76] also found scales that blend technology-focused items with more general ones, a practice that might lead to user confusion. This trend continues to be common [e.g., 10, 15, 40, 62, 94]. However, the research community has recently addressed dimensions of mobile and social media security [e.g., 5, 31, 54, 158], which were previously underrepresented, as highlighted by He et al. [76]. The scales show good reliability coefficients on average [for a discussion on coefficient alpha and composite reliability see 121], but critically neglected validity evidence. Unfortunately, most studies do not meet the available guidelines on how we ought to consolidate cybersecurity SE scales [76] and demonstrate validity [25, 27]. Although reliability is unquestionably important, it is validity evidence that grants meaningfulness to research findings. Systematically lacking validity evidence is a substantial threat to the usefulness of a research literature [57, 64]. Validated scales will improve resource allocation (time and effort invested in developing ad-hoc measures), research consistency (the ability to compare and combine data from studies on usable security), and quality control as the field converges on a measurement standard [105]. Otherwise, there is a risk of unreliable conclusions and an incoherent evidence base.
As there is no consensus on the operationalization of cybersecurity SE, the same is to be said for its theoretical understanding. He et al. [76] found that the definitions of SE are inconsistent among authors. Remarkably, the authors cited in He et al.'s [76] review are still frequently referenced for scale development [see Figure 3, references 85, 117, 129]. This suggests that differing theoretical assumptions continue to be a fundamental aspect of the field. Nonetheless, we found that (a) a vast majority of publications provided definitions or construct clarifications, and (b) in several instances specific scale names were differentiated according to the context of SE [e.g., 31, 41, 49, 171], as suggested by He et al. [76]. As for RQ2, we observed a critical quantity of distinguishable frameworks, amounting to at least 55 outcome, 51 cause, and 19 outcome-and-cause variables of cybersecurity SE. References to self-efficacy theories are particularly evident for outcomes, with no theory clearly dominating the literature. He et al.'s [76] findings already hinted at the prominent role of SE in influencing a variety of dependent variables. Regarding the sources of self-efficacy, the literature has a limited fit with established frameworks, but we found two additional important research foci among causes of cybersecurity SE: cognitive and socio-demographic variables. This general scarcity of reporting specific and consistent theoretical underpinnings is also salient in other self-report measures published in HCI outlets [6]. As the field advances, achieving a unified understanding of the literature should be an important objective [60].
This fragmented picture persists for RQ3, which addresses cybersecurity SE interventions. Not one of the 13 studies with interventions was replicated. Conclusions about the general effectiveness of the interventions' methods are therefore speculative at best. Interventions rarely derived their methods explicitly from theoretically established SE sources; still, several interventions relied on cybersecurity activities, implying the relevance of mastery experience. These findings extend He et al.'s [76] observations about confusion surrounding the impact of SE. We advocate not to neglect practical implications that can be carefully deduced from SE theory, nor to dismiss the opportunity to provide detailed reasoning for specific decisions about intervention methods. Otherwise, researchers are limited in their exploration of methodological evidence, and it remains uncertain, also to any practitioner interested in implementing SE interventions, how robust the effect mechanisms are that influence security behaviors [cf. 122]. An unclear onus of proof may reduce trust in an intervention's capacity to resolve the underlying issue. It may be that authors shy away from replications because (a) original studies lack detailed information, (b) translation processes might be necessary as the field is rather diverse (30 countries in this review alone), and (c) replications are much harder to publish than original works [cf. 84, 131].

Recommendations
Researchers draw on a wide range of measures and interventions, though this choice is not consistently based on best-performing quality criteria, which further emphasizes the need for cybersecurity SE validation and replication research. Valid conclusions about genuine effects and effective interventions to increase cybersecurity SE are only possible to the degree of the primary studies' quality level [106, 131]. This leads us to the following three recommendation sections based on our research questions: (1) measures, (2) theoretical assumptions, and (3) interventions.

Measures.
Transparency of measures should become a habit: researchers should always provide a scale manual in the supplementary material section of a publication. Manuals need to include at least instructions, items, their origin, the response scale, and scoring strategies. We recommend including thorough assessments of psychometric quality criteria as well [cf. 51]. Researchers seeking to create transparent manuals may find Aeschbach et al.'s [6] prescriptive model for the measurement selection process beneficial. This increases the scales' reusability, warrants criteria-based decision-making about whether and how to include an instrument, and allows access to the information necessary for reproducibility and replicability. For notable examples of transparent reporting and consistent measure usage, we recommend the User Experience Questionnaire (UEQ) [100, 143] or Raven's Progressive Matrices (RPM) [127, 166] to readers. Although this might appear trivial, it was our experience that the current state of reporting did not allow us to, e.g., differentiate indisputably between newly created and adapted measures. Publications may cite another, original paper and thus be categorized as using an adapted measure, but (a) make substantial changes to domain, item wording, or number of items [e.g., 30, 97, 148], (b) cite multiple original authors [e.g., 40, 85, 136], which could as well indicate a common and even recommended strategy in the literature to construct new scales, or (c) report no items [e.g., 4, 14, 32], making the level of adaptation or novelty ambiguous to the reader. Other publications may not report original authors at all [e.g., 69, 73, 92]. To optimize transparency, we suggest even including an item-level change log in the manual.
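As an illustration of what such a manual could minimally contain, the following machine-readable entry covers the recommended fields; every value is a hypothetical placeholder rather than an endorsed scale.

```python
# Hypothetical scale manual entry with the minimum recommended fields.
scale_manual = {
    "name": "Example Cybersecurity SE Scale",
    "origin": "adapted from <original authors, year>",
    "instructions": "Rate your confidence that you can do each of the following.",
    "items": [  # "can do" phrasing, per Bandura's guidelines discussed above
        "I can recognize a phishing email.",
        "I can configure the privacy settings of my accounts.",
    ],
    "response_scale": "5-point Likert (1 = not at all confident, 5 = fully confident)",
    "scoring": "unweighted mean across items; no reverse-coded items",
    "change_log": [
        {"item": 1, "change": "reworded from <original wording> for the email domain"},
    ],
    "psychometrics": {"cronbach_alpha": None, "validity_evidence": "see validation study"},
}
```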
We also urge researchers not to modify or develop scales within the same empirical work from which they draw substantive inferences. When researchers find it necessary to change existing measures or create new ones, they should first conduct a study to evaluate the psychometric quality of these measures [58]. Subsequently, a separate study should be undertaken with a new sample to investigate the relevant research question using the then-validated measures. In other words, mingling the interpretation of a substantive effect on a variable (e.g., whether an intervention increases SE) and the suitability of its operationalization (e.g., whether a scale actually measures SE) within the same study ought to be avoided. Best practices and guidance for scale construction processes are widely available [e.g., 25, 67, 115, 142]. A concerted dedication of the research community's effort and time to constructing a reliable and valid cybersecurity SE scale will facilitate its impact and applications. In an attempt to reduce measure heterogeneity, an exemplary scale that one could build upon, if construct specificity and contemporary security issues were to be addressed, is the Self-Efficacy in Information Security scale by Rhee et al. [129]. We further recommend that those preceding validation studies adopt a more advanced psychometric perspective, specifically item response theory (IRT), which encompasses more appropriate measurement models for more detailed evaluations of item qualities [cf. 67]. For instance, IRT allows researchers to examine option characteristic curves for each item, as well as item or test information functions (see Choi and Asilkalkan [39] for a helpful guide).

Theoretical assumptions.
Measurement standards will then set the foundation for theory and model comparisons (and elimination, cf. [138, 139]). Based on our understanding of scientific progress, we strongly recommend striving for parsimony and falsification of SE theories across scientific disciplines. An original theory of self-efficacy that otherwise meets the criteria of theory evaluation, such as consistency and testability [63], should first be trialed for its adequacy. Our experience showed that the varied interpretations of SE's role stem not only from differing theoretical assumptions but also, more significantly, from deviations from the respective original theory (see the extent of the categories labelled "other" in Figure 4). However, the proposition of new and more complex models is only reasonable when it significantly enriches a theory's explanatory power, and it should always be comprehensively justified. The results presented in Figure 4 might, though, imply such a reasonable revision, suggesting a diverse understanding of both outcome and cause variables, potentially considering them as reciprocal.
Theoretical assumptions.
Measurement standards will then set the foundation for theory and model comparisons (and elimination; cf. 138, 139). Based on our understanding of scientific progress, we strongly recommend striving for parsimony and falsification of SE theories across scientific disciplines. An original theory of self-efficacy that otherwise meets the criteria of theory evaluation, such as consistency and testability [63], should first be trialed for its adequacy. Our experience showed that the varied interpretations of SE's role stem not only from differing theoretical assumptions but also, more significantly, from deviations from the respective original theory (see the extent of the categories labelled "other" in Figure 4). The proposition of new and more complex models is only reasonable when it significantly enriches a theory's explanatory power, and it should always be comprehensively justified. The results presented in Figure 4 might, however, imply that such a revision is reasonable, suggesting a diverse understanding of both outcome and cause variables, potentially considering them as reciprocal.
In particular, we caution against mixed theory referencing across the introduction of the SE construct and its measurement. Citing, e.g., Bandura's works [21, 22] or related sources for the definition of SE, and then applying SE scales based on users' perceived cybersecurity knowledge [see 73, 165], can cause ambiguity, since Bandura [24] objected to conflating knowledge with self-efficacy. An unclear or divided theoretical understanding can jeopardize cross-disciplinary collaborations due to the lack of a common language (while expertise from multiple disciplines is required for many HCI research questions), and it can impede the field's scientific progress by delaying the discovery of relevant patterns (given that solid foundations are critical for valid research designs) [125]. Hence, we recommend the following: (a) consistently adhere to an adequately tested theory and revise it with prudence across publications; (b) maintain the assumptions of that framework within each publication; and (c) apply this consistency to the cause and outcome variables of SE as well. The consequences of not following the latter point are evident in the heterogeneity of the terminologies found in our results, although the full extent is uncertain due to potential redundancy. For researchers interested in illustrating commonalities among psychological constructs, we refer to Hodson [80]. We specifically encourage authors to contribute to more robust categories, i.e., to minimize the "other" categories as presented in Figure 4.

Interventions.
In contrast to the substantial number of measures and related constructs, we discovered only a limited number of interventions designed to support cybersecurity SE. We recommend that more theory-driven paradigms for interventions be developed. Based on our observations, presumed pathways for influencing cybersecurity SE were rarely reported explicitly. Researchers or practitioners interested in developing conceptually grounded interventions might find the publications by Bandura [21, 23, 24], in combination with the introductory book by Cooper et al. [47], a useful foundation. As a proposition for future work, we suggest evaluating newly designed interventions regarding: (a) the level and sustainability of their effectiveness, (b) their generality across specific samples and situations, and (c) economic application factors, e.g., through the dose-response relationship [cf. 74]. It will be decisive to see whether interventions affect individuals uniformly or whether they interact with specific personological or situational factors.
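As one way to read point (c), a dose-response analysis asks how much intervention exposure buys how much SE gain. The sketch below fits a saturating Emax curve to entirely hypothetical data; the model choice, the dose unit (training hours), and all numbers are illustrative assumptions rather than results from any reviewed intervention.

```python
import numpy as np
from scipy.optimize import curve_fit

def emax_model(dose, e0, e_max, ed50):
    """Saturating Emax dose-response curve: baseline plus diminishing gains."""
    return e0 + e_max * dose / (ed50 + dose)

# Hypothetical data: training dose in hours vs. mean gain in SE scores.
dose = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0])
se_gain = np.array([0.02, 0.18, 0.30, 0.42, 0.55, 0.60])

params, _ = curve_fit(emax_model, dose, se_gain, p0=[0.0, 0.6, 1.0])
e0, e_max, ed50 = params
print(f"half-maximal effect at {ed50:.2f} hours; ceiling gain {e0 + e_max:.2f}")
```

Estimates such as the half-maximal dose make the economic question explicit: beyond which exposure does additional training yield negligible SE gains?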
In order to obtain robust empirical evidence from these interventions, we recommend replicating them. None of the currently published interventions (see Table 4) has been replicated. The call for replication research is imperative, as other large-scale replication projects have demonstrated the uncertainty of original empirical evidence in the social sciences [119]. To what extent the same is true for cybersecurity SE research is difficult to estimate given the current research practices in this domain. Overall, the HCI community has made progress in transparent reporting for better replicability. However, the sharing of data and artifacts, essential for replicating interventions, is still relatively limited [137]. In our review, only 2 out of 13 interventions offered complete stimulus materials. Even if authors may feel confident about replicability based on the information they provide [159], we suggest that both authors and reviewers adopt reporting screening systems, such as the one proposed by Salehzadeh Niksirat et al. [137], to increase access to intervention materials. In other words, we recommend transparency for designed manipulations by sharing all instructions and materials involved [see 81]. This can be achieved via permanent links to public repositories, e.g., OSF.io or PsychArchives.org.
Our recommendations might extend beyond the realm of cybersecurity SE research, as they speak to a broader issue encountered in various disciplines involving psychological measures [cf. 58]. However, the context of IT security and privacy is of immediate relevance due to the ongoing proliferation of data, including sensitive information that can have significant personal and economic impact when exploited. As cybersecurity is also a multidisciplinary field, it encounters a distinct set of challenges not necessarily found in all scientific disciplines. Some disciplines, such as cognitive performance or personality research, have more established and homogeneous methodological approaches, where our recommendations might not find as well-suited an environment [61, 89, 98].
This systematic review was crucial for substantiating our recommendations empirically. While one could also draw empirical evidence, particularly concerning robustness, from replications of specific studies, our primary objective was to evaluate the overall extent of heterogeneity in the cybersecurity SE literature across publications. Prior to this review, it was unclear whether researchers' current appreciation of methodological consistency would already render our recommendations unnecessary. Yet our review provides empirical support for their indispensability: whether it is to (a) encourage a shift towards greater transparency and replicability in methodological practices and reporting standards, or (b) raise the public's awareness of the validity (or lack thereof) of existing recommendations for motivating security behavior.

Limitations
There are two important types of limitations inherent in this literature review: (a) limitations of evidence and inferences, and (b) limitations of review methods. The former is mainly shaped by the simple difference between reporting standards (or, more likely, reporting constraints such as limited word counts for publications) and the back-end research processes actually performed. This limits the possible inferences regarding the current heterogeneity of the literature, two of which we highlight as examples: First, more detailed information on scales (e.g., standard reporting of items) might have led to a different estimate of the number of (unique) cybersecurity SE measures in use. Second, structured reporting of the scale development process might have revealed more commonalities between scales. Beyond missing information, the measurement heterogeneity would be smaller if, on an empirical level, differently constructed cybersecurity SE scales captured the same individual manifestations (i.e., comparable mean cybersecurity SE scores). The consequences of using different measures would thus be mitigated if scores were identical, indicating that no jingle fallacy occurred.
Similar arguments can be made with regard to the large number of cybersecurity SE cause/outcome constructs. Similar operationalizations of some of these would imply redundant constructs (see also the "jangle fallacy"; cf. 80), and hence the picture of cybersecurity SE's role within frameworks would be more consistent than it might appear. The prominence of theories might also be more evident if publications consistently referred to one perspective throughout, but this was not observed. Therefore, we concentrated on hypothesized causality, often depicted as measurement models, which did not always align with any specific theory. One might also consider the possibility that although constructs were formulated and measured as behavioral intentions, authors were in fact hypothesizing direct effects on behavior but did not have the resources or opportunity to implement a behavioral measure. All four aspects could be causes of apparent fragmentation across publications.
Other limitations concern the quality of the reported scale validation techniques. There were profound differences in the quality of the performed studies that were not highlighted in our review. In particular, we found that the methods used for construct validity did not consistently reflect an understanding of the purpose of demonstrating convergent and discriminant validity. As for interventions to foster cybersecurity SE, the theoretical mechanisms were in most cases only implicitly traceable (e.g., the connection between implemented cybersecurity activities and mastery experience [see 21]) and often not explicitly justified. Other interventions in this review incidentally showed an intervention effect on cybersecurity SE; however, those were not explicitly designed to affect cybersecurity SE and were not included as cybersecurity SE interventions due to their lack of a theoretical rationale.
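For readers unfamiliar with that purpose: convergent validity expects high correlations between instruments targeting the same construct, whereas discriminant validity expects low correlations with instruments targeting different constructs. A minimal sketch of this basic check, using simulated and entirely hypothetical scores, could look as follows.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 300

# Simulated (hypothetical) participant scores: two SE scales tapping the
# same latent trait, plus a scale measuring an unrelated construct.
latent_se = rng.normal(size=n)
se_scale_a = latent_se + rng.normal(scale=0.5, size=n)
se_scale_b = latent_se + rng.normal(scale=0.5, size=n)
unrelated = rng.normal(size=n)

r_convergent = np.corrcoef(se_scale_a, se_scale_b)[0, 1]   # expected: high
r_discriminant = np.corrcoef(se_scale_a, unrelated)[0, 1]  # expected: near zero
print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
```

Reporting only one of these two correlations, as we frequently observed, leaves the construct validity argument incomplete.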
Limitations of review methods involve the date of data collection, search strategies, and the coding process. Data collection occurred in March 2021, excluding more recent publications in the field. Periodically updating this review will eventually enable a trend analysis of the methods used. This is a call to future work, as we find it valuable to consolidate practices and their pattern of progression. Updating is also of great importance when reviews synthesize meta-analytic evidence on substantive research questions about study outcomes (e.g., does cybersecurity SE predict security behaviors?), where omitting new studies is a relevant issue. The objective of this review was to assess the heterogeneity of research practices (see goals 1-3), and these findings remain valid as (a) they reflect the respective understanding of the subject matter and (b) they continue to be relevant even for the latest publications in the field, which still reveal non-adoption of methodological standardization [such as, 36, 38, 44, 52, 53, 86, 91, 95, 101, 102, 104, 114, 145, 147, 157].
Additionally, biases could result from search terms we may have missed (e.g., names of brand-specific IT devices) or from unpublished studies remaining undiscovered in the file drawer. If those works relied more homogeneously on similar measures and theoretical principles, they would shift our review findings towards a more unified cybersecurity SE literature. Finally, although the inter-rater agreement coefficient for nominal data was satisfactory, there were some differences in coding, ultimately resolved by group discussion, where the research left too much room for interpretation.
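For illustration, one common agreement coefficient for nominal codes is Cohen's kappa, which corrects raw agreement for agreement expected by chance; we show it here as an assumed example of such a coefficient, with hypothetical coding data.

```python
from collections import Counter

def cohen_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two coders for nominal codes."""
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    marg_a, marg_b = Counter(codes_a), Counter(codes_b)
    expected = sum(marg_a[c] * marg_b[c] for c in marg_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical double coding of six publications into measure categories.
coder_1 = ["new", "adapted", "adapted", "recurring", "new", "adapted"]
coder_2 = ["new", "adapted", "new", "recurring", "new", "adapted"]
print(f"kappa = {cohen_kappa(coder_1, coder_2):.2f}")  # prints kappa = 0.74
```

The chance correction matters precisely in reviews like ours, where a few categories dominate and raw percentage agreement would overstate coder consistency.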

Conclusion
This systematic literature review paints a fragmented picture of current cybersecurity SE research methods. Over the past decade, studies on SE and IT security have revealed limited use of standardized measurement, model, and intervention methods, which constrains our ability to draw meaningful conclusions on the subject. We identified 168 relevant publications for synthesis, including 173 cybersecurity SE measures. Most indicated good reliability coefficients but lacked essential validity analyses. There were 173 outcome as well as 103 cause variables, some having ambiguous causal links, and some being conceptualized as both outcome and cause of cybersecurity SE. Of 13 intervention studies to improve cybersecurity SE, none was replicated. The lack of consensus might be rooted in the current state of self-efficacy theories that prevail side by side, resulting in deviating methods. The field's multidisciplinary nature may be another important contextual factor, as each field may prioritize aims other than the replicability of findings beyond its discipline. We propose steps that we hope will encourage a shift towards greater consistency in cybersecurity SE methods. These recommendations will enable researchers to assess more clearly whether the presumed relevance of self-efficacy for security behaviors justifies today's strong visibility of cybersecurity SE research, and they will provide practitioners with effective material to impact modern IT security and privacy.

A PEER-REVIEWED AND NON-PEER-REVIEWED PUBLICATIONS
Table A1 in this Appendix shows results for key variables of interest when separate analyses were run for peer-reviewed and non-peer-reviewed publications. Note that non-peer-reviewed publications report a higher number of scales relative to the number of publications than peer-reviewed publications do.

C NEWLY CREATED AND ADAPTED MEASURES
Table C1 in this Appendix presents the outcomes for key variables of interest from separate analyses conducted for both newly created and adapted measures.We categorized measures as newly created if they did not report original authors for their item compositions.

D CYBERSECURITY SE RESEARCH FRAMEWORKS
The three Tables in this Appendix list all the identified and collapsed variables that are hypothesized to either influence cybersecurity SE (see Table D1), be influenced by cybersecurity SE (see Table D2), or both (see Table D3). Tables are sorted by frequency and then in alphabetical order.

Figure 1: PRISMA Flow Diagram of Study Selection. Note. *Multiple reasons may apply.

Figure 2: Histogram of Measure and Study Publication Rates

Figure 3: Network Graph of Authors Developing Cybersecurity Self-Efficacy Measures. Note. The ten most referenced publications as sources for item composition within this review are labelled. Colors are specific to each reported (non-)reference. To identify the authors of each node, please use our interactive HTML widget of the network provided on OSF (file link: authors network).

Table 1: Study Participants in Sequence of Import into Citavi

4.1.1 Heterogeneity of cybersecurity SE measures. Across 174 studies, we found 173 unique cybersecurity SE measures. A data set of all unique cybersecurity SE measures that includes publication information, scale name, referenced authors for item composition, and items is provided on the OSF: measures data set. Figure 2 visualizes the publication rate of measures in relation to studies for the review period under consideration. In this figure, no consolidation of measures after He et al.'s [76] review publication in 2014 can be observed. Of these 173 unique measures, only 5 were used more than once. No measure was used more than three times. Collapsing versions of measures (i.e., treating versions with minor word changes as the same) still yielded 155 cybersecurity SE measures (of which 9 were used more than once).

Table 2: Reliability Overview of Cybersecurity Self-Efficacy Measures

4.1.5 Validity of cybersecurity SE measures. Validity information was reported by 10 studies with recurring scales (55.6%) and 117 studies with first use scales (75%). Table 3 provides an overview of reported validity types. As one strategy for content validity, 45 of 173 measures (26.01%) consulted experts to assess items.

Table 3: Validity Overview of Cybersecurity Self-Efficacy Measures

Table B1 and Table B2 in this Appendix show whether or not information about certain criteria concerning the rigor of the scale development process was reported in sufficient detail, for first use and recurring measures respectively. The Tables are arranged by scale number for easier comparison of related measures.

Table A1: Results for Separate Analysis of Peer-Reviewed and Non-Peer-Reviewed Publications. Note. Data from publications that did not provide information about the review process (n = 18) was not included in this exploratory analysis.

Table B1: Summary of Criteria Reported with Respect to the Methodological Rigor of First Use Measures. Note. 1: information about the criterion was reported; 0: information about the criterion was not reported; α: coefficient alpha; CR: composite reliability.

Table B2: Summary of Criteria Reported with Respect to the Methodological Rigor of Recurring Measures. Note. 1: information about the criterion was reported; 0: information about the criterion was not reported; α: coefficient alpha; CR: composite reliability.

Table C1: Results for Separate Analysis of Newly Created and Adapted Measures. Note. This exploratory analysis divides the outcomes reported in the results section for first use scales into newly created and adapted scales; [1] One study may report multiple estimates or types.