Certification Labels for Trustworthy AI: Insights From an Empirical Mixed-Method Study

Auditing plays a pivotal role in the development of trustworthy AI. However, current research primarily focuses on creating auditable AI documentation, which is intended for regulators and experts rather than end-users affected by AI decisions. How to communicate to members of the public that an AI has been audited and considered trustworthy remains an open challenge. This study empirically investigated certification labels as a promising solution. Through interviews (N = 12) and a census-representative survey (N = 302), we investigated end-users’ attitudes toward certification labels and their effectiveness in communicating trustworthiness in low- and high-stakes AI scenarios. Based on the survey results, we demonstrate that labels can significantly increase end-users’ trust and willingness to use AI in both low- and high-stakes scenarios. However, end-users’ preferences for certification labels and their effect on trust and willingness to use AI were more pronounced in high-stake scenarios. Qualitative content analysis of the interviews revealed opportunities and limitations of certification labels, as well as facilitators and inhibitors for the effective use of labels in the context of AI. For example, while certification labels can mitigate data-related concerns expressed by end-users (e.g., privacy and data protection), other concerns (e.g., model performance) are more challenging to address. Our study provides valuable insights and recommendations for designing and implementing certification labels as a promising constituent within the trustworthy AI ecosystem.


INTRODUCTION
In recent years, artificial intelligence (AI) has advanced rapidly across all sectors of society, with the promise of transforming our lives. AI is increasingly guiding our consumer choices [52], reshaping services by automating tasks [28], assisting managers in hiring decisions [42], and augmenting clinical decision-making [71]. In light of increasingly ubiquitous AI and its profound impact on human lives, various government institutions, scientific communities, and the general public are engaged in a widespread discourse on how to ensure trustworthy AI [31,33,36,43] for both low- and high-stake scenarios [11].
To this end, a large body of work has focused on identifying the principles that underlie trustworthy AI [36]. They include mitigating bias and unfairness in AI systems [41], explaining the reasoning of AI decisions [39], setting up mechanisms to hold AI accountable [36], and ensuring user privacy [60]. However, as trust is determined by people's perception [40,43], efforts to design trustworthy AI are hampered by a lack of understanding of how to communicate trustworthiness to people, for instance, through documentation or other transparency affordances [43]. Particularly for end-users¹, trusting AI can be a challenge, as they lack the necessary expertise and knowledge to evaluate the various trustworthiness principles (e.g., robustness, privacy, fairness) [4,37].
Motivated by these challenges, this work builds on research highlighting the pivotal role of auditability as an enabler of trust in AI [7,65] and its crucial role in creating an "AI trustworthiness ecosystem" [2] by ensuring that the principles of trustworthy AI are met. Auditing refers to mechanisms that evaluate and ensure compliance with regulations and ethical standards [54]. Various methods have been proposed to increase the transparency, and thereby the auditability, of AI systems, such as model documentation or information about datasets [14,21]. While AI documentation is a valuable artifact for informing audit decisions, it is tailored to regulators and experts and is not intended to certify and communicate to end-users that an AI has met the auditing criteria.
For this reason, our work focuses on communicating the outcomes of auditing processes to end-users, a topic that has received little attention in previous work. Specifically, we investigate the use of certification labels, which are commonly used in other domains, such as food and energy [10,16,62]. Certification labels are relevant in the context of trustworthy AI for three reasons. First, through the use of simple language, icons, or color-coding, they are usually designed to be accessible to various stakeholder groups, including end-users with limited knowledge and time [24]. Second, if reflecting a genuine and credible auditing process, certification labels can communicate the criteria used in an audit, thereby serving as a "trustworthiness cue" for end-users [44,57]. Third, labels have been shown to promote the trustworthiness of a product in other domains [64] that face similar challenges in certifying that a product meets certain criteria, such as agricultural standards (e.g., organic foods [16]) or low ecological impact (e.g., sustainable hotels [10]). However, end-users' attitudes toward AI certification labels and their effectiveness in communicating the trustworthiness of AI remain to be explored.
We addressed this gap by conducting a mixed-method study with both interviews (N = 12) and a census-representative survey (N = 302) with end-users. Our results provide evidence that certification labels can effectively communicate AI trustworthiness. Qualitative findings revealed that end-users have positive attitudes toward AI certification labels and that labels can increase perceived transparency and fairness and are regarded as an opportunity to establish standards for AI systems. Particularly, data-related concerns expressed by end-users, such as privacy and data protection, can be mitigated through the use of certification labels. However, labels may not be able to address all raised concerns, such as model performance, suggesting that they should be considered one promising constituent among others for trustworthy AI. Furthermore, our results provide insights into facilitators and inhibitors for the effective design of certification labels in the context of AI. For example, end-users expressed strong preferences for independent audits and highlighted the challenge of communicating subjective criteria such as "fairness," whose meaning can be ambiguous.
Quantitative findings showed that a certification label significantly increases end-users' trust and willingness to use AI in both low- and high-stake AI scenarios. Nevertheless, end-users reported a higher preference for certification labels in high-stake scenarios (e.g., hiring procedure) than in low-stake scenarios (e.g., price comparison), and the positive effect of a label on trust and willingness to use AI was more pronounced in high-stake scenarios. This suggests that compliance with mandatory requirements for AI in high-stake scenarios could be effectively communicated to end-users through certification labels in addition to the proposed voluntary labeling for low-stake AI scenarios [11,61].
To summarize, our study is the first to demonstrate the potential of certification labels as a promising approach for communicating to end-users that an audit has certified an AI to be trustworthy. We contribute to the trustworthy AI literature by highlighting opportunities and challenges for designing and effectively implementing certification labels.

AUDITING FOR TRUSTWORTHY AI
A growing body of work recognizes the critical role of algorithmic or AI auditing in enabling the trustworthiness of AI systems [2,37,65]. Prior work suggests that auditing improves fairness [69], accountability [13], and governance [17], among others. These elements are considered to contribute to trust in and acceptance of AI². Moreover, audits have the ability to expose problematic behavior, such as algorithmic discrimination, distortion, exploitation, and misjudgment [3]. In safety-critical industries such as aerospace, medicine, and finance, audits are a long-standing practice [13]. However, only recently have researchers recognized that these areas could inform AI auditing and acknowledged the importance of considering insights from the social sciences, where audits have emerged from efforts toward racial equity and social justice [66].
While the importance of AI auditing has been identified, the development of common audit practices, standards, or regulatory guidance is ongoing [3,13], and efforts to create auditing frameworks throughout the AI development life-cycle are still in their early stages [54]. Auditing can be defined as "an independent evaluation of conformance of software products and processes to applicable regulations, standards, guidelines, plans, specifications, and procedures" [29, p. 30]. At least three types of AI auditing can be distinguished, including first-party internal auditing, second-party audits conducted by contractors, and independent third-party audits [13]. However, whether auditing should be conducted by independent third parties or internally within organizations is a topic of ongoing academic discussion [17,38,54], with both approaches having their advantages and drawbacks. Raji et al. argue that external auditing may be constrained by a lack of access to organizations' internal processes and information that are often subject to trade secrets. In contrast, Falco et al. point out that the outcomes of internal audits are typically not publicly disclosed and that it often remains unclear whether the auditor's recommendations are effectively implemented or not. The question of whether end-users prefer internal or external audits remains to be investigated.
In addition to defining standards and best practices for AI auditing, it is crucial to consider how the outcomes of audits can be communicated to different stakeholders with varying knowledge and needs [72]. Current research has mainly focused on approaches for documenting machine learning (ML) models and training datasets. These artifacts play an important role in the AI trustworthiness ecosystem by increasing transparency and allowing auditors and regulators to determine whether principles of trustworthy AI (e.g., fairness, robustness, privacy [36]) have been met [37]. For example, "model cards" [14,49] disclose information about a model's purpose and design process, its underlying assumptions, and the model's performance characteristics. Similarly, Gebru et al. introduced "datasheets," which summarize the motivation, composition, collection process, and recommended uses for datasets, and Floridi et al. recommended the use of "summary datasheets" and "external scorecards." The former is aligned with the goals of "datasheets" and synthesizes key information about the AI, including its purpose, status, and contact information. The latter is conceptually closely related to "model cards" and evaluates the AI system along several dimensions to form an overall risk score [18].
However, this documentation is tailored to AI practitioners and regulators [37,58,72], rather than end-users affected by AI decisions. Often, end-users have neither the access nor the expertise to understand the technical information that AI documentation provides [1]. It is unlikely that end-users can effectively utilize ML model documentation or data documentation to make informed judgments about trusting or using AI [37]. For this reason, end-users depend on auditors and regulators who can use these artifacts to verify and ensure the trustworthiness of AI. Yet, how to effectively communicate to end-users that an audit has considered an AI trustworthy remains an open research question. End-users require accessible communication tailored to their specific values and concerns [72]. A potentially effective way to provide such information is through the use of certification labels, which we introduce in the following section.

CERTIFICATION LABELS FOR AUDITED AI
Labels are widely used for displaying specific product or service attributes to help consumers make more informed decisions. They are well-established in various fields, such as agriculture [23], food [34], energy [59], and e-commerce [63]. Different kinds of labels exist, and various classification systems have been proposed [30,61,62]. For example, in the food industry, "nutrition labels" provide consumers with simplified and easily understandable information to identify a product's nutritional content. While this information can also be found in detailed tables on the back of food packaging, it is too complex for many consumers, a challenge similar to the one end-users face with AI documentation. This is where labels can provide information in a clear and accessible manner, utilizing simple language, icons, and color coding, which makes labels accessible to individuals from different backgrounds [22,24]. Prior work in consumer research has shown that labels can communicate the outcomes of audits and thereby enhance trust in a product [64].
In this study, we focus on certification labels, which certify that a product or service meets one or several criteria and are thus suitable for the case of audited AI. Certification labels are exclusively awarded to products that have undergone an auditing process, typically conducted by a third-party organization [62]. By communicating an institutional assurance of trustworthiness, third-party organizations can serve as "trust surrogates" for the consumer, shifting the trust relation from trust in the AI to trust in the institution that provides the certification [64]. In this case, a certification label serves as a trustworthiness cue [57] that signals compliance with governance structures. Our work thus closely aligns with the proposal by Liao and Sundar, which highlights that the trustworthiness of AI is not inherently given but must be communicated and perceived as such by the user, for instance, through transparency affordances. According to the authors, people then use heuristics (i.e., mental rules of thumb) to evaluate these affordance cues to form judgments about the trustworthiness of AI. The authors further suggest that certifications from regulatory bodies that have audited the AI could serve as trustworthiness cues, invoking these heuristics. Therefore, certification labels in the context of AI are a promising approach to communicate that a regulatory body has audited an AI and considered it trustworthy.
There have been several initiatives at a national and international level to introduce AI labels in both industry (e.g., [20], [25], [19]) and government (e.g., [15], [46]). These initiatives vary in their intended scope but are mostly still in an early stage. Previous studies have also emphasized the potential of labels as a means of AI certification [27,58,61]. Holland et al. proposed the concept of a "Data Set Nutrition Label," which would summarize key aspects of a dataset (e.g., metadata and the data source) prior to the development of ML models. Seifert et al. further suggested labels for trained ML models that independent reviewers have evaluated based on properties such as accuracy, fairness, and transparency. A recent study by Stuurman and Lachaud commented on various labels to provide information to end-users affected by AI decisions. Drawing from the EU Act on AI [12], the study distinguished between low-stake and high-stake AI systems and proposed a voluntary labeling system for AI not considered high-stake. This distinction aligns with recommendations from the EU's "white paper on artificial intelligence" [11], which encourages organizations to use labels to demonstrate the trustworthiness of their AI-based products and services. A survey conducted with individuals and organizations directly or indirectly engaged in audits found that while respondents believed that AI audits should be mandatory, 53% supported mandating them only for high-stakes systems [13]. End-users' perceptions of certification labels in low- and high-stakes AI scenarios have not yet been investigated.
Despite this extensive theoretical work on labels in the context of AI and their gradual adoption in industry and government, there is currently a lack of empirical research exploring end-users' attitudes toward AI certification labels and their effectiveness in communicating trustworthiness in low- and high-stake AI scenarios. This study aims to address this research gap and inform current industry and government initiatives.

RESEARCH QUESTIONS
Based on the aforementioned considerations, we investigated the following research questions:
RQ1: What are end-users' attitudes toward certification labels in the context of AI?
RQ2: How do certification labels affect end-users' trust and willingness to use AI in low- and high-stake scenarios?

METHODS
To answer these research questions, we used a mixed-method research approach consisting of semi-structured interviews and a subsequent survey to collect quantitative data as part of a within-subjects design study. For both the interviews and the survey, we used a scenario-based approach to investigate people's attitudes and the effects of a certification label, inspired by past research [5,32,35].
In the interviews, we asked participants about their attitudes toward AI and certification labels. As a follow-up within-subjects study, we implemented a survey to investigate the effect of a certification label quantitatively. The semi-structured interviews served as a basis for the survey and a means to enrich the quantitative results. The quantitative survey complemented the qualitative interviews by extending our results to a larger census-representative sample.
In the following, we will introduce the certification label used in our study before describing the procedures of each method in more detail.

The certification label
To investigate labels in the context of AI, we used a certification label that has already been developed for the broader context of digital trust. Using an existing label had the advantage that it had undergone an extensive design process and thus did not need to be created from scratch. The non-profit foundation Swiss Digital Initiative laid the groundwork for developing this certification label. At the label's core lies a catalog of verifiable and auditable criteria, co-developed by an academic expert group based on a user study on digital trust. A panel of independent experts from academia, data and consumer protection, and digital ethics further developed the label catalog. Involving digital service providers and auditors in the design process ensured that the criteria were auditable and verifiable. The catalog that forms the basis of the audit currently contains 35 criteria that are summarized into four categories:
(1) Security (criteria 1-12): What is the security standard? The service provider shall, e.g., ensure that the data is encrypted in transit so that third parties cannot access it.
(2) Data protection (criteria 13-20): How is the data protected? The service provider shall, e.g., assume responsibility for the appropriate management of the data.
(3) Reliability (criteria 21-29): How reliable is the service or product? The service provider shall, e.g., take all actions required to safeguard the continuity of the service.
(4) Fair user interaction (criteria 30-35): Is automated decision-making involved? The service provider shall, e.g., ensure that all users receive equal treatment and that there is no data-based service or price discrimination.
If an organization would like its digital product or service (e.g., a chatbot) to receive the certification label, it can voluntarily request an audit and thus participate in the certification process. After a scoping call with third-party auditors, an audit is performed along the criteria catalog. The audit leads to an audit report detailing the performance per criterion, which is double-checked by an independent label certification committee composed of auditing experts. If non-conformities are identified, the organization applying for the label must fix the identified issues, e.g., adjust its privacy policy. After a successful auditing report, the certification label is awarded for a period of three years with two audits during that period.

Scenario selection
Participants were presented with real-world examples of AI systems, adapted from Kapania et al., namely medical diagnosis, loan approval, hiring procedure, music preference, route planning, and price comparison (see materials on OSF: https://osf.io/gzp5k/). One advantage of using hypothetical scenarios instead of real consumer applications is that differences in participants' prior experience with the applications can be controlled for. Kapania et al. and Woods et al. proposed that people's behavior in scenario-based experiments corresponds to their real-life behavior. To answer our second research question and following Kapania et al., we explored both low-stake scenarios (music preference, route planning, price comparison) and high-stake scenarios (medical diagnosis, hiring procedure, loan approval). This distinction was crucial since other researchers [18,61] and the "EU AI Act" [12] have discussed the use of AI labels for "low-stake" and "high-stake" scenarios. This classification was based on the AI's respective impact on affected parties and the involvement of significant risks, in particular with respect to safety, consumer rights, and the use of personal data.

Interviews
Participants.
Initially, we invited 16 participants to an interview on-site at the university. The recruitment was carried out through a university-internal database and an online marketplace where scientific studies can be advertised. To ensure that our sample consisted of end-users (i.e., laypeople who may be affected directly or indirectly by the outcomes of AI systems), we used screening questions following Kapania et al. and asked potential participants about their knowledge of AI and experience working with AI-based systems. We selected participants who indicated that they had heard about AI but did not work with it and who provided a comprehensible description or adequate example of what AI is. We did not overly restrict the valid responses (e.g., "robots" was valid, while obvious nonsense answers such as "E.T. the alien" were deemed invalid). In addition, we asked participants to indicate their age, gender, profession, and English language proficiency so that we could make the interview sample as balanced as possible and present materials in English. However, four interviews did not take place due to no-shows. We, therefore, conducted 12 interviews with end-users of different backgrounds, ages, and genders that lasted 60-90 minutes. The interviews were conducted in German and recorded through field notes and audio recordings. Each participant received compensation in the form of a gift card worth CHF 10.00 from a Swiss retail company. The final sample (age: M = 35.42, SD = 12.50, min = 23, max = 66) consisted of students (P2, P3, P4, P8, P11) enrolled in linguistics and literature (P2), fine arts (P3), and psychology (P4, P8, P11), as well as individuals who described their occupation as a bike messenger (P12), waitress (P1), dancer (P9), course manager (P7), management assistant (P6), intern (P10), and retired teacher (P5). The sample was predominantly female, with ten women and two men.

Procedure.
Before the interviews, participants had to read and sign a declaration of consent. In the declaration, we informed participants of the purpose and rationale of the study, the researcher affiliations, the voluntary nature of study participation, and how their data would be analyzed and shared. All personally identifiable information was deleted to ensure privacy, and the anonymous data was stored without actual reference to the participants.
During the interviews, we asked attitudinal questions about AI, specifically where participants saw opportunities and challenges in using AI. We then presented the six scenarios to the participants without specifying the low- and high-stake categorization we had made in advance. Based on the respective headings of the scenarios (e.g., music preference), without further information, we asked participants to order the scenarios via drag and drop from "most impactful" (rank 1) to "least impactful" (rank 6). To ensure comparability, we defined "most impactful" for participants as "the scenario that would have the greatest impact on your personal life." This question aimed to validate our categorization in low- and high-stake scenarios. Next, we presented participants with one low-stake and one high-stake scenario and asked how they differed from one another. After this, participants were introduced to the certification label and asked how they perceived it, whether the label criteria were comprehensible or not, and where they saw opportunities and drawbacks of a certification label. The goal of the interviews was not only to gather qualitative data, but also to identify and determine which questions best suited the subsequent survey. We, therefore, made sure the questions were comprehensible and free of ambiguities. Any difficulties encountered during the interviews were discussed within the research team, and, if necessary, the respective questions were revised or removed. We refer to the digital repository for the complete interview manual.
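One way such a ranking task could be used to check the low- and high-stake categorization is to compare the mean rank of the two scenario groups. The following is a minimal sketch in Python under stated assumptions: the three rankings are invented for illustration and are not data from the study, and the function name `mean_rank_by_category` is hypothetical. The source does not state how the ranking responses were actually evaluated.

```python
from statistics import mean

# Scenario groups as defined in the study's categorization.
HIGH_STAKE = {"medical diagnosis", "hiring procedure", "loan approval"}
LOW_STAKE = {"music preference", "route planning", "price comparison"}

# Hypothetical rankings (position 1 = "most impactful"); invented for illustration only.
rankings = [
    ["medical diagnosis", "loan approval", "hiring procedure",
     "route planning", "price comparison", "music preference"],
    ["medical diagnosis", "hiring procedure", "loan approval",
     "price comparison", "route planning", "music preference"],
    ["loan approval", "medical diagnosis", "route planning",
     "hiring procedure", "price comparison", "music preference"],
]


def mean_rank_by_category(all_rankings):
    """Return (mean rank of high-stake scenarios, mean rank of low-stake scenarios)."""
    high_ranks, low_ranks = [], []
    for order in all_rankings:
        for position, scenario in enumerate(order, start=1):
            if scenario in HIGH_STAKE:
                high_ranks.append(position)
            else:
                low_ranks.append(position)
    return mean(high_ranks), mean(low_ranks)


high_mean, low_mean = mean_rank_by_category(rankings)
# A lower mean rank means the group was judged more impactful.
categorization_supported = high_mean < low_mean
```

If the high-stake group receives the lower mean rank, the participants' ordering is consistent with the categorization made in advance.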

Participants.
To gain insights into how a general population perceives a label in the context of AI, we hired a market research agency (https://www.bilendi.ch/) to provide us with a Swiss census-representative sample regarding age and gender (quota sampling). We used the same screening questions as in the interviews and initially recruited 395 participants, who received CHF 3.00 for taking part in the 15-minute online survey. Following a quality assessment using a self-reported single item as an indicator of careless responding [6,48], 302 participants remained for data analysis. The sample is census-representative regarding age (M = 43.88, SD = 16.08, min = 18, max = 79) and the gender distribution (150 women, 151 men, one non-binary person).

Procedure and measures.
The survey consisted of three parts. First, after providing informed consent and a brief introduction to the study, participants were free to select one scenario from the low-stake and one from the high-stake category. After making their choice, they received full descriptions of the two scenarios (see Appendix A) and were asked to rate their trust ("how much would you trust the AI in the scenario presented?") and willingness to use ("how much would you be willing to use the AI in the scenario presented?") on a scale from 0 (= not at all) to 100 (= absolutely). In addition, participants were asked in which scenario they would more readily accept the AI's decision/recommendation (i.e., "in which of the two scenarios would you be more willing to accept the decision/recommendation made by AI?").
Participants were introduced to the certification label in the second part of the survey. They were asked for their impression and rated the importance of each criterion (i.e., "how important are the label criteria for you in the context of AI?") on a scale from 0 (= not at all) to 100 (= absolutely). Participants were also asked what effect the certification label had on their acceptance (i.e., "would you be more likely to accept an AI's decision/recommendation if it had received a label?") and preference (i.e., "in which one of the two scenarios would you prefer the use of a label?"). To understand end-users' preferences regarding external and internal auditing, we included an open-ended question (i.e., "who do you think should be responsible for awarding such a label?").
Finally, in the third part, we again let participants rate the AI in the same low- and high-stake scenario on trust and willingness to use, this time with the information that the AI had been awarded a certification label. This second assessment allowed us to examine the certification label's effect on trust and willingness to use ratings. As in the first assessment, we asked participants to justify their ratings and to explain why the label led to increased, decreased, or unchanged ratings. At the end of the survey, we asked participants for feedback and, as an additional quality check, posed the question, "in your honest opinion, should we use your data in our analyses in this study? Do not worry, this will not affect your payment. You will receive the compensation either way." The complete survey can be found on the digital repository.

Analysis and coding procedure
We used the qualitative interview data to answer RQ1 and the quantitative survey data to answer RQ2. The interview data was evaluated using qualitative content analysis [47], more specifically summarizing content analysis. We followed the procedure according to Mayring and Fenzl by determining the coding unit, paraphrasing, generalization to the level of abstraction, first reduction, and second reduction to form a cross-case category system. Coding was carried out by three researchers who independently went through four interviews each. To ensure consistency, one interview was evaluated by all researchers. Any ambiguities and discrepancies were resolved through open discussions, and the final cross-case category system was formed in a group session. The quantitative data analysis was carried out in R (version 4.2.2 [53]). We used the ggstatsplot package (version 0.9.1 [51]) to conduct statistical testing and report test statistics, standard deviations, and the corresponding p-values. We set the level of statistical significance to α = .05.
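To illustrate the logic of a within-subjects comparison at α = .05, the following minimal Python sketch compares ratings given before and after the label information. It is not the analysis reported here, which was run with ggstatsplot in R. As a stand-in that needs only the standard library, the sketch uses an exact sign-flip permutation test on the paired differences. The eight rating pairs are invented for illustration, and the function name `paired_sign_flip_test` is hypothetical.

```python
from itertools import product


def paired_sign_flip_test(before, after):
    """Exact two-sided sign-flip permutation test on paired differences.

    Returns the mean change (after minus before) and the p-value, i.e., the
    share of all sign assignments whose absolute summed difference is at
    least as large as the observed one.
    """
    diffs = [a - b for b, a in zip(before, after)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return sum(diffs) / len(diffs), extreme / total


# Hypothetical 0-100 trust ratings for eight respondents; invented for illustration only.
trust_without_label = [40, 55, 30, 62, 48, 70, 25, 51]
trust_with_label = [52, 60, 45, 70, 50, 78, 40, 63]

ALPHA = 0.05  # significance level stated in the text
mean_change, p_value = paired_sign_flip_test(trust_without_label, trust_with_label)
significant = p_value < ALPHA
```

Enumerating all sign assignments is only feasible for small samples; for a sample of several hundred respondents, a parametric paired test or a randomly sampled permutation test would be used instead.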

Attitudes toward certification labels
The content analysis of the interview data resulted in 127 case-specific categories, which were further consolidated across participants into 25 categories. These cross-case categories were grouped into the following topics: "AI-related concerns, risks, problems," "AI-related opportunities, advantages," "attitudes toward certification labels," and perceived "differences between low- and high-stakes scenarios." For the purpose of this study, we focus on the topic "attitudes toward certification labels," as this was the most relevant to our current research objective. Categories may consist of further subcategories. Table 1 contains the subcategories and corresponding example quotes from end-users' attitudes toward certification labels. The complete content analysis with all topics is available on the digital repository.

Opportunities and facilitators.
Participants in the interview study indicated that the label covered essential concerns. The content analysis revealed that the topic "concerns, risks, and problems" predominantly consisted of data-related concerns such as data privacy (i.e., protecting data from attack and malicious use), data storage (i.e., how data is handled and stored), and third-party involvement (i.e., unwanted and unknown disclosure of data). Regarding data-related concerns, a certification label for AI systems was perceived as an effective tool to convey compliance with these requirements and hold the certified parties more accountable. In particular, the security and data protection criteria were perceived as minimal standards that must be met for them to consider using AI.
Participants emphasized that a certification label provides a certain level of transparency that removes the burden of examining these criteria from end-users. In addition, they viewed the certification labels and corresponding auditing process as an opportunity for more fairness and to establish standards for AI systems, allowing them to compare products and services critically. The interviewed participants indicated that a certification label could increase their trust for all these reasons.
For a label to be convincing, participants emphasized that additional information regarding the label is needed. This includes information about the label's criteria (i.e., how were they formed?), the auditing process itself (i.e., how were these criteria weighted?), and the auditors (i.e., who was responsible for awarding a label?). Participants also placed a strong emphasis on the independence of the auditing process, noting that the auditors should have no financial ties to or other direct dependencies on the organizations for whose products or services the label is awarded in order not to undermine their credibility. Additionally, participants stressed the importance of widespread participation in the auditing and certification process, as this was deemed necessary for adopting AI standards and the label's credibility. As a crucial factor for the effectiveness of a certification label, participants identified regular updates that align with industry standards and best practices to ensure that the label remains relevant and useful.

Limitations and inhibitors.
While participants acknowledged that a certification label covers essential issues, they also noted that it does not address all their AI-related concerns. These concerns included the absence of model performance indicators (e.g., accuracy measures). Some participants noted that a certification label alone could even lead to "blind trust" in AI systems without accuracy measures. Additionally, participants noted that while a certification label provides some level of transparency, it does not provide complete documentation (e.g., source code) of the AI system or the ethical reasoning behind the auditors' decision to approve the use of AI in a particular application in the first place. As a result of these limitations, participants felt that a certification label might not be sufficiently persuasive to convey trustworthiness to critical individuals.
Furthermore, participants identified several reasons why a certification label may not be effective. One reason was a potential overabundance of labels with different standards, diluting compliance with regulations and leading to confusion among end-users. In line with this, participants emphasized the importance of ensuring that the label's criteria are not just "empty promises" but that they are actually adhered to by organizations. They also pointed out the difficulty of measuring the label's criteria and the degree of subjectivity involved. Concepts such as security and fairness can mean different things to different people. Results showed that some criteria were more easily understood (e.g., security) than others (e.g., fair user interaction). For example, 11/12 participants implied that the definition of the security criterion covered what they had in mind. For data protection, this was the case for 9/12 participants, followed by 8/12 participants for reliability. However, merely 2/12 participants indicated that the criterion "fair user interaction" captured what they thought it would encompass. In addition to these differences in comprehension, participants pointed out conceptual overlaps for some criteria (e.g., security and data protection) that were not readily understood without further clarification. All these factors might diminish the effectiveness of a certification label.

Effects of certification labels
Participants in the survey study were asked to select one case each from the high-stake (medical diagnosis, hiring procedure, loan approval) and one from the low-stake (music preference, route planning, price comparison) scenarios without explicitly being informed of this distinction. Validation of this distinction between low- and high-stake was provided by participants' "impactfulness" rankings. Calculating the mode of these rankings revealed that the three high-stake scenarios were perceived as the most impactful ones (i.e., 1 = medical diagnosis, 2 = hiring process, 3 = loan approval, 4 = price comparison, 5 = music preference, 6 = route planning). The majority of participants indicated that they would be more likely to accept the AI's decision/recommendation in low-risk scenarios (74.2%, N = 224) than in high-risk scenarios (17.9%, N = 54), with 7.9% (N = 24) indicating no preference, which we considered an additional confirmation of the distinctiveness of the two scenarios. Participants in the interview study distinguished between low- and high-stakes scenarios primarily on the level of risk associated with the scenario. They reported that high-stakes scenarios carry higher self-relevance and long-term consequences.
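The reported shares can be reproduced from the raw counts. The following is a minimal Python sketch, assuming the survey sample size of 302 stated in the abstract; the function name `shares` and the dictionary keys are illustrative and not taken from the study materials.

```python
# Counts reported for the question on accepting the AI's decision/recommendation.
# The sample size (302) is the survey size stated in the abstract; all names
# below are illustrative assumptions, not part of the study's materials.
SURVEY_N = 302

acceptance_counts = {
    "low-risk scenarios": 224,
    "high-risk scenarios": 54,
    "no preference": 24,
}


def shares(counts, total):
    """Return each count as a percentage of `total`, rounded to one decimal."""
    if sum(counts.values()) != total:
        raise ValueError("counts do not add up to the sample size")
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


acceptance_shares = shares(acceptance_counts, SURVEY_N)
for label, pct in acceptance_shares.items():
    print(f"{label}: {pct}%")
```

The check inside `shares` guards against counts that do not cover the whole sample, which is a quick way to spot transcription errors in reported frequencies.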
The different ratings depending on low- and high-stake scenarios become evident when considering the violin plots and boxplots (see Figure 2). The ratings for high-stake scenarios are relatively symmetrically distributed across the scale. In contrast, the low-stake scenarios' distribution is heavily left-skewed, with approximately 75% of the data above a rating of 50 for trust and willingness to use. Introducing a certification label for both scenarios leads to a further shift of the distribution to the right and, thus, higher ratings. Plotting the non-aggregated scenarios individually reveals the distributional differences more clearly (see Figure 3). The ratings of the individual high-stake scenarios are more spread out on the scale than in the case of the low-stake scenarios. Differences in the effectiveness of a label also become apparent from this perspective. The median trust and willingness to use ratings increase in all scenarios in the presence of a label, and the increase is more pronounced in the high-stake scenarios.
A majority of the survey participants directly indicated that they would prefer the use of a certification label in the selected high-stake scenario (63.2%, N = 191), compared to preferring a label in the low-stake scenario (22.2%, N = 67), with 14.6% (N = 44) of participants indicating no preference. Regarding the different preferences for certification labels in low- and high-stake scenarios, participants from the interview study expressed a greater demand for a certification label in high-stake scenarios because of the higher scenario complexity, limited individual expertise, and a lack of prior experience with the system. Overall, 81.1% (N = 245) of survey participants stated a preference for using an AI with a label, compared to 6% (N = 18) who would prefer to use an AI without a label and 12.9% (N = 39) who stated no preference. Also, 70.9% (N = 214) indicated that they would be more likely to accept an AI's decision/recommendation if it had received a label, compared to 14.2% (N = 43) who indicated "no," and 14.9% (N = 45) who made no statement. Survey participants rated the importance of the existing label criteria in the context of AI at a high level with similar ratings for security ( = 87. [...] see that come with the use of AI, while 20.9% (N = 63) stated "no" and 23.8% (N = 72) indicated that no statement was possible.
When asked who should be responsible for awarding a label, a majority of participants expressed a preference for external entities to conduct the auditing in their open-ended survey responses, with 48.7% (N = 147) of the answers being coded as "government" and 37.4% (N = 113) as "NGO." Only 5.3% (N = 16) of the answers were coded as "company." Additionally, 8.6% (N = 26) of the responses were coded as "other," which included mentions of entities such as "ethic committee," "consumer protection," or "citizen's association."

DISCUSSION
The quantitative findings reveal that the presence of a certification label significantly increases participants' trust and willingness to use AI in both low- and high-stake scenarios, thereby answering our second research question. Most participants (81%) of the census-representative survey preferred using AI with a certification label, and a large proportion of participants (71%) responded that they would be more likely to accept an AI's decision or recommendation if it had been awarded a certification label. The results further show that a majority of participants (63%) not only indicated a preference for certification labels in high-stake scenarios, but that certification labels also had a larger effect on trust and willingness to use AI in high-stake scenarios. For example, willingness to use ratings for the "hiring procedure" scenario increased from 36 to 64 points, compared to an increase from 75 to 80 points for the "price comparison" scenario. While Stuurman and Lachaud and the EU's "white paper on artificial intelligence" distinguish between regulating high-stake AI through mandatory requirements and proposing voluntary labeling only for low-stake AI, our results demonstrate the relevance of certification labels for end-users, specifically in high-stake scenarios. Based on these findings, we argue that parallel to voluntary labeling for low-stake AI scenarios, compliance with mandatory requirements for AI in high-stake scenarios could also be communicated through certification labels, potentially increasing end-users' trust in and willingness to use awarded AI systems.
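The size of the label effect in the two example scenarios can be made explicit with simple arithmetic. The sketch below uses only the four willingness to use values quoted above; the variable names are illustrative, and T1/T2 follow the figure captions (without and with a label).

```python
# Willingness to use ratings quoted in the text for two scenarios,
# without a label (T1) and with a label (T2). Names are illustrative.
ratings = {
    "hiring procedure (high-stake)": (36, 64),
    "price comparison (low-stake)": (75, 80),
}

# Point gain from T1 to T2 for each scenario.
gains = {scenario: t2 - t1 for scenario, (t1, t2) in ratings.items()}

for scenario, points in gains.items():
    print(f"{scenario}: +{points} points")
```

Under these values, the gain in the high-stake example (28 points) is more than five times the gain in the low-stake example (5 points), which is the contrast the paragraph above describes.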
Qualitative findings allowed us to answer our first research question and provide a more nuanced picture of which aspects to consider for effective certification labels in the context of AI. The certification label we investigated in this study was designed for digital trust more generally. However, end-users' attitudes toward the certification label were primarily positive, and the label's criteria of security, data protection, reliability, and fair user interaction were also relevant to end-users in the context of AI. We derive this from survey participants' high "importance" ratings for the existing label criteria. Concerning opportunities for AI labels, participants in the interview study indicated that a certification label could increase perceived transparency and fairness and serve as a means to establish standards for AI systems. It became apparent from the interviews that certification labels can especially cover end-users' data-related concerns (e.g., privacy, data protection, and third-party involvement) that map to previous work [65].
However, our results also reveal that certification labels have limitations and do not alleviate all issues end-users face regarding the use of AI. Only half of the participants in the survey indicated that a certification label addresses their AI-related concerns/challenges/risks, suggesting that end-users seem to hold differentiated needs. For example, participants in our interviews pointed out that a certification label does not provide indicators about the AI's performance (e.g., accuracy measures). They remarked that performance indicators are essential in deciding in which cases the AI can be trusted and when it must be questioned. This led participants to remark that a label could inadvertently foster "blind trust" if performance indicators are absent. Thus, we suggest that certification labels should either include performance indicators as part of the label criteria or be supplemented with them. Based on these results, we argue that certification labels can more readily signal trustworthiness than untrustworthiness. This is because it is not possible to distinguish whether a digital product or service without a label has not yet been audited or has failed to meet specific audit criteria, particularly if certification labels remain voluntary. We regard certification labels as one component of an "AI trustworthiness ecosystem" [2] that meets essential needs for end-users but which ideally should be combined with other transparency approaches to signal untrustworthiness (e.g., accuracy measures) and form a "chain of trust" [65].
As potential inhibitors for effective certification labels, participants in our interviews pointed out certain overlaps and the subjective nature of the label's criteria. Ultimately, "fairness" and "security" are subjective judgments that vary from one person to the next, and our results showed that the criterion "fair user interaction," in particular, did not reflect what study participants thought it encompassed. The challenge that defining and measuring inherently hard-to-quantify concepts poses for auditing has been discussed by previous research [37,58,66]. Our results indicate that this subjectivity is recognized by end-users and can impair the effectiveness of a label. To avoid a discrepancy between, for example, the auditors' definition of fairness and what people commonly associate with this term, auditors should be in dialogue with end-users so that their values are represented in a label. This is in line with Costanza-Chock et al., who criticized that the involvement of affected communities plays only a minor role in AI audits. They argued that real-world harms and sociological phenomena could only be understood by engaging with people to inform auditing.
Our interview results highlight that end-users request not only information on the label's criteria but also information regarding the criteria content (i.e., how they were formed), the auditing process itself (i.e., how the criteria informed the audit), and particularly about the auditors (i.e., who awarded the label). We identified this demand for additional information as a potential facilitator, indicating that an effective certification label is more than just a list of evaluation criteria. A large majority (86%) of survey participants responded that either the government (49%) or a non-governmental organization (37%) should ideally be responsible for awarding a label, with only 5.3% of responses indicating that a company should be responsible. Participants in the interview study emphasized the auditors' independence (e.g., financially, with no conflict of interest) as a prerequisite for the effectiveness of a certification label. These findings support the notion that auditing can only foster trust if the auditors themselves are trusted [2] and are in line with results of label studies in other domains [23,64], which show that third-party certification positively affects trust in eco-labels. We contribute to the ongoing discussion regarding internal vs. external auditing by showing that end-users favor independent auditors. To account for this independence on the one hand and the structural advantages of internal audits on the other, "cooperative audits" [69] could be a way forward, balancing between the advantages and challenges of the two approaches.
In addition to these facilitators and inhibitors, auditors and regulators should also be mindful that an overabundance of labels with different standards can inhibit the persuasiveness and trustworthiness of their certification label. Such effects have been reported for eco-labels, where an extensive number of existing labels result in different standards that remain unclear to consumers [26]. These findings speak for a certain harmonization and regulation of certification labels. Moreover, organizational compliance with a label's criteria should be established so end-users do not perceive them as "empty promises" but instead as a means for increased accountability for organizations and more trustworthy AI [37]. A prominent instance of such a challenge is the case of the CE (conformité européenne) marking, in which some products use the mark without actually being manufactured to EU quality standards [45]. This illegitimate use has led, among other things, to the introduction of supplementary certification labels to certify product quality, which unintentionally contribute to consumer confusion [61]. To realize their full potential, certification labels should have a thorough auditing process, be regularly updated to reflect current industry standards, and ideally, be used by a wide range of organizations to increase recognition.

LIMITATIONS AND FUTURE WORK
We conducted a within-subjects survey study where participants were presented with the AI scenarios with and without a certification label. While this provided valuable insights into the general effectiveness of certification labels, future work could compare label classes or designs (e.g., nutrition labels vs. certification labels) in a between-subjects experimental design. Certification labels are limited in their ability to communicate untrustworthiness. While other kinds of labels have a more differentiated rating system (e.g., color-codings or grades) that allows comparisons, certification labels only provide dichotomous information by either being present or not. Thus, it is not possible to differentiate whether a product without a certification label is untrustworthy because it failed to meet a label's criteria or has yet to be audited. A between-subjects design could provide evidence about the effectiveness of different kinds of labels and identify the factors that make labels more or less effective in communicating trustworthiness and untrustworthiness.
Moreover, we used single-item questions to measure trust and willingness to use. Trust, in particular, is a complex psychological construct [56] and might not be adequately operationalized using single-item measures. However, a recent study has shown that single-item trust measures are equivalent to validated questionnaires regarding sensitivity to changes in trust and are a reliable tool in longer surveys where questionnaires are not feasible [50]. Future work should confirm the effectiveness of certification labels in fostering trust with validated psychometric measures and explore their effect on trusting dynamics that emerge over time in real-world human-AI interactions.
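To illustrate what the within-subjects design described above implies for analysis, the following Python sketch applies an exact two-sided sign test to paired ratings. This is an illustrative stand-in only: this section does not state which statistical test the study used, the function name is an assumption, and the ratings are hypothetical values made up for the example, not study data.

```python
from math import comb


def sign_test_p_value(before, after):
    """Exact two-sided sign test for paired ratings; tied pairs are dropped."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # P(X <= k) under Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical ratings for eight participants (NOT study data):
# the same people rate a scenario without and then with a label.
without_label = [40, 55, 30, 62, 48, 35, 70, 50]
with_label = [58, 60, 45, 62, 66, 52, 75, 49]

p_value = sign_test_p_value(without_label, with_label)
print(f"two-sided sign test p-value: {p_value}")
```

Because each participant serves as their own control, the test only looks at the direction of each within-person change. A between-subjects design would instead compare independent groups and would call for a different test.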

CONCLUSION
This study empirically investigated certification labels to communicate trustworthy AI to end-users. For this purpose, we explored end-users' attitudes toward certification labels in the context of AI and how labels affect trust and willingness to use AI in both low- and high-stakes scenarios. We used a mixed-methods approach to collect both qualitative and quantitative data through interviews (N = 12) and a census-representative survey (N = 302) with end-users. The quantitative results of this study show that certification labels can be a promising way to communicate the outcome of audits to end-users, increasing both trust and willingness to use AI in low- and high-stake AI scenarios. Based on the qualitative findings, we further identified opportunities and limitations of certification labels, as well as inhibitors and facilitators for the effective design and implementation of certification labels. Our work provides the first empirical evidence that labels may be a promising constituent in the more extensive "trustworthiness ecosystem" for AI.

Figure 1: The "Digital Trust Label," which we adopted as a certification label for AI. ©2023 Swiss Digital Initiative

Figure 2: Plots showing the individual scores for trust and willingness to use and their respective changes from T1 (without label) to T2 (with label). The plots also depict the medians, means, and distribution of the aggregated low- and high-stake scenarios. All comparisons revealed statistically significant differences.

Figure 3: Plots showing the different distributions for trust and willingness to use ratings for the different high-stake (hiring procedure, loan approval, medical diagnosis) and low-stake (music preference, price comparison, route planning) scenarios without a label at T1 and with a label at T2.

Table 1: End-users' attitudes toward certification labels
[...] would like to] find out what this "Fair User Interaction" means, what it refers to, how my data is protected . . . how is it designed and who monitors this label. Exactly by whom was it created and by whom it is administered, awarded and so on, that's what I would like to know."
"What you could include is a criterion for the AI. That an AI has been used enough times and has, for example, been 99% correct and always had the right answers, rather than 80%." (P4)
Lack of persuasiveness: "I think there are still a lot of people, or some people, who will be critical of these systems even though it has a label." (P3)
Inhibitors for effective certification labels
Overabundance of labels: "Because you can see that in the organic sector, there are now 20 labels and as a consumer you can almost no longer categorize them, so I think it's so important now that there is also Bio-Suisse [an organic label] or something like that in Switzerland, they have established themselves well, but I think you always have to stick to that as a label." (P6)
"Overlap; I think it all goes a bit in a similar direction, except maybe the last point [Fair User Interaction], which is a bit different again." (P10)