Farsight: Fostering Responsible AI Awareness During AI Application Prototyping

Prompt-based interfaces for Large Language Models (LLMs) have made prototyping and building AI-powered applications easier than ever before. However, identifying potential harms that may arise from AI applications remains a challenge, particularly during prompt-based prototyping. To address this, we present Farsight, a novel in situ interactive tool that helps people identify potential harms from the AI applications they are prototyping. Based on a user's prompt, Farsight highlights news articles about relevant AI incidents and allows users to explore and edit LLM-generated use cases, stakeholders, and harms. We report design insights from a co-design study with 10 AI prototypers and findings from a user study with 42 AI prototypers. After using Farsight, AI prototypers in our user study are better able to independently identify potential harms associated with a prompt and find our tool more useful and usable than existing resources. Their qualitative feedback also highlights that Farsight encourages them to focus on end-users and think beyond immediate harms. We discuss these findings and reflect on their implications for designing AI prototyping experiences that meaningfully engage with AI harms. Farsight is publicly accessible at: https://PAIR-code.github.io/farsight.

INTRODUCTION
As artificial intelligence (AI) becomes increasingly integrated into our everyday lives, mitigating the societal harms posed by AI technologies has never been more important. In response to the demand for accountable and safe AI, there have been growing efforts from both industry and academia towards responsible design and development of AI [143,183]. The majority of these endeavors focus on machine learning (ML) experts, such as ML developers and other AI practitioners. For example, researchers have introduced techniques that help ML developers interpret ML models [102,128,150] and assess model fairness [30,90,189]. Researchers have also proposed frameworks that target ML developers' workflows, such as improving data collection and annotation practices [14,118,123], documenting training data and models [43,63,122], and anticipating an ML product's potential for harm [46,120].
However, more recently, we have witnessed a rapid advancement of large language models (LLMs) such as Gemini [178], alongside the emergence of prompt-based interfaces like Google AI Studio [70], GPT Playground [133], AI Chains [204], and Wordflow [184] (Fig. 2B). These general-purpose models and easy-to-use interfaces have significantly increased access to the process of prototyping and building diverse AI-powered applications, leading to a paradigm shift in AI development workflows that poses unique challenges to responsible AI, including introducing new potential harms to avoid [190], as well as challenges applying existing responsible AI practices [98].
People who use prompts to create AI applications now span a broader spectrum of roles beyond traditional ML experts (Fig. 2A), such as designers, writers, lawyers, and everyday users [55,84,193,207], whereas existing responsible AI research often targets ML experts such as ML engineers and data scientists [78,198]. Many users of AI prompt-based prototyping interfaces [e.g., 70,133,184,204], or "AI prototypers" [cf. 84], do not have experience in AI or computer science, which can lead to challenges in anticipating the consequences of their AI applications [143], a difficult task even for computer science faculty and AI researchers [20,45]. Furthermore, LLMs demonstrate a wide range of capabilities that are continually being discovered across various contexts, including tasks such as summarization, classification, and translation [18,174]. This characteristic of LLMs gives rise to complex and uncertain impacts of LLM-powered applications [61], presenting a significant departure from the classical ML models targeted by existing responsible AI endeavors [98,190] and introducing a new layer of complexity for responsible AI researchers to help AI developers anticipate downstream consequences.
Farsight provides a range of in situ widgets for these tools, helping AI prototypers envision the potential harms of their AI applications during an early prototyping stage.
To help address these challenges in applying responsible AI practices to LLM-powered AI applications, we present Farsight (Fig. 1, Fig. 2B), an interactive tool to help AI prototypers identify potential harms of their LLM-powered applications (a key early step in harm prevention and mitigation [104,120,131,176,177]) during the prototyping stage. Using Farsight as a probe, we conduct multiple mixed-method user studies to investigate (1) how an early-stage intervention changes AI prototypers' awareness of and approach to identifying harms, (2) the effectiveness of our tool in helping people envision harms, and (3) the challenges AI prototypers face during this harm envisioning process. We contribute:
• Farsight, the first in situ interactive harm envisioning tool that empowers AI prototypers to identify potential harms that may arise from their prompt-based AI applications, directly within their prompting environments (Fig. 1, Fig. 2). Inspired by prior harm envisioning frameworks [24,46,120] and in situ security alert tools [109,125,147], Farsight overcomes unique design challenges identified from a literature review (§ 2) and a co-design user study with 10 AI prototypers (§ 3).
• Novel techniques and interactive system designs to foster responsible AI awareness among AI prototypers. Given a user's prompt, Farsight leverages embedding similarities to surface news articles about relevant AI incidents from the AI Incident Database [111] and uses LLMs to generate potential use cases, impacted stakeholders, and harms for AI prototypers to review, edit, and add to. Applying a progressive disclosure design [129], our tool fits into users' diverse prompting workflows. With a novel adaptation of node-link diagrams [146], Farsight enables users to interactively visualize, generate, and edit use cases, stakeholders, and harms (§ 4).
• Empirical findings about harm envisioning processes from a co-design study and an evaluation study. During our design of Farsight, we conducted a co-design study with 10 AI prototypers to evaluate our design ideas and generate new ideas (§ 3). After developing Farsight, we conducted an evaluation user study with 42 AI prototypers to examine the effectiveness of Farsight in aiding users to brainstorm harms and improving their ability to independently identify harms. Our mixed-method analysis highlights that, after using Farsight, AI prototypers are better able to independently identify potential harms that might arise from an application developed with a given prompt, and participants report that our tool is more useful and usable than existing resources. In particular, Farsight encourages users to shift their focus from the AI model to the end-users, providing them with a broader perspective to consider indirect stakeholders and cascading harms (§ 6).
• An open-source, web-based implementation that lowers the barrier to applying responsible AI practices. We develop Farsight with cutting-edge web technologies, such as Web Components [115] and WebGL [114], so that it can be easily integrated into any web-based prompt development environment, such as Google AI Studio and Jupyter Notebook (Fig. 3). We open-source Farsight¹ as a collection of reusable interactive components that future researchers and designers can easily adopt (§ 4.4). To see a demo video of Farsight, visit https://youtu.be/BlSFbGkOlHk.
Fig. 3: Farsight fits into AI prototypers' diverse prompting workflows including prompting GUIs and computational notebooks. For example, (A) when an AI prototyper writes prompts for a therapy chatbot in Google AI Studio [70], Farsight's Chrome extension alerts the user about related incidents and potential harms. (B) When an AI prototyper writes prompts for a toxicity classifier in Jupyter Notebook [91,185], Farsight's Python library shows potential negative consequences of this classifier.

RELATED WORK

Anticipating Technology's Negative Impacts
Various design methods and approaches have been developed to support ideation about potential downstream impacts of technology, including anticipatory tech ethics [22,126], speculative design [10,49,197], and value-sensitive design [57,59,60], among others. To support designers with this, prior research has developed design toolkits [e.g., 29] and resources, such as Envisioning Cards [58], Value Cards [165], Timelines [199], and the Black Mirror Writers' Room [89], among others [e.g., 11,46]. Such resources are intended to be used by designers of technology early in the design process, but they may not fit neatly into existing product design and development processes, particularly for AI-powered application design paradigms, where large pre-trained models are used for many downstream tasks [183].
¹ Farsight code: https://github.com/PAIR-code/farsight
In addition to technology designers, computing researchers have called for the computer science field to consider the negative impacts of their work in addition to the positive impacts [76]. In AI research, conferences such as NeurIPS have begun requiring that researchers articulate potential negative broader impacts of their work in statements at the ends of their papers [140] to avoid the "failures of imagination" [20] that may lead to downstream harms. Prior work analyzed these broader impacts statements, finding convergence around a set of topics such as risks to privacy and bias, but often lacking concrete specifics or strategies for mitigation [8,99,127,167]. However, prior work suggests that many CS researchers may not have the training, resources, or inclination to engage in this type of anticipatory work [45,175], suggesting that new tools, training, and processes are needed to support researchers and developers in engaging in anticipatory work in ways that are integrated into their research practices. More recently, researchers have proposed a framework that uses LLMs to anticipate harms for classifiers by generating stakeholders and vignettes for a given scenario [24], evaluating this framework through interviews with responsible AI researchers. Farsight builds upon this framework and extends it to (1) target an early prototyping stage through in situ and interactive interfaces that promote user engagement in the harm envisioning process, (2) support LLM-powered applications with diverse tasks beyond classification, and (3) evaluate its effectiveness through a user study with 42 AI prototypers.

Identifying and Mitigating LLM Harms
More recently, there has been a growing body of research that specifically focuses on identifying and mitigating the harms of LLMs. Researchers have introduced harm taxonomies specifically for LLMs, which identify known risks (i.e., informed by observed instances of harm) [18,100,190] and emerging risks of LLMs (anticipated risks based on foreseeable capabilities of LLMs) [108,166]. Since LLMs can be used for a wide range of tasks associated with many different categories of harms, researchers have presented frameworks and evaluation methods to assess a particular type of LLM harm, including misinformation [74,135], representation and toxicity [42,64], human autonomy [65,168], malicious use [38,154], and data privacy [87,97]. The popular methods to identify these harms include benchmarking [27,28], user research [101,106], and adversarial testing [41,137]. Based on existing benchmarks and harm taxonomies of LLM risks, Weidinger et al. [191] introduce a sociotechnical evaluation framework that identifies three AI actors with LLM safety responsibilities: AI model developers, AI application developers, and third-party stakeholders.
The mitigation strategies for these harms depend on the use cases and context. Popular strategies include algorithmic and sociotechnical approaches [192], such as improving the training data to mitigate social stereotypes and biases [173]; fine-tuning LLMs on curated datasets [64]; filtering LLM outputs [194,205]; employing special decoding techniques [93,158]; adding instructions in prompts [9]; monitoring the use of LLMs [192]; as well as inclusive product design and development from the beginning [34,36,75,83]. Building on this prior work, Farsight introduces a novel framework that leverages human-AI collaboration to help AI prototypers identify the potential harms of LLMs. Specifically targeting AI prototypers as one subset of AI application developers [183,191], Farsight introduces novel techniques and in situ interfaces to foster responsible AI awareness during AI prototyping, although the current version of Farsight does not assist AI prototypers in mitigating potential LLM harms.
To address these challenges and facilitate the adoption of responsible AI practices, researchers have proposed several approaches. These include incorporating ethics into AI education [56,165,172], providing engaging playbooks or design activities [11,79,206], and fostering ethical norms in AI research and development [99,142,171]. In addition, researchers have also proposed a wide range of tools to operationalize responsible AI practices [95,198]. These tools encompass libraries and frameworks that cover various dimensions of responsible AI, including fairness [13,155,189], explainability [128,150], testing and error analysis [119,151,182], and model development documentation [63,122,142].
Moreover, alongside these advancements, there has been a rise in the research and development of easy-to-use interactive visualization tools to further facilitate the operationalization of responsible AI. For example, tools like What-If Tool [195], FairVis [25], and Visual Auditor [124] enable ML developers to visually assess the fairness of ML models across a diverse range of inputs. Visual analytics systems such as Summit [77], LIT [179], and GAM Changer [187] empower ML developers to interpret their models and fix problematic behaviors. Interactive visual testing tools like Errudite [203], Angler [153], and AdaTest [149] help ML developers surface weaknesses in their models.
Inspired by these tools, Farsight joins the body of research of interactive visualization tools for responsible AI by visualizing use cases, stakeholders, and harms (§ 4.3). In contrast to existing tools that target traditional ML models after they have been trained, Farsight focuses on diverse LLM-powered applications in an early prototyping stage. During this stage, AI prototypers have greater flexibility to iterate on the design and objectives of their applications and implement early mitigation strategies such as engaging with stakeholders and improving data collection [3].

In Situ Alerting Tools
Although in situ responsible AI tools are relatively nascent, there is a large body of research in designing in-context warning tools and interfaces. For example, security and HCI researchers study how to best present warnings to raise people's online security awareness [e.g., 109,125,147] and protect people from malware and phishing attacks [e.g., 51,53,145]. The key challenges when designing effective warning interfaces include the presentation of comprehensible messages and supporting evidence [15,54], engaging users [50,202], and preventing alert fatigue and habituation [5,17]. To address these challenges, researchers recommend designing simple interfaces [66,67], considering the trade-off between blocking and non-blocking warnings [50], varying interfaces [5], and requiring user input [23].
Using in-context warnings to improve users' safety awareness and encourage users to take protection measures can be considered a form of "digital nudging" [26,160]. More recently, researchers have also adapted in-context security warnings to nudge social media users to recognize and avoid online disinformation [85,163] and reflect before posting potentially harmful content [88,169,200]. Beyond platform-initiated integration of warnings, end-users also voluntarily seek in-context alert interfaces for productivity improvement. For example, writers use grammar checker tools like Grammarly [73], which offer in-context warnings and scores to improve their writing. Similarly, software developers use accessibility developer tools [40,69] to detect potential accessibility issues during the development process. However, there has been little work in designing and evaluating in situ warnings for developing AI applications, particularly for responsible AI. Farsight's design draws inspiration from many existing warning interfaces (§ 3). Our work advances the landscape of in situ alerting research by addressing responsible AI for modern AI application development.

FORMATIVE STUDY & DESIGN GOALS
To identify the needs and potential challenges faced by users in envisioning harms, we conducted a formative co-design study to investigate (1) how AI prototypers envision harms (if they do), (2) what design ideas are most helpful for them, and (3) how to motivate users to think about potential risks when prototyping an AI application. In this section, we report our findings from the formative co-design study, and in § 6, we report on our findings from a subsequent evaluation user study.

Co-design Study
Participants. To inform our tool's design, we conducted a co-design user study with 10 AI prototypers based in the United States. These participants were recruited from Google through internal mailing lists. Our recruitment criteria required participants to have experience using an internal prompt-crafting tool, PromptMaker [84], which is similar to Google AI Studio [70] and GPT Playground [133]. Each session was 60 minutes, and each participant received an average of $50 USD in their choice of a gift card or a donation to their preferred charity. Among the 10 participants (U1-U10), 6 identified as men, 3 identified as women, and 1 identified as non-binary. Four participants self-reported having expertise in responsible AI. Information about participants' job roles is listed in Table 1. All participants are our targeted users (AI prototypers).
Procedure. We structured our study as a "during-design co-design study" [156]. Participants were asked to bring a recent prompt that they had written to the study. The study started with a semi-structured interview regarding participants' prompting workflows and their experience in thinking about potential harms linked to their applications (§ A.2). Then participants were asked to use our very early-stage design prototypes (§ A.2) to envision potential harms associated with their application while thinking aloud. Participants were also presented with low-fidelity sketches for our other design ideas. These prototypes and sketches can be found in Fig. S1. Finally, we asked participants to rate and provide feedback on all of our design ideas and generate their own design suggestions (§ A.3).
Design feedback. Interestingly, although perhaps not surprisingly [cf. 78], none of the 6 participants without expertise in responsible AI reported that they typically considered the potential harms of their AI prototypes when writing prompts, while 3 of the 4 participants with expertise in responsible AI did report typically anticipating harms during the prototyping process. Participants' ratings are shown in Fig. 4. Overall, participants favored using AI to generate use cases of their AI prototypes, potential stakeholders, and potential harms. Many participants also highlighted the importance of being able to edit AI-generated content and control the generation direction (U4, U8). On the other hand, participants were less in favor of more distracting design ideas (e.g., an anthropomorphized assistant tool similar to Microsoft Office's Clippy) or irrelevant content (e.g., the latest, rather than the most relevant AI incidents). Participants also provided us with helpful usability feedback that we integrated into our final design of Farsight (§ 4).
Fig. 4: Average ratings on our design ideas from 10 AI prototypers. Marked features were presented to participants as early-stage prototypes, while other features were presented as sketches (see details in Fig. S1).
New design ideas. Participants generated many interesting design ideas to help raise responsible AI awareness among AI prototypers. For example, participants recommended categorizing AI-generated harms (U1, U5), allowing users to rate the severity of harms (U6), and using users' input to steer AI generation (U10). We integrated these design ideas into the final design of Farsight (§ 4). Some other interesting design ideas include designing a game-like reward system to incentivize users to anticipate harms (U5), building online communities to allow users to share their envisioned harms using Farsight and seek support (U2), allowing real-time collaborative harm envisioning similar to Google Slides (U1, U4), and automatically revising a user's prompt to address identified harms (U4). We discuss the implications of these design ideas in user motivation (§ 7.1) and mitigation strategies (§ 7.3).

Design Goals
Based on our literature review (§ 2) and findings and early feedback from the co-design user study, we identify the following five design goals (G1-G5) for Farsight.
G1. Guide users in imagining use cases. Existing research highlights the challenges faced by ML practitioners when attempting to anticipate the uses of their ML-powered applications and how different individuals or groups may be affected [20,45,103,171]. Confirming this, software engineer U6 noted "You don't really know how your tool could be used, so it's really hard to envision what harms would be." The availability of LLMs and prompt-crafting tools has broadened the spectrum of AI prototypers to include people without prior technology development experience [55,84], which can further magnify the challenges associated with envisioning diverse use cases for AI applications. Therefore, we design Farsight to help AI prototypers with diverse backgrounds to brainstorm a wide array of use cases for their AI applications.
G2. Depending on an AI application's goal, implementation, and context, some harms are more salient than others [24,121]. To help AI prototypers assess harms, Farsight should first help them understand where and how harms might occur and who might be impacted, by connecting harms to use cases and stakeholders [12,103,120]. Participants expressed a desire for the ability to categorize (U1, U5) and rate the severity (U6) of harms. To meet these needs, we aim to design an easy-to-use interface that empowers users to navigate, comprehend, and label harms within diverse potential harm scenarios.
G3. Fit into current workflows and mitigate habituation. In our co-design study, none of the 6 participants without expertise in responsible AI had previously thought about harms when writing prompts. We also found some participants were not incentivized to anticipate harm on their own; for example, U6 explained "To be honest, as a software engineer, I don't use policy tools [compliance tools like checklists] unless I have to." Thus, to make Farsight easy to adopt, we aim to take inspiration from in situ warning tools [e.g., 51,53,145] to design it in a way that fits into AI prototypers' existing workflows instead of introducing new workflows. In addition, we aim to apply strategies like varying content [5] and promoting user input [23] to mitigate habituation, a common pitfall of in-context warning designs [5,17].
G4. Promote user engagement and provide compelling examples. Prior research highlights that the effectiveness of warning tools depends on their clarity and persuasiveness [15,54]. Participants like being able to control the harm envisioning process (Fig. 4), and active participation is a key factor in learning [92], which is essential to fostering AI prototypers' ability to independently identify harms. Thus, Farsight is designed to provide users with human agency and encourage users to actively and critically think about harms.
G5. Open-source and adaptable implementation. Given the ever-expanding array of LLMs and prompt-crafting tools [31], our approach in designing Farsight is to ensure it remains adaptable to this dynamically evolving landscape. We aimed to design Farsight to be model-agnostic and environment-agnostic, thereby making it accessible to users of different LLMs (e.g., Gemini [178], GPT-4 [132], Llama 2 [180]) and prompt-crafting interfaces (e.g., GPT Playground [133], Google AI Studio [70], Wordflow [184]). Furthermore, we open-source our implementation to foster future advancements in the design, research, and development of responsible AI tools.

USER INTERFACE
Following the five design goals (G1-G5), we present Farsight, the first in situ interactive tool that aims to foster responsible AI awareness among AI prototypers during the AI prototyping process. Farsight is designed to be a plugin for any web-based prompt-crafting tool. Farsight's interface employs progressive disclosure [129], enabling users to smoothly transition across three main components, with each phase increasing the level of user engagement.

Alert Symbol
The Alert Symbol is an always-on display on top of the AI prototyping tool, displaying the alert level of a user's prompt (Fig. 5). Every time the user runs their prompt, the Alert Symbol updates the alert level using the new prompt. Based on the computed alert level, there are three modes (Fig. 5), each characterized by a progressively more attention-grabbing style. Thus, Farsight only disrupts AI prototypers' flow when their prompts require more caution (G3).
Calculating the Alert Level. Auditing and quantifying the societal risk of LLM-powered applications is an open research problem [144]. To categorize the potential harms that might arise from users' prompts, we propose a novel technique that uses the similarity between the prompt and previously documented AI incident reports as a proxy for the prompt's alert level. First, we use an LLM to extract high-dimensional latent representations (embeddings) of all AI incident reports indexed in the AI Incident Database [111], which includes more than 3k community-curated news reports about AI failures and harms. Then, we extract the embedding of the user's prompt and compute pairwise cosine distances between the prompt embedding and AI incident report embeddings. We assign each incident report one of three relevancy levels based on two distance thresholds, 0.69 and 0.75. We determine these two thresholds from an experiment with 1k random prompts (see § B.1 and Fig. S2 for details). Researchers can easily adjust these two thresholds (between 0 and 1) to calibrate an article's relevancy.
Finally, we show the numbers of AI incidents classified at the two more relevant levels, one count in an orange circle and one in a red circle (Fig. 5), as a proxy of the prompt's potential risk. In other words, we consider a prompt to have a higher risk if many AI incident reports are semantically and syntactically similar to it.
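The alert-level computation described above can be illustrated with a minimal sketch. It is not Farsight's implementation: the function names, the level names (`high`, `moderate`, `low`), and the assumption that a smaller cosine distance maps to a more relevant level are all illustrative choices. Only the two thresholds, 0.69 and 0.75, come from the text, and the toy two-dimensional vectors stand in for real LLM embeddings.

```python
import math
from collections import Counter

# The two distance thresholds reported in the text.
LOW_THRESHOLD = 0.69
HIGH_THRESHOLD = 0.75


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def relevancy_level(distance):
    """Map a distance to a level (assumption: closer means more relevant)."""
    if distance < LOW_THRESHOLD:
        return "high"
    if distance < HIGH_THRESHOLD:
        return "moderate"
    return "low"


def alert_counts(prompt_embedding, incident_embeddings):
    """Count incident reports at the two more relevant levels."""
    levels = Counter(
        relevancy_level(cosine_distance(prompt_embedding, embedding))
        for embedding in incident_embeddings
    )
    return {"high": levels["high"], "moderate": levels["moderate"]}


# Toy example with hypothetical 2-D "embeddings".
prompt = [1.0, 0.0]
incidents = [[1.0, 0.0], [0.3, 0.954], [0.0, 1.0]]
counts = alert_counts(prompt, incidents)
```

In this sketch the two counts play the role of the numbers shown in the Alert Symbol, and moving either threshold changes how many reports fall into each level.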

Awareness Sidebar
After a user clicks the Alert Symbol, the Awareness Sidebar (Fig. 6) expands from one side edge of the AI prototyping tool (G3), highlighting potential consequences of AI applications or features that are based on the user's current prompt. We use a real prompt from Awesome ChatGPT Prompts [1] in the example in Fig. 6.
Incident Panel. To encourage users to consider potential risks associated with their prompts (Fig. 6A), the Incident Panel highlights news headlines of AI incidents that are relevant to the user's prompt (Fig. 6-B2). These incidents comprise the top 30 incident reports that are classified at the two more relevant levels, sorted in reverse order based on their embedding's cosine distances to the embedding of the user's prompt. The thumbnails are color-coded based on the incident's relevancy level. Users can click the headline or the thumbnail to open the original incident report in a new tab. These real AI incidents can function as cautionary tales [103,199] reminding users of potential AI harms (G4).
Use Case Panel. To help users imagine how their AI prototype may be used in AI applications or features (G1), the Use Case Panel (Fig. 6C) presents a diverse set of potential use cases that are generated by an LLM. Each use case is shown as a sentence describing how a particular group of end-users could use this AI application in a specific context. For example, for a writing tutor prompt, a potential use case can be "teachers use it to provide feedback on student writing" (Fig. 6-C1). We also use an LLM to generate a potential harm that could occur within that use case, shown below the use case sentence. For example, a harm for the teacher feedback use case can be "students may feel like they are not getting personalized feedback from their teachers." We use few-shot learning to prompt the LLM to generate use cases and harms, whereas we generate use cases, stakeholders, and harms in Harm Envisioner (§ 4.3). We open-source all of our prompts.
To help users assess and organize use cases and harms (G2), we also leverage an LLM to assign each use case to one of three categories, although we acknowledge that these may vary by use cases, development and deployment contexts, as well as relevant policies or regulatory frameworks in various jurisdictions. These three categories are introduced by responsible AI researchers to help ML developers structure their harm envisioning process [121]. The first category covers use cases that align with the development target use cases. The second category encompasses use cases that may arise in high-stakes domains, such as medicine, finance, and the law. The third category includes scenarios where malicious actors exploit the AI application to cause harms. The Use Case Panel organizes use cases and harms into three tabs (Fig. 6-C1-3) based on their categories. The first tab, "mix", provides an overview by showing one use case and its corresponding harm from each of the other tabs.
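As a rough illustration of the tab organization described above, the sketch below groups categorized use cases into tabs and derives the "mix" overview. The category keys (`intended`, `high_stakes`, `misuse`) are placeholder names assumed for this example, and `build_tabs` is a hypothetical helper rather than part of Farsight. The first use case and harm are quoted from the text; the others are invented for the example.

```python
# Placeholder category keys assumed for illustration only.
CATEGORIES = ("intended", "high_stakes", "misuse")


def build_tabs(use_cases):
    """Group categorized use cases into tabs and add a 'mix' overview tab."""
    tabs = {category: [] for category in CATEGORIES}
    for item in use_cases:
        tabs[item["category"]].append(item)
    # The overview shows one use case (with its harm) from each non-empty tab.
    mix = [tabs[category][0] for category in CATEGORIES if tabs[category]]
    return {"mix": mix, **tabs}


use_cases = [
    {
        "category": "intended",
        "use_case": "teachers use it to provide feedback on student writing",
        "harm": "students may feel like they are not getting personalized "
                "feedback from their teachers",
    },
    {
        # Hypothetical example, not taken from the paper.
        "category": "misuse",
        "use_case": "students use it to rewrite essays written by others",
        "harm": "teachers may be unable to assess a student's own writing",
    },
    {
        # Hypothetical example, not taken from the paper.
        "category": "intended",
        "use_case": "writers use it to polish drafts",
        "harm": "writers may lose their personal voice",
    },
]

tabs = build_tabs(use_cases)
```

Taking the first item per category is one simple choice for the overview; the text only states that the "mix" tab shows one use case and its harm from each of the other tabs.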

Harm Envisioner
Both the Alert Symbol and the Awareness Sidebar provide easy-to-understand in-context reminders to help users reflect on potential harms associated with their prompts. However, instead of passively reading AI incident reports and LLM-generated content, users desire to actively edit and add new use cases, stakeholders, and harms (Fig. 4). Also, active participation, a key factor in learning, may help foster AI prototypers' ability to independently identify harms. Therefore, we design Harm Envisioner (Fig. 7) to support users in actively envisioning potential harms associated with their prompts (G4). We use a real prompt from Awesome ChatGPT Prompts [1] in the example in Fig. 7.
Interactive Node-link Tree Visualization. After clicking the "Envision Consequences & Harms" button in the Awareness Sidebar, Harm Envisioner appears as a pop-up window on top of the prompt-crafting tool (Fig. 7). It begins with a text box filled with an LLM-generated summary of a user's prompt (Fig. 7B). The user is prompted to revise the summary to align with the target task in their prompt. Next, the window transitions into an interactive node-link tree visualization [146], where the user can pan and zoom to navigate the view (Fig. 7C). First, the window shows the user's prompt summary as the root of the tree, visualized as a text box. When the user clicks the root node, the LLM generates potential use cases of an AI application based on the user's prompt, and these use cases are visualized as the root's child nodes. Similarly, users can click a generated node and the LLM will generate its child nodes (stakeholders and then harms). The tree has at most four layers, in the order user's prompt summary → use cases → stakeholders → harms. This layer order reflects the recommended harm envisioning workflow in responsible AI literature [12,46,103,120,121] and helps users comprehend and organize diverse harms across different contexts (G2). For additional examples of LLM-generated use cases, stakeholders, and harms in Farsight, see Table S1.
Human-AI Collaboration in Harm Envisioning. Our goal is to use AI-generated harms to encourage users to reflect on potential downstream harms and inspire them to add, edit, or curate potential harms (G4). To do that, Harm Envisioner allows users to edit any tree node by clicking a button in the toolbar (Fig. 7-C1) or entering new text in the tree node. In addition, users can delete nodes (Fig. 7-C2) and use the LLM to regenerate all of an edited node's child nodes, effectively steering the harm envisioning direction by offering feedback to the LLM (G4). To meet users' need to categorize harms (G2), we use an LLM to classify each harm into a harm type based on a systematic review and taxonomy of AI harms [164]. Users can use the dropdown menu to change the harm's category (Fig. 8). To help users prioritize and take notes about harms, the Harm Envisioner allows users to rate the severity of each harm by clicking in the toolbar. Finally, users can click to export all content (e.g., use cases, stakeholders, and harms) in the Harm Envisioner as a Markdown file.
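As a rough illustration of the four-layer tree described above, the structure could be modeled as follows. This is a minimal sketch and not Farsight's actual implementation: the class name `Node`, the `severity` field, the `to_markdown` helper, and the example texts are all hypothetical, and the LLM generation step is replaced by manually added children.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Layer order from the text: prompt summary -> use cases -> stakeholders -> harms.
LAYERS = ["summary", "use_case", "stakeholder", "harm"]


@dataclass
class Node:
    text: str
    layer: str
    severity: Optional[int] = None  # hypothetical slot for a user's severity rating
    children: List["Node"] = field(default_factory=list)

    def add_child(self, text: str) -> "Node":
        """Attach a child in the next layer; harm nodes are leaves."""
        depth = LAYERS.index(self.layer)
        if depth == len(LAYERS) - 1:
            raise ValueError("harm nodes cannot have children")
        child = Node(text=text, layer=LAYERS[depth + 1])
        self.children.append(child)
        return child


def to_markdown(node: Node, depth: int = 0) -> str:
    """Render the tree as a nested Markdown bullet list (one possible export format)."""
    suffix = f" (severity: {node.severity})" if node.severity is not None else ""
    lines = [f"{'  ' * depth}- {node.text}{suffix}"]
    for child in node.children:
        lines.append(to_markdown(child, depth + 1))
    return "\n".join(lines)


# Illustrative content only; in the tool, children would come from the LLM or the user.
root = Node("Give feedback on a student's essay", "summary")
use_case = root.add_child("Teachers use it to provide feedback on student writing")
stakeholder = use_case.add_child("Students")
harm = stakeholder.add_child("Students may not get personalized feedback")
harm.severity = 3
markdown = to_markdown(root)
```

Enforcing the layer order inside `add_child` keeps every harm attached to a stakeholder and every stakeholder attached to a use case, which mirrors the envisioning workflow the layer order is meant to support.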

Open-source and Reusable Implementation
To make Farsight easily adoptable by both AI prototypers and AI companies (G5), we implement Farsight to be model-agnostic and environment-agnostic, and we open-source our implementation. Farsight uses LLMs by calling their public APIs so that users can use their preferred LLMs by replacing the API endpoints. To help AI companies and researchers integrate Farsight into AI prototyping tools, we leverage Web Components [115] and Lit [68] to implement Farsight as reusable modules, which can be integrated into any web-based interface regardless of its development stack (e.g., React, Vue, Svelte). To help AI prototypers use our tool, we present a Chrome extension (https://github.com/PAIR-code/farsight/releases) that integrates Farsight into Google AI Studio and a Python package (https://pypi.org/project/farsight/) that brings Farsight to computational notebooks. We implement the interactive tree visualization using D3.js [19] and embedding similarity computation using TensorFlow.js [170] with WebGL [114] acceleration. Computational notebook support is implemented using NOVA [188].
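To make the embedding similarity step concrete, the sketch below ranks incident reports by their similarity to a prompt embedding. It assumes cosine similarity as the measure, which the text does not specify, and it uses plain Python with made-up three-dimensional vectors rather than TensorFlow.js and real embeddings; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_incidents(
    prompt_vec: Sequence[float],
    incidents: List[Tuple[str, Sequence[float]]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k incident titles with the highest similarity to the prompt."""
    scored = [(title, cosine_similarity(prompt_vec, vec)) for title, vec in incidents]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (invented for illustration).
prompt_embedding = [1.0, 0.0, 1.0]
incident_embeddings = [
    ("incident A", [1.0, 0.0, 1.0]),
    ("incident B", [0.0, 1.0, 0.0]),
    ("incident C", [1.0, 1.0, 0.0]),
]
top = rank_incidents(prompt_embedding, incident_embeddings)
```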

USAGE SCENARIO
We present a hypothetical usage scenario to illustrate how Farsight fosters responsible AI awareness among AI prototypers. Rosa is a native English speaker from the United States who recently traveled to Vietnam to teach English. She is the only English teacher at an under-resourced high school. Overwhelmed with grading English writing assignments for all students in the school, Rosa tries to develop an LLM-powered AI application that provides writing feedback based on a student's essay. She writes her prompt (Fig. 6A) in an AI prototyping tool with Farsight integrated. After running the prompt, Rosa notices the alarming Alert Symbol (Fig. 6A), so she clicks on it, which expands the Awareness Sidebar (Fig. 6-BC). Rosa reads a few related articles shown in the Incident Panel (Fig. 6-B2). She finds that these articles are indeed related to AI in education and are helpful, but they mainly focus on students using AI to cheat rather than teachers using AI to grade assignments. Rosa skims through the LLM-generated potential use cases and finds the use case "teachers use it to provide feedback on student writing" very relatable (Fig. 6-C1). Intrigued by its associated harm, "students may feel like they are not getting personalized feedback from their teachers", Rosa clicks the Envision Consequences button to learn more about this use case and its associated potential harms.
Harm envisioner. Next, Farsight shows a pop-up window asking Rosa to revise and confirm an LLM-generated summary of her prompt (Fig. 7-B). After clicking [icon], Rosa sees the Harm Envisioner presenting an interactive tree visualization showing the functionality of her AI application as a root node and multiple use cases as its child nodes (Fig. 7-C). With a map-like interface, Rosa quickly uses zoom-and-pan to zoom into the teaching use case. After clicking [icon], the Harm Envisioner quickly generates the stakeholders associated with the use case and the harms associated with each stakeholder. Rosa takes some time to reflect on the LLM-generated harm of students not getting personalized feedback (Fig. 7-Harm-1). She has never thought about this consequence before, but she thinks it makes sense: AI does not have background knowledge about each student, so its feedback would not be tailored to students' individual needs. After rating it as very severe by clicking [icon], Rosa continues reading other LLM-generated harms. She does not think the harm of teachers losing jobs to her AI tutor is relevant, so she deletes it (Fig. 7-C2).
Human-AI collaboration. After seeing the random question "increased labor?" next to teacher (Fig. 7-C3), Rosa thinks it may be more time-consuming to review AI-generated feedback than to grade students' assignments herself, so she enters that harm into the Harm Envisioner. Next, Rosa is not sure about the legal liability of her school (Fig. 7-Harm-3), but it might be worth discussing with other teachers. Finally, reflecting on her experience with the Harm Envisioner and AI incident articles, Rosa thinks the potential harms of her writing tutor AI application outweigh the potential convenience for her. Therefore, Rosa decides to stop prototyping this application. However, Rosa still sees value in leveraging LLMs in education, so she bookmarks related AI incident articles and clicks to download all the content in the Harm Envisioner as a Markdown file. She will bring these resources to discuss with her colleagues the next day.

EVALUATION USER STUDY
We conducted a user study to evaluate Farsight's effectiveness in helping AI prototypers anticipate the potential harms associated with AI features. In addition, we investigated how AI prototypers use Farsight during an early prototyping stage. To investigate the effect of user engagement in AI-assisted harm envisioning, we tested two variants of our tool: Farsight, including all components, and Farsight Lite, including only the Alert Symbol (Fig. 1-B) and the Awareness Sidebar (Fig. 1-C). In other words, Farsight Lite is a "subset" of Farsight. Farsight Lite only shows one direct stakeholder for each use case in the Use Case Panel, while Farsight allows users to interactively add more stakeholders, use cases, and harms in the Harm Envisioner (Fig. 1-A). The study included 42 AI prototypers with diverse roles who were recruited from a large technology company based in the United States. In this user study, we aimed to investigate the following three research questions:

Participants
We recruited 45 voluntary participants from both internal mailing lists related to AI and snowball sampling at Google, based in the United States. The recruitment required participants to have experience in writing prompts for LLMs. In total, we received 61 responses, and we selected 45 participants based on their schedule availability. We conducted pilot studies using the first three study sessions, which were not included in our data analysis. As a result, we had a total of 42 participants. Each study session was either 90 minutes (n = 28 sessions) or 60 minutes (n = 14 sessions), depending on the participants' availability. Participants in 90-minute sessions received an average of $62 USD in compensation, and participants in 60-minute sessions received an average of $41 USD, in their preferred form, such as gift cards or charity credits. Among the 42 participants, 26 identified as men, 14 as women, and 2 preferred not to disclose their gender. Information about their job roles is listed in Table 2. Recruited participants self-reported an average score of 2.55 for their knowledge and experience with responsible AI on a 5-point Likert scale (Fig. 10-top), where 1 represents "No experience" and 5 represents "Expert (I have helped others apply responsible AI practices)." In addition, participants self-reported an average score of 2.81 for experience with LLM prompting on a 5-point Likert scale (Fig. 10-bottom), where 1 represents "Beginner" and 5 represents "Expert." All participants were AI prototypers, Farsight's target users.

Study Design
We conducted this study with participants one-on-one. Out of 42 sessions, 2 were conducted in person, and 40 were conducted through video conferencing software due to office locations and participants' scheduling constraints. With the permission of all participants, we recorded the participants' audio and computer screen for subsequent analysis. To start, each participant signed a consent form and filled out a survey regarding their familiarity with responsible AI and LLM prompting (Fig. 10). Then, participants were randomly assigned to one of six conditions, taking into account their time availability: C_FE, C_LE, C_EF, C_EL, C_F, and C_L (Fig. 9). C stands for the study condition; C_FE means that participants used Farsight first and then Envisioning Guide, and C_L means that participants only used Farsight Lite. The other acronyms follow this same pattern. Sessions of C_F and C_L were scheduled for 60 minutes each, while the remaining sessions were allotted 90 minutes each. We assigned 7 participants to each condition, as this was the maximum number that allowed for an equal distribution of participants across all conditions, given the time constraints and the availability of the 61 individuals who signed up for the study.
Fig. 9: The evaluation study included six conditions with different variations of harm envisioning tools (Farsight, Farsight Lite, and the baseline Envisioning Guide). Participants were asked to envision potential harms associated with an AI feature (e.g., email summarizer) in each harm-envisioning activity (H1, H2, H3, and H4). Participants had access to a harm envisioning tool in H2 and H4. The duration of sessions involving H4 and interview 2 was 90 minutes, while all other sessions lasted 60 minutes. Participants were randomly assigned to a condition, taking into account their availability for study session duration.
Our study followed a mixed design that combines both between-subjects and within-subjects designs [161]. Each session included three or four harm-envisioning activities, denoted as H1, H2, H3, and H4 (§ 6.2.2), as well as one or two semi-structured interviews to collect participants' feedback (§ 6.2.3). In each harm-envisioning activity, participants were asked to envision potential harms associated with a particular AI feature while thinking aloud (Fig. 9). In H1 and H3, participants envisioned harms on their own, whereas in H2 and H4, they could use a harm envisioning tool we assigned them based on their study condition (e.g., Farsight, Farsight Lite, or Envisioning Guide). All collected harms were rated by seven researchers with experience in responsible AI evaluations, who assigned each potential harm numeric scores for likelihood and severity (§ 6.2.4). We compared the envisioned harms in H1 and H3 (between-subjects) to investigate how different tools affect users' ability and approach to anticipating harms (RQ1). We compared the envisioned harms in H2 and H4 (within-subjects) to assess the effectiveness of different tools in helping users envision harms (RQ2). Besides the quantitative data on the number and ratings of potential harms, we also collected qualitative data from think-aloud sessions and two interviews (RQ1-RQ3). We incorporated 60-minute sessions (C_F and C_L) into our study design due to challenges in recruiting participants available for a 90-minute duration.

Baseline Harm Envisioning Tool. To compare our work against current responsible AI workflows, we created a baseline intervention, Envisioning Guide: a combination of Microsoft's Harm Modeling Practice [120] and the Harm Taxonomy from Shelby et al. [164]. These two resources are the latest and the most representative resources designed to help practitioners envision harms. We combined them because (1) we aim to simulate the current practice where AI prototypers can choose from various existing harm envisioning tools, and (2) we do not intend to study the causal effects of any specific resource. We administered this intervention by providing a Google Doc containing a detailed table and information from these resources (Fig. 11). Both resources were designed to help technology developers and researchers anticipate and prevent negative societal impacts of their technology innovations.
Fig. 11: In the evaluation user study, we compared our tools against Envisioning Guide, a combination of existing harm envisioning resources. This Envisioning Guide was presented to participants as a Google Doc with three sections. (A) The harm modeling workflow table comes from Microsoft's Harm Modeling Practice [120], providing a four-step process to envision harms. (B) The harm modeling prompts from the Harm Modeling Practice [120] offer templates and questions to help users envision different stakeholders and use cases (not all content is displayed here). (C) The harm taxonomy [164] helps participants explore the space of potential harms by providing a comprehensive list of 20 harm categories organized into five themes (not all content is displayed here). Participants could click the icon to see the definition of each harm category.

Harm Envisioning Activities. Depending on the conditions, the study included three or four harm envisioning activities (H1-H4). Within each harm envisioning activity, participants were presented with a description of an AI feature and the prompt that generated that feature. We chose the four AI features (Fig. 9) based on a qualitative analysis of 100 randomly sampled internal prompts written by real AI prototypers. These four features are representative of popular LLM tasks (e.g., summarization, classification, and question answering) and comprehensible to participants with diverse roles. In H1 and H3, participants independently envisioned harms, whereas in H2 and H4, they were provided with a harm envisioning assistance tool (e.g., Farsight, Farsight Lite, or Envisioning Guide). To emulate AI prototyping workflows, we asked participants to perform simple prompt engineering tasks in H2 and H4 before envisioning potential harms of the presented AI features.
For each harm, participants were instructed to describe who would be affected (i.e., the stakeholders) and how the stakeholder might be harmed. We provided a harm example for a code generation AI feature: "App end-users might face financial loss due to AI-introduced software vulnerabilities." During the process, participants were asked to share their screens and verbalize their thoughts. They were also asked to enter their envisioned harms into a Google Doc table featuring a who column and a how column. Moreover, participants had the option to articulate the harm verbally, and we transcribed it into the table. At the end of each harm envisioning activity, we reviewed the table together with the participants to ensure the accuracy of the harm descriptions. Participants were instructed to achieve three objectives: (1) envision as many harms as possible; (2) envision the most likely harms; and (3) envision the most severe harms.
H1: Pre-task. To understand how participants independently envision potential harms before using the tool, as a baseline for RQ1, participants were asked to anticipate potential harms concerning an LLM-powered email summarizer on their own (Fig. 9). They received information about the AI functionality, "Shorten and improve a user's email", a development context, and a prompt that enables this functionality (see details in § C.1). The duration of this activity was limited to 10 minutes.
H2: Intervention. In the second harm envisioning activity, we asked participants to use different harm envisioning assistance tools. Depending on the assigned condition, a participant could use Farsight (C_FE, C_F), Farsight Lite (C_LE, C_L), or Envisioning Guide (C_EF, C_EL) to help them anticipate harms. The activity began with a tutorial on the designated tool. The AI feature used in this activity was an LLM-powered toxicity classifier (Fig. 9). Participants received information regarding the AI functionality, "Detect toxic text content," a development context, and a prompt that enables this AI functionality (§ C.2). To emulate AI prototyping workflows, we also tasked participants with a simple prompt engineering assignment (§ C.2).
After completing prompt engineering, participants envisioned harms linked to the toxicity classifier. They were instructed to freely use the assigned tools while sharing their screens and thinking aloud. For participants assigned Envisioning Guide (C_EF, C_EL), the process of entering envisioned harms was the same as in H1. Participants assigned Farsight (C_FE, C_F) or Farsight Lite (C_LE, C_L) could click a button in the tools to export all harms as a text file. The export included both AI-generated harms and harms added or modified by participants. Participants were asked to copy the harms into the Google Doc. As a significant portion of these harms were generated by AI, we asked participants to select harms that (1) they agreed with and (2) would report to their colleagues and managers. Participants were also welcome to add more harms to the table. For our analysis, we only included the exported harms that participants had selected and added to the table. The duration of this activity was limited to 25 minutes.
H3: Post-task. To understand how the intervention may have affected participants' ability to independently envision harms (RQ1), we asked participants to envision harms associated with an LLM-powered article summarizer on their own (Fig. 9). To ensure a valid comparison between the envisioned harms and participants' approaches to the pre-task (H1), we introduced a parallel AI summarizer feature in this activity that was isomorphic to the pre-task [139]. In particular, to deter participants from directly reusing their envisioned harms from H1, we replaced the email summarizer in H1 with an article summarizer. The AI functionality was described as "Summarize an article in one sentence". The development context and prompt are available in § C.3. The duration of this activity was limited to 10 minutes.
H4: Alternative. To assess the effectiveness and usefulness of Farsight and Farsight Lite in comparison to Envisioning Guide (RQ2) and study the usage patterns of different tools (RQ3), n = 28 participants engaged in 90-minute sessions (C_FE, C_LE, C_EF, and C_EL) to envision harms using a tool different from the one used in H2 (Fig. 9). Participants were asked to envision potential harms associated with an LLM-powered math tutor app with the AI functionality "Answer math-related questions", a development context, and a prompt (§ C.4). The procedure for this activity paralleled H2, including a tutorial, a prompt engineering exercise (§ C.4), and harm envisioning. This activity's duration was limited to 25 minutes.

Semi-structured Interviews. This study included two semi-structured interview sessions (Fig. 9). The first interview took place after the post-task activity (H3), where we asked participants to reflect on their process for anticipating potential harms during the LLM prototyping process, and how their approach may have changed after the intervention (RQ1). We also asked participants about their challenges in harm anticipation, their experiences of using harm envisioning tools, and potential actions they would take to address the identified harms (RQ3, § D.1). After participants in 90-minute sessions (C_FE, C_LE, C_EF, and C_EL) finished H4, we asked them to compare and rate the usefulness and usability of the two tools they had used in this study (RQ2). We also asked them to rate the helpfulness of different components in the tools on a 5-point Likert scale, as detailed in § D.2.

Harm Rating. After completing all 42 study sessions, we recruited seven raters to rate all 989 harms collected in H1-H4 to evaluate participants' ability to envision harms. These seven raters included four of the paper authors and three industry researchers; all raters had experience with responsible AI (unlike many of the participants), either as responsible AI researchers, developers of responsible AI tools or playbooks, or in a consultant role on responsible AI for product teams. Ideally, evaluations of identified harms would involve domain experts for the domain in question (e.g., education) and/or stakeholders from demographic groups or communities who may be likely to experience those harms. For this preliminary study, due to timing and resource constraints, we recruited responsible AI researchers as raters instead of specific domain experts or people impacted by AI applications. The limitations of this approach are further discussed in § 6.7 and § 7.2.
Our collected harms were either (1) directly envisioned by participants or (2) exported from Farsight or Farsight Lite and subsequently curated by participants during H2 and H4. Each harm included the impacted stakeholders and a description of the harm. After removing duplicates and randomly shuffling the harms, we assigned them randomly and evenly to raters in a spreadsheet. Raters had access to the details of the intended AI feature of each harm, including the prompt and the context of the AI feature (e.g., the prompt and context in § C.1). To prevent the raters from being influenced by our hypotheses, we did not include the experimental conditions in the rating sheet. In other words, raters did not know if a harm was from a Farsight user, a Farsight Lite user, or an Envisioning Guide user. To mitigate rating noise, we designated three raters for each harm. As identifying likely and severe harms is often an objective in AI harm envisioning exercises [120,142], we asked raters to rate each harm's likelihood and severity on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree with the statements "This harm is likely to occur for this stakeholder" and "This harm will severely impact this stakeholder"). Raters could also choose an N/A option if they perceived a rating was not applicable for that feature or use case. During data analysis, we converted these four categories into ordinal scores (1, 2, 3, 4) and removed N/As. See Table S2 for a random subset of harms that were collected from participants and their corresponding ratings.
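The per-harm score computation described above (ordinal coding, dropping N/As, averaging over the assigned raters) can be sketched as follows. The mapping direction in `SCORE` is an assumption: the text gives the scores 1 to 4 but does not state which label maps to which end of the scale. The function name `harm_score` and the example ratings are hypothetical.

```python
from statistics import mean
from typing import List, Optional

# Assumed coding: higher score = stronger agreement with the rating statement.
SCORE = {"strongly disagree": 1, "disagree": 2, "agree": 3, "strongly agree": 4}


def harm_score(ratings: List[str]) -> Optional[float]:
    """Average the ordinal scores of one harm's raters, ignoring N/A responses."""
    numeric = [SCORE[r] for r in ratings if r != "N/A"]
    return mean(numeric) if numeric else None


# Made-up ratings from three raters for a single harm.
likelihood = harm_score(["agree", "strongly agree", "N/A"])
severity = harm_score(["disagree", "agree", "agree"])
```

Returning `None` when every rater chose N/A keeps such harms out of downstream averages instead of silently treating them as zero.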

Data Analysis
We applied a mixed-methods approach for data analysis. First, we conducted a quantitative analysis (§ 6.3.1) of the changes in participants' ability to envision harms by comparing pre-task H1 to post-task H3 responses (RQ1). We also quantitatively assessed the three tools' effectiveness in helping users anticipate harms by comparing H2 and H4 responses (RQ2). The quantitative analyses involved metrics such as the total number of envisioned harms, as well as the average likelihood and severity ratings of envisioned harms across 3 raters. Next, we performed a qualitative analysis (§ 6.3.2) of transcripts from think-aloud sessions and interviews to further investigate participants' strategies and challenges in envisioning harms, and usage patterns of different tools (RQ1-RQ3).
6.3.1 Quantitative Analysis. We first conducted quantitative analyses on the count, likelihood, and severity of harms across different conditions to evaluate the effectiveness of our tools (RQ1, RQ2). We measured the likelihood and severity for each harm using the average of ratings from three raters after removing any N/As. The average pairwise weighted Cohen's kappas [32,112] for likelihood and severity ratings are 0.14 and 0.09 (see Fig. S3 and § E for details). These values fall within the range of slight agreement [94]. We discuss this relatively low inter-rater agreement in § 7.2. The Shapiro-Wilk normality tests [162] show that all measures, except for the changes of harm count between H1 and H3 with Envisioning Guide, follow a normal distribution. We used t-tests with Bonferroni corrections for multiple hypothesis testing.
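For readers unfamiliar with the agreement statistic, the sketch below computes a weighted Cohen's kappa for each pair of raters and averages the results. Two details are assumptions because the text does not specify them: the weights are linear (quadratic weights are also common), and each pair of raters is compared only on the harms both of them rated. The function names and the toy ratings are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def weighted_kappa(a: Sequence[int], b: Sequence[int], categories: Sequence[int]) -> float:
    """Linearly weighted Cohen's kappa for two raters scoring the same items."""
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagree_obs = 0.0
    disagree_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # linear disagreement weight
            disagree_obs += w * observed[i][j]
            disagree_exp += w * row[i] * col[j]
    if disagree_exp == 0.0:
        return float("nan")  # undefined when both raters always use one identical category
    return 1.0 - disagree_obs / disagree_exp


def mean_pairwise_kappa(ratings: Dict[str, Dict[str, int]], categories: Sequence[int]) -> float:
    """ratings[rater][harm_id] = score; each pair is compared on shared harms only."""
    kappas: List[float] = []
    for r1, r2 in combinations(sorted(ratings), 2):
        shared = sorted(set(ratings[r1]) & set(ratings[r2]))
        if len(shared) < 2:
            continue
        kappas.append(
            weighted_kappa(
                [ratings[r1][h] for h in shared],
                [ratings[r2][h] for h in shared],
                categories,
            )
        )
    return sum(kappas) / len(kappas)


# Toy ratings (invented): r1 and r2 agree perfectly, r3 mostly disagrees with both.
toy = {
    "r1": {"h1": 1, "h2": 2, "h3": 3, "h4": 4},
    "r2": {"h1": 1, "h2": 2, "h3": 3, "h4": 4},
    "r3": {"h1": 4, "h2": 4, "h3": 1, "h4": 1},
}
average_kappa = mean_pairwise_kappa(toy, [1, 2, 3, 4])
```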
We also analyzed participants' ratings of the tools' usefulness and usability when comparing the two tools used in the study (RQ2, § 6.5.3). We converted the 5-point Likert scale ratings into numerical values and assessed the difference between ratings of our tools and Envisioning Guide using Mann-Whitney U tests [107]. Considering that most of the ratings did not exhibit a normal distribution, we chose to use Mann-Whitney U tests, as these tests do not assume normality in the data. See § 6.5.3 for discussion of the findings from these questions about usefulness and usability.
Fig. 12: To evaluate how different interventions (Farsight, Farsight Lite, Envisioning Guide) affect users' ability to envision harms independently, we conducted one-sample t-tests with Bonferroni correction to examine the difference in the (A) count, (B) average likelihood, and (C) average severity of participant-identified harms between H3 and H1. Each intervention had n = 14 participants, represented by 14 points on the chart. The charts also show the 95% confidence intervals, adjusted with Bonferroni correction. The results highlight that after using Farsight and Farsight Lite, users could anticipate a significantly higher number of harms, while the average likelihood and severity of identified harms remained the same.
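As an illustration of the rank-based comparison, the sketch below computes the Mann-Whitney U statistic for two groups of Likert ratings, using average ranks for ties (which are frequent with 5-point scales). It covers only the statistic, not the p-value, and the function name and example ratings are hypothetical rather than taken from the study data.

```python
from typing import List, Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U_x, U_y), assigning tied values the average of their ranks."""
    pooled = sorted((v, g) for g, values in enumerate((x, y)) for v in values)
    values = [v for v, _ in pooled]
    ranks: List[float] = [0.0] * len(pooled)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[j + 1] == values[i]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n_x, n_y = len(x), len(y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    return u_x, n_x * n_y - u_x


# Made-up 5-point usefulness ratings for two tools.
tool_ratings = [5, 4, 4]
guide_ratings = [3, 2, 4]
u_tool, u_guide = mann_whitney_u(tool_ratings, guide_ratings)
```

Here `u_tool` counts the pairs in which a tool rating exceeds a guide rating, with ties counted as one half, so the two statistics always sum to the number of pairs.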

Qualitative Analysis. We conducted a qualitative analysis on the screen recordings and transcripts of the study sessions that include participants' verbalized thoughts during the harm envisioning activities (H1-H4) and interviews. All study sessions were screen-recorded and audio-recorded, with the audio subsequently transcribed by the video conferencing software. We adopted an inductive thematic analysis approach [21,116] and open coded the 56-hour-long transcripts using the qualitative analysis software Dovetail [47]. After generating a codebook, we applied deductive coding [116] to assign harm envisioning patterns to each participant during H1 and H3 (RQ1, § 6.4.2).

Findings: Changes in Users' Envisioning Ability and Approach (RQ1)
In the study, participants were asked to independently envision harms associated with an email summarizer (H1) and an article summarizer (H3) before and after using a harm envisioning tool (Farsight, Farsight Lite, or Envisioning Guide) to anticipate harms for a toxicity classifier (H2). We quantitatively and qualitatively compared participants' envisioned harms and approaches in H1 and H3 across different conditions in H2.

Farsight and Farsight Lite Improved Users' Ability to Envision Harms. For each participant, we compared the count, average likelihood, and average severity of their independently envisioned harms before (H1) and after (H3) the intervention (Fig. 12). Using paired sample t-tests with Bonferroni correction [48], we found that after using Farsight and Farsight Lite, users could envision significantly more harms on their own (p = 0.0028, p = 0.0003), showing an average increase of 2.42 and 3.00 harms, respectively. The effect sizes, as measured by Cohen's d [33], were d = 1.21 and d = 1.27, indicating very large effects [157]. In contrast, for participants using Envisioning Guide, the average count of identified harms decreased marginally (−0.14). We hypothesize that three participants identified fewer harms after using Envisioning Guide (see the outliers in Fig. 12-A) because Envisioning Guide imposed a high cognitive load, which may have left these participants with less energy to envision harms in H3 than in H1. Changes in the average likelihood and average severity, on the other hand, were not statistically significant for any of the interventions (Fig. 12-BC).
Our finding implies that after using Farsight and Farsight Lite, users could anticipate a greater number of harms linked to AI features independently, while the average likelihood and severity of identified harms remained unaltered.
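To show how the statistics in this comparison relate to each other, the sketch below computes a paired t statistic and Cohen's d from per-participant H1 and H3 harm counts, plus a Bonferroni adjustment. It assumes d is the mean difference divided by the sample standard deviation of the differences, which is one common definition and may differ from the one used in the analysis. It does not compute p-values, since that requires the t distribution. The function names, the data, and the choice of three comparisons are made up for illustration.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_and_d(before: Sequence[float], after: Sequence[float]) -> Tuple[float, float, int]:
    """Return (t statistic, Cohen's d on the differences, degrees of freedom)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d = mean(diffs) / stdev(diffs)  # stdev uses the n - 1 denominator
    t = d * math.sqrt(n)  # equals mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, d, n - 1


def bonferroni_adjust(p_value: float, num_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1.0."""
    return min(1.0, p_value * num_tests)


# Invented harm counts for four participants (H1 = before, H3 = after).
h1_counts = [2, 3, 4, 3]
h3_counts = [4, 4, 7, 5]
t_stat, cohen_d, dof = paired_t_and_d(h1_counts, h3_counts)
adjusted = bonferroni_adjust(0.001, 3)  # 3 comparisons is an assumed number
```

A paired t-test on (H1, H3) pairs is the same computation as a one-sample t-test on the H3 minus H1 differences, which is why both descriptions can refer to the same analysis.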

Changes in Harm Envisioning Approaches. We also investigated the impacts of different tools on participants' approaches to harm envisioning by analyzing their self-reports in interview 1 and the think-aloud data in H1 and H3.
Self-reported changes after using Farsight and Farsight Lite. The major themes of self-reported changes are similar between Farsight and Farsight Lite. A large number of participants noted that while they initially considered the AI feature and its potential harms in a general sense during H1, they shifted towards a more focused consideration of specific use cases and stakeholders in H3 (e.g., P23, P34, P38). Some participants highlighted they started to brainstorm potential misuses in H3 (P25, P32). For stakeholders, participants broadened their consideration to people and organizations not initially considered during H1. P40 acknowledged a transition from a focus on "protecting the AI company" in H1 to considering end-users in H3. Similarly, P17 reported a focus on end-users after using Farsight:
"Earlier maybe I was coming towards it from a very engineering or a very broad feature perspective. The third time, I was thinking more about people who were actually using the product and getting affected. So I was thinking more with respect to the people using it, rather than that being a feature in some application." (P17)
Many participants also highlighted that they began to adopt the frameworks presented in Farsight and Farsight Lite (e.g., P9, P10, P32) to structure their harm envisioning procedures. For example, P10 and P32 appreciated the categorization of use cases, and they reported considering intended uses, high-stakes uses, and misuses in H3. After using Farsight, P9 said they followed the sequence of layers in the tree visualization to conceptualize use cases, stakeholders, and harms:
"I found that sort of flow from identifying potential use cases, then identifying stakeholders of those use cases, then identifying potential harms for each of the stakeholders to be really valuable. That's a great way to scaffold it and think through the flow rather than just sort of bouncing around, which is what I had been doing [in H1]. So yeah, I found that super valuable that has changed the way that I think about it. And that's the framework that I'll use in the future." (P21)
Self-reported changes after using Envisioning Guide. Many participants using Envisioning Guide in H2 (C_EF, C_EL) also noted shifts in their approaches to envisioning harms. Several participants noted that they started to follow the structure outlined in the Harm Modeling Guide to envision harms (P8, P40, P42). Some participants started thinking more about under-represented social groups in H3 (P8, P31). Furthermore, many participants described the harm taxonomy as a "mental checklist" that provided them with a language to articulate and think about harms (e.g., P6, P14, P31).
Table 3: We identified six non-exclusive common patterns in independent harm envisioning by analyzing transcripts of participants' think-aloud process during the harm envisioning activities in H1 and H3.
Harm Envisioning Pattern | Description
Failure-mode-driven envisioning | Participants envisioned harm by initially considering the AI feature's failure modes (e.g., wrong summarization), limitations of LLMs (e.g., hallucination), or vulnerabilities within system implementation (e.g., data storage). This pattern is similar to a Failure Mode and Effects Analysis [152].
Usage-driven envisioning | Participants envisioned harm by initially considering who may be impacted through this feature and in what usage scenario, such as students using the article summarizer for completing assignments. Then, participants envisioned potential harms that might impact the stakeholders within the identified scenario.
Consider high-stakes uses | Participants deliberately thought about high-stakes use cases of the AI feature, such as being used in medical, financial, and legal domains.
Consider misuses | Participants deliberately envisioned potential misuse of the AI feature, where malicious actors like scammers and hackers could exploit this AI feature to cause harm.
Consider indirect stakeholders | Participants deliberately brainstormed stakeholders indirectly impacted by the AI feature, such as people who did not use the AI tools, individuals mentioned in the input text, and society at large.
Consider cascading harms | Participants deliberately considered (1) harms that could result from other harms, such as productivity loss due to AI errors leading to economic loss; or (2) harms that might occur even when the AI feature operated as expected, such as students using AI to cheat on homework.
Observed changes in envisioning approaches. By analyzing transcripts of participants' think-aloud process during the harm envisioning activities in H1 and H3, we identified six non-exclusive common patterns in harm anticipation (Table 3). Then, we examined the effects of different interventions on participants' envisioning patterns by comparing the number of participants who applied and did not apply these six patterns in H1 and H3 across interventions (Fig. 13). The intervention assignment was random.
Interestingly, the counts of participants who applied each pattern in H1 were consistent across interventions, with the exception of Farsight Lite, where notably more participants considered indirect stakeholders in H1 (Fig. 13-5). Before the interventions, the majority of participants relied on failure-mode-driven envisioning when anticipating harms (Fig. 13-1), focusing on the AI feature's limitations, failure modes, and technical implementation details. This observation corroborates participants' self-reported envisioning approaches, where participants like P17 acknowledged having a "very engineering or a very broad feature perspective" in H1.
After the intervention, we observed that all three harm envisioning tools (Farsight, Farsight Lite, and Envisioning Guide) influenced participants to adopt a usage-driven envisioning approach when independently envisioning harms (Fig. 13-2). In particular, Farsight had the most pronounced effect, followed by Farsight Lite and then Envisioning Guide. All these tools encouraged participants to think more about high-stakes uses (Fig. 13-3) and indirect stakeholders (Fig. 13-5). Both Farsight and Farsight Lite exerted a stronger influence on considering misuses (Fig. 13-4) and cascading harms (Fig. 13-6) compared to Envisioning Guide. However, Envisioning Guide had slightly more impact than Farsight
Fig. 13: By analyzing transcripts of 42 participants during the pre-task (H1) and post-task (H3) harm envisioning activities, we identified six non-exclusive common patterns in envisioning harms. This bar chart compares the number of participants who applied and did not apply these patterns before and after the three interventions. Note that there were 14 randomly assigned participants for each intervention, and the initial number of participants applying certain patterns could differ. The chart highlights that both Farsight and Farsight Lite encouraged participants to consider how the AI feature would be used. Notably, the use of Farsight particularly influenced participants to think more about indirect stakeholders and cascading harms.
Interestingly, Farsight had a notably more pronounced effect in leading participants to consider indirect stakeholders (Fig. 13-5) and cascading harms (Fig. 13-6) than the other tools. For indirect stakeholders, a possible explanation is that during H2, many participants encountered unexpected indirect stakeholders revealed by Farsight (§ 6.5.2). Consequently, these participants consciously began to consider stakeholders that might seem tangential but could be influenced by the AI feature in H3. This hypothesis could also explain the relatively weaker effect of Farsight Lite in fostering consideration of indirect stakeholders, as Farsight Lite had only identified one direct stakeholder for each use case, and participants could not use AI to generate more stakeholders.
For cascading harms, we hypothesize two potential explanations. First, many participants applied a reviewing approach when engaging with AI-generated harms in Farsight and Farsight Lite, where they tried to understand and make sense of these harms. In H2, reviewing existing harms prompted participants to consider cascading harms that might arise from other harms (§ 6.5.2). This experience could have influenced participants to also consider cascading harms in H3. The second explanation is that many participants were surprised by unexpected AI-generated cascading harms in H2 (§ 6.5.2), which might have led them to consciously think about these harms in H3.

Findings: Farsight's Effectiveness in Assisting Harm Envisioning (RQ2)
In addition to assessing the impacts of different harm envisioning tools on users' ability to independently envision harms, we also evaluated the tools' effectiveness in aiding users to anticipate harms.
Specifically, we quantitatively compared participants' envisioned harms when using different harm envisioning tools in H2 and H4. Furthermore, we qualitatively analyzed participants' usage patterns, interview responses, and survey data.

Farsight and Farsight Lite helped users envision more harms.
We compared the count, average likelihood, and average severity of harms collected in H2 and H4 using our tools, Farsight and Farsight Lite, against the baseline Envisioning Guide (Fig. 14). These harms were identified by participants using different harm envisioning tools or generated by AI and selected by the participants. This analysis followed a within-subjects approach, including the 28 participants who used both one of our tools and Envisioning Guide. In each comparison, such as Farsight vs Envisioning Guide, a total of 14 participants used both tools, with 7 of them starting with Farsight in H2, and the remaining 7 beginning with Envisioning Guide.
Results from paired t-tests, adjusted with Bonferroni correction, showed that participants using Farsight and Farsight Lite identified a significantly higher number of harms than those using Envisioning Guide (p = 0.0018, p = 0.0034), with an average difference of 4 harms (Fig. 14A). The effect sizes, as measured by Cohen's d, were d = 1.57 and d = 1.48, indicating a very large effect. However, no significant differences were observed regarding the likelihood and severity of identified harms between our tools and Envisioning Guide (Fig. 14-BC). Our findings suggest that our tools are effective in assisting users to identify a greater number of harms compared to existing resources, while the quality of the identified harms remains consistent.
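To make the analysis above concrete, the following is a minimal sketch of a paired comparison with a Bonferroni adjustment and a Cohen's d effect size. It is not the authors' analysis code: the harm counts are hypothetical, the function names are illustrative, and the effect size uses the paired-differences variant of Cohen's d (mean difference divided by the standard deviation of the differences), which is an assumption because the text does not state which variant was used. The p-value itself is omitted because it requires the t-distribution CDF, which a statistics library such as SciPy provides.

```python
import math
import statistics


def paired_t_and_cohens_d(a, b):
    """Return (t statistic, degrees of freedom, Cohen's d) for paired samples.

    Cohen's d here is the paired-differences variant (an assumption):
    mean of the differences divided by their sample standard deviation.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff
    return t_stat, n - 1, cohens_d


def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical harm counts for 14 participants who used both tools.
farsight_counts = [9, 7, 8, 10, 6, 9, 8, 7, 11, 8, 9, 6, 10, 8]
guide_counts = [5, 4, 5, 6, 3, 4, 5, 4, 6, 4, 5, 3, 5, 4]

t_stat, df, d = paired_t_and_cohens_d(farsight_counts, guide_counts)
print(f"t({df}) = {t_stat:.2f}, Cohen's d = {d:.2f}")
```

With this variant, the t statistic equals d multiplied by the square root of the number of pairs, so a large effect can reach significance even with only 14 participants per comparison.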

Usage patterns.
We summarized how participants used Farsight and Farsight Lite in H2 and H4.
Fig. 14: To evaluate the effectiveness of our tools in helping users anticipate harms, we conducted paired t-tests with Bonferroni correction to compare our tools (Farsight, Farsight Lite) against the baseline Envisioning Guide based on the (A) count, (B) average likelihood, and (C) average severity of harms collected in H2 and H4. In each comparison, such as Farsight vs Envisioning Guide, n = 14 participants (each shown as two connected dots) used both tools: 7 of them started with Farsight in H2, and the remaining 7 began with Envisioning Guide. The charts also highlighted the mean and standard deviation of all measures. The results showed that Farsight and Farsight Lite were effective in assisting users to anticipate a significantly greater number of harms compared to existing resources, while the quality of the identified harms remained consistent.
Trying to understand (unexpected) AI-generated content. Upon encountering AI-generated content (e.g., use cases, stakeholders, and harms), participants first sought to (1) understand why AI had generated it and then (2) assess its likelihood and relevance to their AI application. For example, for the toxicity classifier in H2, Farsight and Farsight Lite would sometimes generate the use case "HR departments use it to screen job applicants for toxic behaviors." This use case was usually unexpected to participants, and it provoked them to think about how an HR department could employ a toxicity classifier. Some participants imagined that HR could use this classifier on applicants' social media to identify red flags (e.g., P10, P11, P29), while others could only see it being used on applicants' cover letters (P4). Participants then assessed how likely and relevant this scenario was before diving into related harms.
Subjectivity in apprehending auto-generated content. We observed that, based on their prior experiences, participants could have very different views on auto-generated content in Farsight. For example, participants had different perceptions of how their companies' HR division might use a toxicity classifier (e.g., applying the classifier to job applicants' social media content or their application material). Also, for the toxicity classifier in H2, the Incident Panel would often show an incident report on biases in sentiment analysis tools. While some participants could quickly make the connection between sentiment analysis and toxicity classification and reflect on biases in toxicity classifiers (P10, P36), others would overlook this incident (P19, P38).
In some cases, participants' disagreement came from their different definitions of harm. For example, in both H2 and H4, our tools would generate potential harms for people who do not use the AI applications, such as "students who do not use the math tutoring app may feel left behind." Some participants perceived these harms as crucial considerations for assessing the impacts of AI applications (e.g., P6, P18, P30), while others argued against considering harms when an AI feature is absent (e.g., P4, P9, P13). We discuss the implications of subjectivity and rater disagreement in harm envisioning in § 7.2.
Sparked to brainstorm new harms. The content in Farsight and Farsight Lite often inspired participants to brainstorm new use cases, stakeholders, and harms. After seeing an AI-generated stakeholder, many participants could quickly identify potential harms for that stakeholder. For instance, seeing the stakeholder teachers in the math tutoring app in H4, P22 added a new harm that teachers may struggle to integrate this tool into their existing teaching workflows. Many participants also came up with new harms by making connections across different AI-generated use cases, stakeholders, and harms. For example, Farsight anticipated two use cases for the toxicity classifier: (1) online moderators using it to identify toxic content, and (2) hate groups using it to recruit people. P2 connected both use cases and added a new harm: "online moderators could face death threats from hate groups who feel their speech is censored."
Thinking beyond immediate harms. Instead of starting with a blank slate, our tools provided participants with initial materials that prompted them to think beyond the immediate harms and envision cascading repercussions. For example, after seeing the AI-generated harm "job applicants might be unfairly rejected" within the context of HR using a toxicity classifier to screen job applicants, P38 quickly thought of a cascading harm: the company's diversity hiring effort could be harmed, as the toxicity classifier was more likely to misclassify and reject under-represented social groups. Similarly, P18 recognized that, in the long run, the hiring company could lose money due to the exclusion of qualified candidates caused by a biased toxicity classifier. This usage pattern might explain the increase in the number of participants who, having used Farsight or Farsight Lite in H2, independently envisioned cascading harms in H3 (Fig. 13-6).
Fig. 15: Average ratings from 28 participants, comparing the usefulness and usability of Farsight and Farsight Lite to Envisioning Guide. Both of our tools were preferred and perceived as more helpful, easier to use, and more enjoyable than the existing resources. Each comparison involved 14 participants who used one of our tools and Envisioning Guide in random order. We use an asterisk (*) to denote statistically significant rating differences, determined by Mann-Whitney U tests with Bonferroni correction. We used Mann-Whitney U tests instead of t-tests due to the non-normal distribution of many ratings.
Thinking about mitigation strategies. Interestingly, after seeing AI-generated harms, many participants voluntarily considered actions and strategies to take after envisioning harms. For example, after seeing AI-generated harms for the toxicity classifier, P15 and P16 noted that it was important to allow impacted stakeholders to appeal if their content was removed because of the classifier. Similarly, P27 and P40 noted that people should implement a human review process if the toxicity classifier was used to remove social media content. Interacting with Farsight and Farsight Lite also encouraged participants to reflect on their prompting workflows. For example, P29 and P37 mentioned that AI prototypers should start collecting good and diverse toxicity examples to improve the prompt through few-shot prompting. P2 noted that they would like to add additional instructions in their prompt to safeguard against biased output and potential data leakage. Finally, after envisioning more harms, P2 mentioned that they would rethink whether it was worth continuing to prototype or develop this AI feature.
6.5.3 Our tools were usable, useful, and preferred by users. We asked participants who had used one of our tools and Envisioning Guide to compare and rate the usefulness and usability of the tools they had used on a 5-point Likert scale. By comparing their ratings, we found that both Farsight and Farsight Lite were preferred and considered more helpful, easier to use, and more enjoyable than Envisioning Guide (Fig. 15). Both tools had significantly higher ratings on "easy to use" than the baseline (p = 0.0384, p = 0.0260). In addition, Farsight was rated significantly more helpful than the baseline (Fig. 15A), while Farsight Lite was rated significantly more enjoyable (Fig. 15B). The effect sizes of significant results, as measured by the common language effect size [110], were all above 0.7, indicating a large effect.
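As a rough illustration of the rating comparison described above, the sketch below computes a Mann-Whitney U statistic and a common language effect size for two sets of Likert ratings. The ratings are hypothetical and the function names are illustrative. The effect size is computed as U divided by the number of rating pairs, with ties counted as half; this is one common formulation and is an assumption here, since the text does not spell out the formula behind [110].

```python
def mann_whitney_u(x, y):
    """U statistic for x: number of (x, y) pairs where x wins, ties count half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def common_language_effect_size(x, y):
    """Estimated probability that a rating drawn from x exceeds one drawn from y."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    return mann_whitney_u(x, y) / (len(x) * len(y))


# Hypothetical 5-point "easy to use" ratings from 14 participants per tool.
tool_ratings = [5, 4, 5, 4, 5, 3, 4, 5, 4, 5, 4, 4, 5, 3]
baseline_ratings = [3, 2, 4, 3, 3, 2, 4, 3, 2, 3, 4, 3, 2, 3]

cles = common_language_effect_size(tool_ratings, baseline_ratings)
print(f"common language effect size = {cles:.2f}")
```

A value of 0.5 would mean neither tool tends to be rated higher; values above 0.7, as reported above, mean that a randomly chosen rating for the tool exceeds a randomly chosen baseline rating most of the time.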
Usefulness of different features. Besides comparing the two tools, participants also rated the usefulness of specific features in each tool. The average ratings are shown in Fig. 16. All features in our tools were rated favorably (Fig. 16-AB). For Farsight, participants especially liked the interactive tree visualization. For example, P6 commented, "This tree makes a lot of sense. This is how I think about it in my brain as well." Similarly, P16 appreciated the progressive disclosure in the visualization: "I'm able to not get overwhelmed by everything all at once." The rating for the AI incident panel (in both Farsight and Farsight Lite) was lower than those of other features. Participants explained that the surfaced incidents were not very relevant to their prompts (P39, P41), and the feature would require them to take time to read external articles (P24, P39).

Findings: Farsight's Role in Overcoming Harm Envisioning Challenges (RQ3)
After completing the post-task (H3), participants were asked to reflect on the biggest challenges encountered in envisioning harms associated with AI features. We examined the major themes that emerged from these challenges. In addition, by analyzing participants' usage patterns of Farsight and Farsight Lite, coupled with their interview feedback, we explored how our tools mitigate certain challenges and also identified our tools' limitations.
6.6.1 Challenges in envisioning harms. We summarized three major challenges that participants encountered.
C1. Envisioning use cases. The most prevalent challenge in envisioning harms was anticipating different use cases for an AI feature. Multiple participants noted that it was most challenging to imagine how different people would use technology, and it was particularly difficult to "put myself in someone's shoes" (P27, P37, P39) and "empathize with different groups of people" (P11). Participants also underscored the vast space of possible use cases (P31, P33, P36), noting that "often you don't find out the edge cases until you actually work with it" (P2). Some participants also emphasized that it sometimes required creativity to imagine how an AI feature would be used and especially misused (e.g., P5, P22, P23).
C2. Bias and subjectivity in harm envisioning. Interestingly, several participants recognized their own biases in envisioning harms (e.g., P6, P21, P31). For example, P21 noted the challenge in overcoming their biases in anticipating the impacts of AI features: "I had been coming at it from a very American-centric point of view at first. To talk about bias, I hadn't even conceived of the government using this to monitor my phone, but that could happen in other places." Moreover, some participants acknowledged the subjectivity in the definition of harms, as well as in the assessment of harms' likelihood and severity. For example, while envisioning harms and selecting harms to report (H2 and H4), some participants were conscious of whether other people would agree with their identification and assessment of harms (P19, P38).
C3. Inexperience and discomfort in harm envisioning. Many participants mentioned that our study was the first time they had envisioned harms for AI features (e.g., P17, P26, P28). For example, P26 noted, "I have never envisioned harm before. This is not something I would think of when developing AI products." Similarly, P18 said, "I'm familiar with technical issues but not their social influence". Also, P30 pointed out that there were few incentives for developers to envision harms. In addition to unfamiliarity, some participants also noted that it was uncomfortable and sad to think about harms (P3, P12). For example, P3 said, "It's not comfortable thinking through all the bad things that can happen. I think in general people don't like thinking about bad things too much."
6.6.2 Farsight and Farsight Lite address major challenges. Our tools could help users address identified challenges.
A co-pilot for brainstorming diverse use cases. Many participants appreciated that our tools provided them with a starting point to predict use cases (e.g., P8, P29, P41). For example, after seeing a few AI-generated use cases, P8 found it much easier to envision other use cases, and similarly, P24 felt empowered to "have a wider net to cast" (C1). Also, P14 noted that even seeing far-fetched AI-generated content helped them brainstorm new use cases. On the other hand, P21 appreciated that Farsight had identified many unexpected and thought-provoking use cases that provided a different perspective in anticipating harms (C2).
In situ guide that promotes user agency. Participants especially liked that our tools were directly integrated into existing AI prototyping tools and contextualized based on the prompt (e.g., P19, P31, P37), where Farsight and Farsight Lite required minimal effort to get started envisioning harms (C3). Participants also thought of the Incident Panel and Use Case Panel as good reminders of potential harms for the AI feature that one is prototyping (e.g., P12, P41, P42). For example, P12 commented that "Even if it's just sitting there, it would be educational." Many participants also liked the interactivity of our tools and found it engaging for adding new use cases, stakeholders, and harms (e.g., P9, P19, P24); many of them noted that Farsight was so intriguing that they would like to continue using it to explore potential harms (C3). Participants felt they had agency in harm envisioning when using Farsight. For example, P21 commented, "If you think something [AI-generated content] is totally bonkers, whatever, just delete it." Similarly, P4 and P5 compared the Harm Envisioner to a mind map, as they appreciated that the interface allowed them to freely organize and revise their thoughts in harm envisioning.
6.6.3 Limitations of Farsight and Farsight Lite. Our findings showed that, in comparison to Envisioning Guide, Farsight and Farsight Lite did not show significant differences in participants' ability to envision more likely or more severe harms (§ 6.4.1), nor did they assist participants in envisioning more likely or more severe harms (§ 6.5.1). Additionally, participants' feedback revealed two major limitations of our tools.
Varied quality of LLM-generated content. Depending on participants' prompts, the related AI incidents in the Incident Panel and the LLM-generated use cases, stakeholders, and harms differed across participants. Sometimes, participants found some LLM-generated content confusing and unhelpful. For example, when using our tools on the math tutor prompt (H4), the incidents in the Incident Panel featured articles about hallucination in chat-based LLMs. Some participants found these articles too generic and not relevant to the math tutor app (P39, P41).
Also, some LLM-generated use cases could be too far-fetched. For example, for the math tutor prompt, Farsight sometimes showed a use case: "Scammers use it to explain complex investment schemes to potential victims." While some participants found it interesting and relevant (e.g., P14, P26), others found it unrealistic and not useful (e.g., P6, P12). This disagreement highlights the subjectivity in identifying and assessing harms (§ 7.2). Interestingly, a few participants defended the usefulness of far-fetched content. P24 noted, "Even if it's wrong [LLM-generated use case], it is still kind of helpful to think beyond the immediate use case and who else can use this tool." Similarly, P21 said, "Some of these feel more of a stretch but it's interesting because I could see how it gives me ideas for things to watch out for which I still appreciate."
Lack of actionability. Another limitation is that our tools did not provide users with actions to prevent or mitigate identified harms (P13, P22, P34). P13 also commented that increasing awareness without providing actions to address responsible AI issues could be harmful, because "People have an empathy quota, and it might just be displacing more impactful efforts." Related to the discomfort that some participants had experienced when envisioning harms (C3), P40 mentioned that they felt scared and overwhelmed because there were so many possible harms and they did not know how to address them. Similarly, P29 noted that the lack of actionability made them feel anxious and disappointed: "

Limitations of Study Design
We acknowledge limitations in our tool and study designs. First, we recruited participants from a single large technology company. This was because we needed participants to have prior experience in prototyping LLM-powered applications using a particular prompt-crafting tool, into which we integrated Farsight and Farsight Lite in the study. Consequently, all 42 participants had backgrounds in the technology industry in varying roles, such as software engineers, product managers, UX researchers, and linguists, as shown in Table 2. Our participants have a wide range of familiarity with responsible AI and prompting (Fig. 10), and they use LLMs for diverse tasks, including prototyping AI features with LLMs, much like the intended users of Farsight. Therefore, findings from our study may be generalizable to AI prototypers who have worked in the technology industry, and who are using LLMs to prototype AI-based applications. Nevertheless, to understand how usable or effective Farsight may be for a broader spectrum of AI prototypers, particularly those with limited background or knowledge of AI, such as creative writers, teachers, students, and more, further research involving individuals with more diverse backgrounds is needed. Second, we administered only one post-task (H3) immediately following the intervention (H2). To evaluate the long-term impact of our tools on users' ability to envision harms, a more extended longitudinal study is needed.
Finally, an inter-rater reliability test showed that, on average, the seven raters (i.e., of the likelihood and severity of the identified harms) had only slight agreement ( ‡ E). The ratings of the likelihood and severity of participants' identified harms should thus be taken as an initial step in evaluating identified harms, and not as the sole evidence demonstrating the value of this approach. The relatively low inter-rater reliability may be due to the fact that perceptions of severity and likelihood may be highly influenced by the raters' personal experiences, backgrounds, knowledge, and their positionality as a whole. Indeed, substantial prior work on annotations of offensive language, hate speech, and other linguistic phenomena [35,39,123,136,138] suggests that disagreement between raters with different subjectivities (i.e., personal backgrounds and experiences) is an inherent challenge to sociotechnical evaluations, and not one that can be solved with more or better raters. We further discuss the challenges regarding subjectivity in identifying and assessing harms in § 7.2.

DISCUSSION
7.1 Motivation & Engagement in Responsible AI
Potentials of in situ and early intervention in motivating responsible AI practices. Existing research suggests that many AI developers may not have incentives to consider potential harms related to their AI applications [143], or may be actively disincentivized to identify such harms [104]. Our co-design user study validates this finding among an emerging community involved in AI development: AI prototypers who use LLMs to rapidly iterate on potential AI-based applications (§ 3.1). With the rapidly increasing access to LLMs and easy-to-use prototyping tools, it is crucial to motivate AI prototypers to consider AI risks when prototyping their AI applications or features (G3). To tackle this challenge, we propose an in situ system design that integrates our tool into the AI prototyper's existing workflows and employs different design strategies to draw users' attention without causing significant interruption to their flow. Our evaluation study shows that users appreciate our design, and find this in-context warning tool easy to adopt and engaging (§ 6.6.2). By showing unexpected use cases, stakeholders, and harms, Farsight piques users' interest (§ 6.5.2) and motivates them to brainstorm more harms (§ 6.5.2). These findings highlight the great potential of in situ design and early intervention for future responsible AI work. Therefore, future designers of AI development tools (e.g., Google AI Studio, computational notebooks, and VSCode) can natively integrate in situ interfaces to promote responsible AI practices. In addition, future researchers can adopt our design strategies to foster other responsible AI practices, such as illustrating bias in LLMs and encouraging development documentation at an early AI development stage.
Tension between automation and human agency. Farsight's seamless integration into AI prototypers' workflows helps motivate AI prototypers to engage with harm envisioning. In addition, rather than asking users to anticipate harms entirely from scratch, Farsight leverages LLMs to generate the initial set of use cases, stakeholders, and harms, providing users with inspiration and a foundation to build upon (§ 6.6.3). However, this seamless and automated design might deter users from fully engaging in and contemplating the limitations and potential risks associated with LLMs. Prior research in responsible AI has proposed the value of a seamful design [e.g., 52,86], where the designers strategically reveal seams or introduce frictions or "productive restraint" [85,104] to support increased reflection on responsible AI during development.
To explore the tension and tradeoffs between a seamlessly designed workflow that is easy for prototypers to use and a seamful design that prompts reflection-in-action [52], we (1) designed the Harm Envisioner to encourage users to edit LLM-generated content and steer the harm envisioning direction (§ 4.3, G4), and (2) evaluated two variants of our tool in the evaluation study: Farsight and Farsight Lite, where Farsight Lite omits the Harm Envisioner.
Our study results highlight that participants feel they have agency (§ 6.6.2), and they like being able to control the harm anticipation process (Fig. 4). Our quantitative results also show that Farsight, with higher human agency, is more effective than Farsight Lite across all measures (§ 6.4.1, § 6.5.1). On the other hand, when engaging with AI-generated content, some participants also report discomfort (C3) and even anxiety (§ 6.6.3). Therefore, our work demonstrates that seamless design (in situ AI automation) and seamful design (promoting user reflection) are complementary to each other; tradeoffs and a balance between the two should be considered during the design of responsible AI tools [cf. 198]. For future responsible AI work, researchers should engage with potential users and other impacted stakeholders throughout the design process and adjust their design ideas to ensure the responsible AI tools they are designing are both easily adoptable and capable of eliciting active and critical reflection.

Subjectivity in Harm Envisioning
In our evaluation user study, many participants report challenges overcoming the limitations of their own experiences and perspectives when envisioning harms (C2). In addition, we observed that the seven RAI raters of participants' harms disagreed about which harms were more or less severe or likely, resulting in a low inter-rater reliability for these two dimensions ( ‡ E, Table S2). Our empirical findings contribute to prior research that highlights the role of subjectivity and positionality in anticipating harms [20,99] and in data annotation, particularly for annotations of toxicity or hate speech [e.g., 35,39,43,123,138]. What constitutes harm and the assessment of harm severity are often influenced by the individual's background, lived experiences, or even the organizational culture they are working in [130,196]. For example, for the article summarizer (H3), one participant envisioned a harm scenario: "If the summary is wrong, journalists' reputation might be harmed." (Table S2). This harm scenario received likelihood ratings of 1, 4, and 3, and severity ratings of 1, 3, and 4 from three randomly assigned raters. It is possible that the rater who assigned the ratings of 3 and 4 possessed specific knowledge about the harms of journalists using LLMs to write article summaries, which led them to rate this harm scenario as more likely and more severe.
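To illustrate how far apart the three raters were on the journalist example above, the following minimal sketch computes two simple spread measures for one harm's ratings: the range and the mean absolute difference across rater pairs. These are descriptive illustrations only, not the inter-rater reliability statistic used in the study, and the function name is hypothetical.

```python
from itertools import combinations


def rating_spread(ratings):
    """Return (range, mean absolute pairwise difference) for one harm's ratings."""
    if len(ratings) < 2:
        raise ValueError("need ratings from at least two raters")
    pairs = list(combinations(ratings, 2))
    mean_abs_diff = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return max(ratings) - min(ratings), mean_abs_diff


# Ratings reported in the text for the journalist reputation harm.
likelihood = [1, 4, 3]
severity = [1, 3, 4]

for name, ratings in (("likelihood", likelihood), ("severity", severity)):
    spread, mean_diff = rating_spread(ratings)
    print(f"{name}: range = {spread}, mean pairwise difference = {mean_diff:.1f}")
```

For both dimensions the raters' scores span three points, a wide gap for ratings that range from 1 to 4 in this example, which is consistent with the low agreement described above.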
A need for new methods to assess harms. Emerging research is beginning to develop methods for measuring and resolving disagreements among annotators in cases where there may in fact be no ground truth [e.g., 35,39,72,123,138]. Our findings in this paper, including the low inter-rater reliability of the responsible AI raters, suggest that new methods are needed in responsible AI to account for different perspectives on the severity and likelihood of potential downstream harms. This may ideally involve recruiting participants from communities or populations who may be impacted by a given AI application (e.g., the stakeholders generated by Farsight, as well as other stakeholders identified by members of the communities themselves [37]). Moreover, with the rapidly increasing access to LLMs and easy-to-use AI prototyping tools, AI prototypers may encompass a broader spectrum of roles beyond traditional AI practitioners [e.g., 78,183]. Thus, they may lack either the experience or the resources to recognize the limitations of their own subjectivity when anticipating harms of their AI applications, and may lack the means to identify and engage with diverse stakeholders as part of harm envisioning.
Benefits and challenges of using LLMs to envision AI harms. Our evaluation study highlights that diverse and unexpected AI-generated use cases, stakeholders, and harms in Farsight help some participants overcome their own failures of imagination [20] in order to think from a broader perspective when independently envisioning harms (§ 6.5.2). Notably, these effects were more prominent with Farsight than with existing harm envisioning processes [120] (Fig. 12). There are two implications of these findings. First, LLMs can be a promising tool to help AI prototypers think outside of their own perspectives, and future researchers can adapt our approach to other responsible AI practices. Second, LLMs may encode biases from their training data [e.g., 190], and Farsight may also reflect the biases of its creators, as expressed in the underlying prompts used in Farsight's LLM, which raises a critical question: to what extent can LLMs be helpful as part of a harm envisioning process, without over-indexing on particular harms or leading AI prototypers to overlook other types of harms?
Our research provides a starting point for investigating these questions and opens new questions about the role of subjectivity in harm envisioning. Future research can further investigate the factors influencing users' ability to envision harms of AI applications, develop new ways to model and resolve disagreement among AI prototypers or other evaluators about the severity and likelihood of envisioned harms, and integrate such implications into LLM-powered responsible AI tools for AI prototypers or other AI practitioners. Future research can also explore tradeoffs between semi-automated harm envisioning processes (like Farsight) and more traditional processes like value-sensitive design [e.g., 57], participatory design [e.g., 16, 37, 130], and more.

Mitigating Harms during AI Prototyping
A limitation of Farsight is its focus on harm identification rather than harm mitigation. Participants from our co-design study (§ 3.1) and evaluation study (§ 6.6.3) wanted Farsight to provide actionable items to help them prevent and mitigate identified harms. Some participants also suggested we develop an in situ prompt editing tool to address harms identified from Farsight (§ 3.1). Interestingly, while using Farsight, some participants voluntarily thought about actions and strategies to take after envisioning harms, such as implementing an appeal process, collecting better data, and revising the prompts (§ 6.5.2). These strategies identified by participants reflect some current strategies for mitigating LLM harms (§ 2.2).
Looking ahead, we argue that it is crucial for future designers to provide users with harm mitigation suggestions and resources in systems similar to Farsight. Some participants in our study complained that Farsight is exploiting users' "empathy quota" and potentially desensitizing them about LLM harms, because Farsight only warns users about harms without providing mitigation suggestions (§ 6.6.3). This concern reflects the phenomenon of "alarm fatigue" in alerting tools (§ 2.4) and monitoring alarms in healthcare. Alarm fatigue occurs "when non-actionable alarms are in the majority, and clinicians develop decreased reactivity, causing them to 'tune out' or ignore the alarms" [82]. Therefore, to combat alarm fatigue and effectively promote responsible AI practices, future designers should make responsible AI alerts actionable and prioritize actionable warnings in their systems.
Our findings highlight that Farsight users have a great appetite for mitigation strategies during AI prototyping. We have two hypotheses for this observation. First, as Farsight promotes human agency, it might also give participants a feeling of ownership of their identified harms. Prior research shows that triggering a feeling of ownership motivates users' actions [26]. Another hypothesis is that Farsight elicits fear by exposing participants to diverse potential harms of their AI applications, evidenced by participant-reported discomfort (C3) and anxiety (§ 6.6.3). Security researchers use fear appeals as a design strategy to motivate users to take security actions [148]. Therefore, our empirical findings highlight promising research opportunities in (1) providing in situ mitigation strategies during the early AI prototyping stage, and (2) investigating if in situ tools can increase users' adoption of harm mitigation strategies.

CONCLUSION
We introduce Farsight, the first in situ interactive tool to address the challenges in anticipating potential harms in LLM-powered applications during prototyping. By highlighting relevant AI incident reports and enabling AI prototypers to curate and modify LLM-generated use cases, stakeholders, and harms, Farsight improves users' ability to independently anticipate potential risks associated with their prompts. A user study with 42 AI prototypers shows that our tool is useful and usable. Farsight fosters a user-centric approach, encouraging creators to consider end-users and cascading harms and to extend their awareness beyond immediate harms. Our tool is open-source and readily adoptable. We hope our work will inspire future research and development of responsible AI tools that target the early stages of the AI development process.

Fig. S3: We compute weighted pairwise Cohen's kappa to measure inter-rater reliability for the ratings of harm likelihood (left) and harm severity (right). Raters rate each dimension on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree). We numericalized these four categories as ordinal data: 1, 2, 3, 4. Within each cell, the top number is the kappa score, and the bottom number is the count of common harms between the corresponding two raters. The average kappas for likelihood and severity ratings are 0.11 and 0.10, which can be interpreted as slight agreement.

E HARM RATING PROCESSING AND INTER-RATER RELIABILITY
In our evaluation user study, we recruited seven raters to rate the likelihood and severity of each collected harm (§ 6.2.4). In total, we collected 989 harms, 895 of which were unique (Table S2). We randomly assigned each unique harm to three raters, who rated its likelihood and severity on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree with the statements "this harm is likely to happen" and "this harm is severe"). Raters could also choose an N/A option if they perceived that a rating was not applicable. After collecting all ratings, we dropped all N/As and converted the four rating categories to ordinal scores: 1, 2, 3, 4.
To measure the inter-rater reliability, we computed Cohen's kappa [112] for each pair of raters. As the rating scores are ordinal, we used quadratically weighted kappa [32], so that the agreement between scores 1 and 2 counts as higher than the agreement between scores 1 and 3. The pairwise kappas and the count of common harms are shown in Fig. S3. The average kappa for likelihood ratings is 0.14, and the average kappa for severity ratings is 0.09. Both scores can be interpreted as "slight agreement" [94].
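The pairwise computation described above can be illustrated with a minimal sketch. It assumes the standard definition of quadratically weighted Cohen's kappa on a 4-point ordinal scale and restricts each pair of raters to the harms they both rated. The function names, the `ratings` data layout, and the toy scores are hypothetical and are not taken from the study's analysis code.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def quadratic_weighted_kappa(
    a: Sequence[int], b: Sequence[int], categories: Sequence[int] = (1, 2, 3, 4)
) -> float:
    """Quadratically weighted Cohen's kappa for two raters on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters need the same non-empty set of items")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Observed joint distribution of the two raters' scores.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n

    # Marginal distributions give the agreement expected by chance.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted chance disagreement
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)  # quadratic penalty
            num += weight * observed[i][j]
            den += weight * row[i] * col[j]

    if den == 0.0:
        # Both raters used one identical category: kappa is undefined.
        return float("nan")
    return 1.0 - num / den


def pairwise_kappas(
    ratings: Dict[str, Dict[str, int]]
) -> Dict[Tuple[str, str], Tuple[float, int]]:
    """Map each rater pair to (kappa, number of commonly rated harms).

    `ratings` maps a rater id to {harm id: ordinal score}; N/A ratings are
    assumed to have been dropped already (hypothetical layout).
    """
    results = {}
    for r1, r2 in combinations(sorted(ratings), 2):
        common = sorted(set(ratings[r1]) & set(ratings[r2]))
        if not common:
            continue
        a = [ratings[r1][h] for h in common]
        b = [ratings[r2][h] for h in common]
        results[(r1, r2)] = (quadratic_weighted_kappa(a, b), len(common))
    return results
```

Because each harm is rated by only three of the seven raters, the number of common harms differs between pairs, which is why the sketch returns that count alongside each kappa.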

E.1 Example Harms Collected from Participants
Table S2: A random subset (n=84) of 895 unique harms collected from 42 participants in our evaluation user study (§ 6). This random subset includes 14 harms for each of the four AI features (four prompts used in H1-H4). Depending on the experimental conditions, each harm was envisioned by a participant (H1-H4) or generated by Farsight and curated by a participant (H2, H4). Each harm was rated by three random raters out of seven raters in terms of likelihood and severity on a 4-point Likert scale (strongly agree, agree, disagree, and strongly disagree to statements "this harm is likely to happen" and "this harm is severe").
Job applicants: Reddit has a karma score. Similarly, if a social media uses this feature to prioritize non-toxic content, misclassification can lose job opportunities for social media users (e.g., tweet not seen by companies).
3, 4, 1 Online moderators: Online moderators may lose their jobs because the AI tool can already distinguish toxic vs non-toxic so there is no need for a moderator.

Fig. 2: (A) Many AI prototypers from diverse backgrounds and roles use (B) prompting tools to prototype AI applications. Farsight provides a range of in situ widgets for these tools, helping AI prototypers envision the potential harms of their AI applications during an early prototyping stage.

Fig. 6: The Awareness Sidebar provides in situ information to remind AI prototypers of potential risks. (A) Given a user's current prompt, (B) the Incident Panel shows the (B1) latest and (B2) related AI incident reports sampled from the AI Incident Database [111]. (B2) The related AI incident tab is the default view, which uses text embedding similarities between the user's prompt and all AI incident reports to surface relevant reports. (C) The Use Case Panel leverages an LLM to generate potential use cases and harms. Each use case is classified by an LLM and organized into (C1) intended, (C2) high-stakes, and (C3) misuse tabs.
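The embedding-based retrieval mentioned in the Fig. 6 caption can be sketched as follows. This is an illustration under stated assumptions, not Farsight's implementation: it assumes cosine similarity as the similarity measure (the caption does not name one), uses toy 3-dimensional vectors in place of real text embeddings, and the function names and report ids are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_related(
    prompt_embedding: Sequence[float],
    report_embeddings: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k incident reports most similar to the prompt, best first."""
    scored = [
        (report_id, cosine_similarity(prompt_embedding, embedding))
        for report_id, embedding in report_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data standing in for embeddings of a prompt and three incident reports.
prompt = [1.0, 0.0, 0.0]
reports = {
    "report_a": [1.0, 0.0, 0.0],
    "report_b": [0.0, 1.0, 0.0],
    "report_c": [1.0, 1.0, 0.0],
}
related = top_k_related(prompt, reports, k=2)
```

In this toy example, `related` lists `report_a` first and `report_c` second, since their vectors point closest to the prompt's direction.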

Fig. 7: The Harm Envisioner helps AI prototypers envision harms associated with their AI applications through human-AI collaboration. (A) Given a prompt, (B) Farsight uses an LLM to generate a summary of the prompt and asks users to revise it. (C) Then, the Harm Envisioner presents an interactive node-link diagram to visualize use cases, stakeholders, and harms. Initially, the Harm Envisioner only shows up to the Use Cases layer. (C1) Users can edit the node content before asking AI to generate its children nodes by clicking . Users can edit any node and regenerate its children at any time, and click a node to show or hide its descendants. (C2) Users can delete unhelpful nodes. (C3) This view encourages users to think and add more harms by intermittently and randomly alternating harm categories shown in empty harm nodes, such as "increased labor?"

Fig. S1: To evaluate our early Farsight designs and generate more design ideas, we conducted a co-design study (§ 3.1) with 10 AI prototypers. Participants were asked to use our very early-stage design prototypes (shown in cells labeled with ) to envision potential harms associated with their application while thinking aloud. Participants were also presented with low-fidelity sketches for our other design ideas (shown in cells in the last row). The ratings of our design ideas are shown in Fig. 4.

Fig. S2: A visualization of the PaLM 2 embeddings of 3,474 AI incident reports [111] (blue dots and contour) and 153 Awesome ChatGPT Prompts [1] (red dots and contour). The embeddings' dimensions were reduced from 768 to 2 using UMAP [113] with default parameters. The rectangles and labels show the summaries of AI incident reports in high-density embedding neighborhoods. The summaries are automatically generated by WizMap [186]. The visualization reveals different clusters in the AI incident reports, such as incidents related to autonomous driving cars in the bottom left and machine translation on the right. The overlap between the red and blue contours indicates that user prompts can be in close proximity to AI incident reports in the 2D embedding space. This observation inspires us to use high-dimensional embedding similarities to calculate the alert levels in Farsight (§ 4.1). Note that in this example, the 153 user prompts form a cluster due to the primary focus of Awesome ChatGPT Prompts on conversational agents. The distribution of our 1,000 internal prompts (featuring classification, translation, code generation, etc.) is more spread out. For an interactive version of this visualization, visit WizMap.
it to identify potential terrorists. Potential terrorists ([different name]) may be unfairly targeted by law enforcement due to AI-generated misclassifications.

Table 1: The co-design user study includes 10 participants with diverse roles. All participants have experience in prompting LLMs. Four participants who self-reported having expertise in responsible AI are marked with asterisks (*).

As we are targeting AI prototypers with diverse experience in AI and responsible AI, Farsight should be easy to use and understand. When asked what would help them envision potential harms for their AI applications, many participants mentioned referring to prior examples of AI harms (U1, U2, U8). For instance, U2 said "Giving some specific real [harm] examples for different types of seemingly innocuous applications would help alert people [to consider harms]." Therefore, we aimed to integrate real examples in Farsight to motivate and help users understand the potential risk of their applications.
1) presents an always-on symbol that shows the approximate alert level of a user's current prompt; the Awareness Sidebar (§ 4.2) highlights news articles about related AI incidents and LLM-generated use cases and harms; and the Harm Envisioner (§ 4.3) visualizes LLM-generated harms and allows users to edit, add, and share harms. Examples in this section use the PaLM 2 model through its API; we chose this model because it provided free API access to the public during our design process. Researchers and designers can easily replace the PaLM 2 model with other LLMs by changing the API endpoints in Farsight.

Table 2: The evaluation user study included 42 participants with diverse roles and experience in prompting LLMs.
"I'm glad that I got to know about them [potential misuses]. But I feel I'm vulnerable, probably because I can't do much about stopping them. So that's something that really makes me feel very disappointed. Because unless we do case-by-case analysis, this [preventing misuses] can be very tricky. I feel like it's kind of adding anxiety to me. It's good to know, but I feel like I can't do much about it." (P29)
We did not incorporate harm mitigation into our tools, because mitigating harms associated with LLM-powered applications remains an open research question (see more discussion in § 7.3). After the evaluation study, we improved Farsight by providing pointers to existing LLM safety resources [e.g., 6, 71, 117, 164] when users exported their harms.
Stakeholders of the decision: If the communication is about decision making and the rewrite contains error, it can harm the stakeholders of the decision.
Bullies: Teachers use it to identify students being bullied. The AI may make the teachers be biased against the bullies.
Employees of the company: HR departments use it to screen job applicants for toxic behavior.
Hate group targets: Hate groups use it to identify potential recruits. Hate groups may be able to recruit more people due to the AI tool.
Anyone reading the document: If the document is sent to anyone else, the summary may not well-represent the document. The readers of the summary might misunderstand the original document.
If the summary is wrong, people would lose confidence in the company. 3, 4, 4 2, 4, 3
If there are two facts in the article that are true independently, then it's possible that the summary combines them where the statement is no
Editors: Journal readers use it to get explanations of equations in inferential statistics sections. Journal editors may worry that the quality of their journal declines because the AI feature makes too many errors.
Minority social groups: There are different ways to phrase things differently based on some social groups. If the user asks a question in non-profession English or non-English, it can cause alienation.
Engineers use it to explain complex math concepts to non-technical stakeholders. Non-technical stakeholders may be misled by AI-generated explanations of complex mathematical models.
Students: Students use it to learn math. Students may feel like they are losing their ability to learn about math problems.
Students: Students use it to learn math. Students who do not use this AI product may feel like they are not getting the same quality of education as their
Students use it to learn math. Students who do not use this AI product may feel left behind by their peers who do.