Dungeons & Deepfakes: Using scenario-based role-play to study journalists' behavior towards using AI-based verification tools for video content

The evolving landscape of manipulated media, including the threat of deepfakes, has made information verification a daunting challenge for journalists. Technologists have developed tools to detect deepfakes, but these tools can sometimes yield inaccurate results, raising concerns about inadvertently disseminating manipulated content as authentic news. This study examines the impact of unreliable deepfake detection tools on information verification. We conducted role-playing exercises with 24 US journalists, immersing them in complex breaking-news scenarios where determining authenticity was challenging. Through these exercises, we explored questions regarding journalists’ investigative processes, use of a deepfake detection tool, and decisions on when and what to publish. Our findings reveal that journalists are diligent in verifying information, but sometimes rely too heavily on results from deepfake detection tools. We argue for more cautious release of such tools, accompanied by proper training for users to mitigate the risk of unintentionally propagating manipulated content as real news.


INTRODUCTION
In March 2021, Voice of Myanmar News published a video of the Chief Minister of Yangon, Phyo Min Thein, confessing to bribing the deposed State Counselor Aung San Suu Kyi. The video made the rounds on Myanmar social media accounts as likely being doctored [48,86]. Indeed, publicly available deepfake detection tools indicated that the video was fake, with one showing 98% confidence [71]. Soon after, expert analysis contradicted this verdict, pointing out that even though the confession itself may well have been forced, the video of it was unlikely to have been a deepfake [40]. Compression artifacts in the published video could have confused the detection tool, giving the users a false verdict. This incident illustrates a disconnect within the current ecosystem of manipulated videos and the Artificial Intelligence (AI) systems used to detect them.
Deepfakes are videos manipulated using deep learning (DL) technology so that a person can be shown saying things they never said. Although the term has recently evolved to additionally encompass audio and image manipulations, the focus of our work is on videos. The current publicly available deepfake generation techniques [47,58,67,89,97] can yield convincing, high-quality results. A tech-savvy individual could learn how to use these techniques from online forums. Then, with a high-powered consumer GPU, like the ones used for gaming, they could produce a very high-quality deepfake in just a few weeks. Since deepfakes can be used to make anyone say anything the creator wants, the potential to create misinformation is clear.
Unfortunately, current deep learning-based systems are unstable in open-world scenarios [20,23,55,80], and deepfake detection systems are no exception [27]. This issue has led to very confident false-positive or false-negative deepfake detection results in deployed tools. Additionally, researchers have shown that it is possible to generate adversarial examples that trick deepfake detection models into providing inaccurate results [44,46,79,95]. Errors like these could mislead users, creating false suspicion about real videos - as with the video of Phyo Min Thein - or misplaced confidence that a fake video is authentic.
Journalists have been adopting various tools to help them verify information and the authenticity of video, audio, and images [10,11,53]. Large media organizations typically have a variety of protocols in place for the thorough analysis of each type of medium. Given that deepfakes are still a new manipulation threat, however, there are very few reliable detection tools. Prior research [10,102] indicates that journalists would use video verification tools as part of their arsenal to aid them in their work. Unlike traditional verification tools, however, deepfake detection tools rely on potentially unstable deep-learning methods and output unreliable confidence estimates. Such tools thus create new challenges for journalists that other tools did not, and the impacts of these challenges have not previously been studied. Moreover, current detection tools imitate the workflow of anti-virus software, providing positive-versus-negative detection results, which may not be ideal for the use case of media forensics. It is therefore important to study how journalists would interpret results from this new generation of tools for breaking news in a typical newsroom setting, given the pressure for fast publication.
Thus, in order to better understand the potential contributions of deepfake detection tools to journalistic workflows, our work addresses three aspects of their usage: (1) Perception. There is a lack of research into journalists' perceptions of deepfake detection tools, which have become increasingly available over the past three years. Given that these tools are new and different from traditional verification tools, it is important to understand journalists' readiness to adopt them. Thus, we seek to understand journalists' perception of deepfake detection tools (RQ1). (2) Workflow changes. The potential effects and the importance of deepfake detection tools may vary depending on when journalists choose to use them. Literature on psychology and AI tools points to anchoring bias [90], a cognitive bias where people favor information received early in the decision-making process. Hence, we want to observe when journalists would opt to employ deepfake detection tools and why (RQ2).
(3) Overreliance. The Myanmar incident showed how unreliable deepfake detection tools can cause serious harm by misleading users. Previous studies have also pointed out the risk of automation bias [62], where users tend to trust the output of decision-making tools too much. This could have disastrous effects if journalists publish fake videos as real news or dismiss real videos as fake. Therefore, we want to investigate how much journalists rely on the tools (RQ3).
We wanted to design a study that would let journalists verify information in their own way, since different journalists have different verification methods [99]. Typical Scenario-Based Design [15] would not have been sufficient to achieve that. Thus, we took inspiration from the immersive and flexible environments of the Dungeons & Dragons tabletop game to design a semi-structured, scenario-based role-playing exercise. Through this methodology, participating journalists are placed in a high-stakes newsroom scenario and asked to verify content containing a video, with access to deep-learning-based deepfake detection tools and, most importantly, free rein to create their own actions. While this greatly increases the complexity of the study, it allows participants to act out their behaviors in a more natural environment, as opposed to structured questions, where their answers might be less likely to align with their true behavior. To the best of our knowledge, this is the first research study of its kind to assess the verification behaviors of journalists interacting with AI tools. This study design allowed us to carry out the study in both online and in-person settings while remaining in control of the interactions.
Through our interviews with 24 US journalists, our key findings were:
• Journalists expressed positive views of deepfake detection tools and looked forward to using them, but wanted more explanations of the results. Their trust in the tools may depend on the perceived reputation of the developers.
• Journalists tended to rely on traditional verification methods first to establish context, and turned to deepfake detection tools when contextual verification was difficult or time was limited. The tools were used more when videos made very bold or unusual claims.
• Events with high social or political impact encouraged more verification steps from journalists, while urgent events sometimes led them to skip steps in favor of quicker publication.
• A few journalists showed possible overreliance on the tools, especially when results confirmed their initial impressions, due to various biases. Trust in the tool seems to depend on previous experiences with it: while a positive experience may increase overreliance, bad experiences may instill diligence.
• The scenario-based role-play methodology proved engaging for participants and shows promise for training journalists on issues around manipulated video and improving verification skills.
These findings reveal the importance of improving the explainability of deepfake detection tools and of training journalists in their nuanced interpretation as these AI assistance tools see wider adoption.

LITERATURE REVIEW

Deepfakes
Deepfakes, a portmanteau of deep learning and fake [105], first came into the spotlight in 2017. At the time of writing, the majority of fake videos on the web are either non-consensual pornography [77] or entertainment parodies. These examples draw a significant amount of negative attention to this technology. There are also positive uses, however, in arts, education, and therapy, including the Dali Museum interactive exhibit [61] and the David Beckham multilingual malaria campaign [83]. The focus of our work is on the malicious use of this technology to create and spread misinformation. The potential to destabilize our societal norms with manipulated video and audio content has been a point of discussion in various fields [16,39,51,77,108]. Silbey and Hartzog suggest that deepfakes provide an opportunity to reform our education system, journalism, and democratic systems to become more resilient to fake media [98].
The Russia-Ukraine conflict in 2022 saw the first use of malicious deepfakes created to affect a war. Two fake videos were circulated online, one of each of the two opposing presidents announcing that his country would surrender [35,36]. More recently, deepfakes have been used in the 2024 U.S. Presidential campaign, both by unknown entities and by the campaigns themselves [73,106]. Even the mere existence of deepfakes can erode trust and lead to the spread of misinformation. In Gabon in 2019, armed soldiers who believed that the New Year's presidential address video was a deepfake attempted a coup, even though the video was probably authentic [14]. The Myanmar video we mention in the introduction is also a recent example of this effect.

Verification Systems in Journalism
Journalists and independent fact-checkers have long used various tools to verify information. Brandtzaeg et al. [10,11] studied the practices and perceptions of a young generation of European journalists towards verification tools. They mentioned tools and services available for photo and video verification such as Google Image Search, TinEye, Exif, Topsy, Tungstene, Google Maps, Streetview, YouTube videos, and Storyful. Many of these tools were also mentioned in a study on the requirements of deepfake detection tools by Sohrawardi et al. [102]. All three studies involved a semi-structured interview format, which appeals to the conversational behavior of journalists. One of the studies by Brandtzaeg et al. [10] included social media users alongside journalists, and both groups had mixed feelings about online fact-checking tools. A very recent survey by Khan et al. [53] listed some of the more prevalent tools in the journalism world and discussed their limitations. In contrast, our study aims first to observe the behavior of journalists when presented with a task, rather than relying solely on questions and answers.
For video verification, the InVID Project [104] is a well-known tool that allows users to reverse search frames, alter saturation and brightness, and view the impact of a video on social media. It is often used to verify context or find original videos, but it does not possess deepfake detection capabilities. For deepfake detection, publicly accessible options are Deepfake-o-meter [65], Deepware [24], Sensity [5], and Reality Defender [25]. A more restricted and user-centered option is the DeFake tool [102], which resulted from academic research with journalists and is available only to that target population and to researchers. Since its interface was designed with journalists in mind, it made sense to use it in our studies. We recreated the interface prototypes from the DeFake paper and altered them to match the context of the scenarios in our study.

Overreliance and Confirmation Bias in AI Tool Utilization
A pressing concern in the realm of deepfake detection tools is their potential to inadvertently increase confirmation bias [84]. This bias is characterized by the inclination to favor information that aligns with one's existing beliefs, often at the expense of contradictory evidence. Rastogi et al. [90] posited that such biases, coupled with overreliance, can arise when users stop looking for alternative evidence or viewpoints upon receiving machine-generated output. This observation is particularly relevant to our study.
Lee et al. [62] introduced the concept of automation bias [76] within the AI milieu, highlighting the pitfalls of suboptimal human-algorithm collaboration. Drawing parallels with Wason's rule discovery experiment [109], it is evident that forensic investigators [75] are not immune to confirmation bias. This susceptibility is glaringly evident in the realm of deepfake videos, as observed in public reactions to a manipulated video of President Trump [107].
Numerous studies in the field have explored AI-assisted disinformation detection [9]. Although these approaches introduce automation bias, a potential remedy lies in incorporating explanations, as demonstrated by Epstein et al. [34]. Their research showed that explanations enhanced American internet users' ability to distinguish between genuine and fabricated news. However, it is imperative to exercise caution in the handling of these explanations, as they have the potential to inadvertently promote biased behavior [90].
While Pennycook et al. [88] demonstrated that individuals with heightened analytical thinking are less prone to such biases, the Myanmar incident [86] underscored this vulnerability even among seasoned journalists. One plausible explanation for this behavior is limited exposure to, or comprehension of, the limitations inherent in AI-driven tools. A review of tools employed by journalists [53] reveals scant adoption of AI-based tools, with the few in operation being recent introductions.
Historical studies on fingerprint analysis [100,103] have shown that pre-emptively providing participants with background information, or priming them, can inadvertently introduce bias. This suggests that even the most adept forensic experts are not entirely impervious to bias [59]. Byrd's work [13] offers insights into the myriad biases in forensics, which can be extrapolated to our context. In our research, we aim to provide journalists with pertinent background information to discern whether the context might inadvertently induce an overreliance on deepfake detection tools.

Dungeons & Dragons Tabletop Game
Dungeons & Dragons (D&D) is a popular tabletop role-playing game where one person serves as a Dungeon Master (DM) and others play as characters in a medieval fantasy world. The DM creates and narrates the adventure, while the players openly decide what their characters do and how they interact with the world. The game uses dice and rules to determine the outcomes of actions and events. The gameplay mechanics of D&D are based on the core rulebooks: the Player's Handbook, the Dungeon Master's Guide, and the Monster Manual [92]. These books contain guides on how to create characters, as well as basic rules for combat, magic, and exploration.
The role of the DM is to be the game's lead storyteller and referee. The DM is responsible for preparing the adventure, describing the scenes and locations, playing the roles of non-player characters (NPCs) and monsters, adjudicating the rules, and keeping track of the game state. The DM also has the authority to improvise and modify the adventure as needed, depending on the players' choices and actions. The DM's goal is to make the game fun and engaging for everyone, while also challenging the players and creating a sense of immersion and wonder.
The immersive nature of the game, its flexible interaction choices, and the role of the DM are the three core features we adopted from D&D to develop our user studies.

Software Analysis through Participation in Scenarios
Software analysis and design can be approached in myriad ways. Among these, role-play and scenario management stand out as interactive and often gamified methods. Parson [87] delineated four models in which simulation gaming could guide decision-making on intricate policy issues. He highlighted the potential of simulations either to promote creativity and insights or to integrate knowledge. Our study resonates with the latter, as we aim to gauge the early implications of integrating AI-driven deepfake detection tools into journalistic processes.
Constructing semi-realistic scenario-based studies can be labor-intensive, but their value is undeniable, as evidenced by Eden et al. [29]. In qualitative role-play studies, typically conducted in groups, participants assume designated roles or personas and enact them throughout the study [28,50]. Conversely, scenario management situates individual users within specific scenarios. Jarke et al. [49] offered a comprehensive review of scenario management across disciplines, including human-computer interaction. They championed a methodology for scenario development, underscoring the advantages of creativity, contextual awareness, and flexibility. Carroll [15] posited that scenarios should furnish a rich narrative, enabling participants to discern contextual implications. Munroe et al. [78] stressed the significance of high fidelity and minimal instructions to encourage genuine scenario interactions. They also highlighted the utility of trigger events to monitor behavioral shifts, an element we integrated into our study to amplify urgency.
Recent AI research has also spotlighted scenario-based design and role-play [8,66,111]. While Wolf et al. [111] adopted a theoretical stance, generating scenarios from prior research, our approach, akin to Eiband et al. [33], is more participatory. However, our focus is solely on the evaluation phase, unlike Eiband et al., who engaged users throughout the design process. A recurrent critique of these methodologies is their limited replicability. To address this, Geerts et al. [38] repurposed the Serious Game Design Assessment (SGDA) framework [72] to evaluate and guide the design of research games - games used as playful methods in HCI research to engage participants and collect user insights. The framework breaks the game down into Purpose, Content, Mechanics, Narrative, Aesthetics, and Framing, with an emphasis on cohesiveness and coherence between the pieces. The authors found that the framework supports a more systematic development of the game. Given that our study can be categorized as a serious research game, we used the SGDA framework post hoc for a comprehensive evaluation. Paired with our description in this paper, we hope that this improves reproducibility, keeping in mind the context of the study.
Our approach is novel in its use of scenarios to understand the design of media verifcation tools for journalists.Consequently, we have encountered unique challenges in crafting realistic and demanding scenarios centered on video content verifcation.

Education and Training
Prior work has delved into the efficacy of gamified techniques for education and training. The What.Hack training program [110] demonstrated heightened awareness of phishing threats among corporate personnel. Similarly, initiatives like FakeYou! [21] and MAthE [52] employed analogous strategies to bolster public defenses against disinformation. Landers et al. [60] found that gamified methods resonate particularly with those familiar with video games. Armstrong et al. [6] posited that melding gamification with instructional design tenets can enhance learning outcomes. Building on this foundation and leveraging the SGDA framework [38], we believe that our methodology could be refined and repurposed to train journalists in dealing with fast-evolving and complex types of disinformation. This would empower journalists to hone their media verification skills in simulated settings featuring possible deepfakes.

METHOD

Sample and Consent
The study involved 24 participants. As shown in Table 1, nine participants engaged in individual sessions, while the remaining 15 participated in group sessions. Individual interviews spanned September 2020 to November 2021, with group sessions conducted in June 2022. All participants were active journalists with diverse experience in news publishing, working on various beats (topics) that include tech, politics, health, and entertainment. While all participants were familiar with deepfakes, their understanding of the term varied; hence, we used a pre-briefing (see Section 3.2) to give everybody a similar starting point.
Recruiting journalists for hour-long research sessions posed a challenge, which contributed to our modest sample size. However, through strategic outreach, leveraging personal contacts, assistance from our funding agency, and snowball sampling, we were able to assemble a diverse group of journalists representing both local and national news agencies. Each participant brought several years of journalism experience to the table, and all had dealt with digital multimedia in their professional roles, albeit to varying extents. This approach allowed us to build a representative sample, capturing a spectrum of verification habits from journalists across the United States.
The study and its methodology received approval from the Institutional Review Board (IRB) at the Human Subjects Research Office of the Rochester Institute of Technology. Participants were informed about the study's focus on news verification concerning deepfakes and their detection. Prior to the study, participants received an Informed Consent document containing two sections: one seeking consent for participation and an optional section requesting permission to disclose their names. The latter was not solely for academic publication but also for discussing their participation in other mediums.
Upon conclusion of the session, participants were debriefed about the unpredictable nature of deep learning tools and the imperative of thorough verification when utilizing them.
While potential risks, albeit minimal, may arise due to the potential loss of anonymity, especially in terms of employment implications, it is noteworthy that the majority of the participants willingly provided consent for the use of their names. The primary objective of this research is to gain a deeper understanding of these tools and to offer guidance for the development of a reliable deepfake detection tool that empowers journalists to validate media sources, thereby enhancing the quality of their reporting. Such a tool holds particular value for journalists operating without the support of a dedicated fact-checking team. Given journalists' inherent appreciation for the importance of anonymity, they are well-positioned to assess the risks and benefits of participating in such studies.

Table 1: Anonymized List of Study Participants. Ver indicates whether the participant regularly engaged in verification of digital multimedia. Beats Legend: bus - Business, cul - Culture, cyb - Cybercrime, dis - Disinformation, edu - Education, env - Environment, hlt - Health, jst - Justice, pol - Politics, soc - Society, spo - Sports, tec - Technology.

Study Structure
Our studies were conducted using a semi-structured, scenario-based role-play approach, spanning one-hour sessions. Drawing inspiration from D&D, we integrated its fictional scenarios and the flexible control of the DM. Following the left side of Figure 1, participants were placed in a series of carefully designed newsroom scenarios mirroring real-world situations, each accompanied by a piece of newsworthy information and a video description.
Pre-Briefing. During the pre-briefing sessions, we introduced the concept of deepfakes and familiarized participants with various verification tools, including those designed for deepfake detection. We also discussed prominent instances of deepfakes in the media. These tools were presented as cutting-edge AI solutions capable of accurately identifying manipulated videos. It is important to acknowledge that this introduction might have influenced participants in favor of utilizing deepfake detection tools, potentially amplifying concerns about the prevalence of deepfakes. Despite this, it is worth noting that, at the time of our research, promotional materials for these tools often portrayed them optimistically, and many news articles on deepfakes tended to adopt a sensationalist tone, predicting significant societal impacts on the way media is perceived.
Introduction to Scenarios. Participants were briefed on their role, typically that of an investigative journalist at a prominent news organization, as shown on the top right of Figure 1, though roles could vary based on geographic context. The scenario's backdrop was then presented, offering information for verification and potential publication, supplemented by a video description. Subsequently, participants were asked to assess the trustworthiness of the source of the information and its potential newsworthiness using a three-point Likert scale. We emphasized the low stakes of these ratings, aiming to capture their impressions of the scenario's characters and quality. The introduction culminated in an initial verdict, or step 0, where participants rated the scenario on a 5-point Likert scale ranging from "Real" to "Fake". This initial verdict aimed to capture participants' preliminary perceptions and monitor for potential primacy effect biases [13].
Verification Activity. Central to the study was the step-wise verification activity. At each step, participants were presented with a grid of potential actions to help guide their content verification process. When selecting an action, they received the corresponding response. Depending on the number of steps taken, an event might be introduced, elevating the scenario's urgency. Regardless of whether or not an event was triggered, participants then revisited the Likert verdict scale, indicating any changes in their judgment. This cycle was repeated until the participants decided to publish or dismiss the article based on their findings, after which the participants were notified whether or not the video was a deepfake. Although the participants were presented with an initial set of actions, they were made aware that these actions were provided to help them make their decisions and that they were free to invent their own actions at any point.
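For concreteness, the cycle can be summarized as the following minimal sketch, with scripted stand-ins for the participant and the DM; the action names, responses, and verdict values shown here are illustrative assumptions, not our actual study materials.

```python
# A minimal sketch of the step-wise verification cycle: action -> DM
# response -> verdict update, repeated until a publish/dismiss decision.

def verification_cycle(dm_responses, participant_turns, initial_verdict=3):
    """Replay one scenario as a list of (action, response, verdict, decision).

    Verdicts use the study's 5-point scale, 1 ("Real") to 5 ("Fake");
    decision is "continue", "publish", or "dismiss".
    """
    history = [("step 0", None, initial_verdict, "continue")]
    for action, verdict, decision in participant_turns:
        # Predefined actions have scripted responses; invented actions
        # would be improvised live by the DM.
        response = dm_responses.get(action, "(DM improvises a response)")
        history.append((action, response, verdict, decision))
        if decision != "continue":  # publish/dismiss ends the scenario
            break
    return history

responses = {
    "contact_source": "The source repeats the claim but shares no raw footage.",
    "deepfake_detection_tool": "Aggregate score: 98% fake; audio suspicious.",
}
turns = [
    ("contact_source", 3, "continue"),
    ("deepfake_detection_tool", 4, "continue"),
    ("reverse_image_search", 5, "publish"),  # publishes the story as fake
]
for row in verification_cycle(responses, turns):
    print(row)
```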
Debriefing. After completing three randomly selected scenarios, the participants underwent a debriefing session. This allowed for clarifications, discussions on the efficacy of deepfake detection tools, best-practice recommendations, and potential study extensions.
Role of the Dungeon Master. As in the original D&D game, the DM role played a pivotal part in our study. A robust understanding of journalistic information verification was crucial for the DM, given the predefined responses for many actions, with occasional live adjustments based on the step sequence. The DM's expertise became especially vital when participants opted to invent their own actions, requiring spontaneous response generation. Additionally, the DM injected a sense of urgency, emulating the high-pressure newsroom environment prone to "Honest Errors" - unintended errors made through decisions that were not malicious [13]. In our study, the researchers themselves assumed the role of DMs, ensuring a deep grasp of journalistic methods, scenario structures, and study objectives. If a volunteer were to take on the DM role, simplified guides - though less intricate than those for the original game - would be necessary.
Interactions with the Video. While the scenarios involve potentially manipulated videos, as discussed earlier, it is important to note that we only provide participants with textual descriptions of the videos. In a manner akin to the D&D game, participants' interactions with the video rely on detailed descriptions provided by the DM. This approach is driven by the resource-intensive nature of crafting a near-flawless deepfake. Our decision to refrain from sharing an actual manipulated video stems from a desire to avoid potential unintended artifacts that participants could point to in order to raise their suspicions while ignoring other pieces of information [41]. It is crucial to acknowledge, however, that a highly motivated adversary could feasibly possess the requisite resources to produce a high-quality deepfake for use in a disinformation campaign.

Figure 1: Top Right: Introduction slide with an avatar generated using a face generator. Bottom Right: A cropped presentation slide corresponding to the participant's selection of the action to use a deepfake detection tool. The response reveals identified manipulation in the visual frames using two distinct methods, collectively indicating 98% certainty, alongside predominantly pristine audio content. In this particular instance, the tool does not detect any inconsistencies in the synchronization between the audio and video signals.
Group Setting Adaptations. For 15 of our participants, the study was conducted in a group setting, with 3-4 participants at a time. This was necessitated by time constraints during in-person newsroom visits, but it also offered insight into collaborative decision-making in newsrooms. A significant change in this setting was the designation of a leader. Thus, although we promoted discussions amongst the groups during the activity, the leader assigned actions to the group members and made the decisions. After each member received the response to their action, a group discussion ensued, potentially revealing the "Conformity Effect" [13]. The leader rendered the final verdict, and the steps proceeded until a collective publication decision was reached. To ensure efficiency in the face of potentially extended discussions, we introduced a total time limit for the scenarios and updated participants regularly on how much time remained.

Development
Solo study participants were presented with scenarios via Zoom meetings using Microsoft PowerPoint. Slides were built for introductions, rating scales, action card grids, responses, and events. For group studies, we adapted these slides for in-person presentations in newsrooms.

Scenarios. Our finalized study design included seven fictional scenarios, as outlined in Table 3. These scenarios blended elements of contemporary events: five were set in North America (NA) and two were international (INT). Real-world names were used to encourage participant familiarity and set expectations for character behavior. Participants were assigned roles with fictional names and synthetic faces. The represented newsrooms, though fictional, were described in terms of scale, often drawing comparisons to real news organizations.

Table 2: Scenario NA1, based in North America, involving the presidential candidate Joe Biden. Note that not all elements are shown to the participants.

Character Introduction: You are an investigative reporter, Ezekiel Potter, at a large news organization. You have a team of 5 people for news gathering and verification. You receive content through your team or trusted sources and must make decisions regarding verification and publishability for each.

Referrer Information: October 28, 2020 | You receive an email from a news verification agency run by your former colleague (Mark): "Hi Ezekiel, We've been running through some videos that are getting a lot of traction online and stumbled on this one. The context and metadata seem to check out according to our team, and Mark thought it's more up your alley, so he suggested I forward it to you. And yeah, I know, the content did seem quite suspicious at first, especially with all the allegations. Regards, Dave."

Video Description: Biden's former campaign manager, Greg Schultz, claims that Biden has been severely ill for the past few weeks and has been going to rallies on painkillers. The video is slightly compressed, and Greg is visible from the shoulders above, sitting on what appears to be a couch.

Tool Verdict: Fake with suspicious audio.

[goals] Publish the story as fake; Publish the story as real; Do not publish anything.
These characterizations aimed to establish a foundation for the scenarios and potentially induce the "Primacy Effect" [13], influencing participants towards an early verdict if they did not maintain thorough and objective verification.
The stories constructed around these primary subjects were designed to have substantial societal and economic implications. The significance of the information, combined with the rapid spread of rumors on social media and the pressure for a large news organization to be the first to publish, could contribute to confirmation bias [30]. Table 2 illustrates one such scenario.
Our fictional adversaries in the scenarios employed more than just deepfake video manipulation. They combined deepfakes with cybersecurity attacks and propaganda tactics, including spear phishing, hacked Twitter accounts, spoofed email addresses, and compromised publisher websites. A notable example of such multifaceted misinformation campaigns emerged during the Russia-Ukraine conflict, which combined the hacking of TV and radio stations with deepfake audio and video [74].
Incremental development of our scenarios proved beneficial, allowing us to refine our narratives over time. Our goal with the scenarios was to ensure that they contained highly newsworthy information, making it worthwhile for the participants to go to great lengths to verify the content for publication. Subsequently, using feedback gathered through the 3-point Likert scale, as outlined in the Introduction, and during debriefing sessions, we discerned the scenarios that resonated most strongly with participants. This feedback highlighted the fact that some participants from major news organizations may have the resources to directly contact key figures like Elon Musk and expect a fairly prompt reply, foregoing many of the verification steps. Consequently, although we retained the Elon Musk scenario because his activities were newsworthy, we strategically prioritized scenarios featuring key figures whose involvement heightened the urgency for thorough verification. To avoid easy debunking, however, the DM had to come up with creative and logical responses. Moreover, insights from debriefing participant 4N prompted us to diversify our scenarios. Responding to the suggestion for scenarios with more restrictive and less trustworthy conditions, we introduced INT1 and INT2, situating them in distinct geographical regions.

Actions and Responses. Our initial set of verification actions and their corresponding responses was based on previous research [10,11,102]. As detailed in Table 2, actions were categorized into regular, technical, and goals. Through regular actions, participants interacted with scenario characters and performed contextual analysis. Technical actions delved into forensic video analysis using tools such as metadata readers, a deepfake detection tool [102], and the InVID video analysis tool [104]. The final category, goals, pertained to publication decisions. Actions were additionally given per-category action identifiers to simplify bookkeeping during the analysis phase.
Deepfake Detection Tool action responses took the form of screenshots of the DeFake tool [102], as shown on the bottom right of Figure 1. We recreated both the collapsed and expanded interfaces described in the original paper and adapted the results to match the scenario design, showing the collapsed interface first. Additionally, we included the Biometric Verification action from the paper, allowing the participant to request it if it was unavailable. In the scenarios, we marked all the tool verdicts as fake and varied the amount of fake content detected in each video as well as the fakeness of the audio signal. This decision was driven by the fact that the tool seemed to lean towards fake verdicts in practice; the expanded view with a detection timeline, however, helped in evaluating the content.

Table 3: List of scenario content introductions, stripped of character introductions, actions, responses, and events (ID | Referrer Information | Video Details).

NA2 | August 16, 2020 | Dylan, a team member from your news gathering team, points out this Tweet from Elon Musk's verified account and suggests that it could be an interesting story to go on. [Tweet] Stepping down as the CEO of Tesla after over a decade. Jérôme Guillen will be taking over the reins, and he has my full support as his resume speaks for itself. I will still continue to be involved in the day-to-day decision making process as the founder, but the decrease in company responsibilities will give me more time to spend with the family, something I have not been able to afford in recent years. | The announcement seems to have been made from an office space, with Elon looking straight at the camera. Tool Verdict: Suspicious with suspicious audio. Real Verdict: Fake

NA3 | Tool Verdict: Fake with fake audio. Real Verdict: Real

NA4 | October 20, 2020 | Your producer wants to run a video with a headline that says: "Biden is sick? His campaign manager appears to say so." She tells the news director about it, and he wants you to do a quick fact-check run before going live. He notes that the news is time-sensitive and they'd want to run with it soon, and have the anchors mention that the video has not been verified yet and say: "Is it real? You decide." In case verification isn't finished, he'd stick with the last part. | Biden's former campaign manager, Greg Schultz, claims that Biden has been severely ill for the past few weeks and has been going to rallies on painkillers. The video is slightly compressed, and Greg is visible from the shoulders above, sitting on what appears to be a couch. Tool Verdict: Fake with suspicious audio. Real Verdict: Fake

NA5 | October 24, 2020 | A competing highly reputable news organization, NewsPeople, publishes an article on a video of Dr. Fauci claiming that a vaccine already exists that is reserved only for POTUS and his family. The vaccine's release will be delayed until right before the elections to bolster the Trump administration's re-election chances. The administration blocked the publication of research related to the vaccine for fear that it would be developed into a cure abroad. Your editor wants you to write an article on it as soon as possible to get some of the early readership.

INT2 | ... Maung Zarni, are meeting 5 men at a shipping dock. Three men are armed. Maung Zarni opens a container. Armed men check the container. The interior of the container is not visible. The unarmed man gestures to one of the armed men, who hands off a gym bag to Maung Zarni. Maung Zarni walks out of the frame. Tool Verdict: Fake with fake audio. Real Verdict: Real
Inventing actions. The participants had the flexibility to invent actions. This allowed them to freely define their own verification steps, giving them relative freedom in their workflow. As we refined our scenarios, some of these invented actions were integrated into our base action list, like calling notable organizations or relevant characters.
Predetermined responses for all predefined actions took the pressure off the DM and ensured that they provided sufficient information while limiting access to pivotal characters. Some responses were designed to bias participants towards a particular verdict. Depending on the scenario's progression, the DM could adjust variables within these responses, like the number of shares on social media, to maintain a sense of urgency and a logical order.

Events. To generate a sense of realism and urgency, the DM strategically introduced events as participants progressed through the study. Each verification action was assigned an associated action cost, the accumulation of which dictated when an event would unfold. While the majority of actions carried a standard cost of one, those involving potential wait times exceeding a day were deemed to have a higher action cost. For instance, a biometric analysis request for a video subject incurred a cost of three, potentially necessitating an additional two days for completion.
These events primarily manifested as reactions on social media platforms, often marked by surges in retweets accompanied by context and metrics. Some events provided deeper insights by incorporating character reactions, such as denials or delays. The DM dynamically adjusted variables within these events, such as the volume of social media discourse, based on participants' prior actions, creating a responsive and adaptive narrative experience.
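A minimal sketch of this cost bookkeeping follows; only the standard cost of one and the biometric cost of three come from our design, while the remaining action names and the event threshold are illustrative assumptions.

```python
# Action-cost accumulation that determines when an urgency event fires.

ACTION_COSTS = {
    "contact_source": 1,          # standard actions cost 1
    "metadata_analysis": 1,
    "deepfake_detection_tool": 1,
    "biometric_analysis": 3,      # multi-day turnaround, per the example above
}

EVENT_THRESHOLD = 4  # assumed: an event fires each time this much cost accrues

def replay_costs(actions):
    """Yield (action, accumulated_cost, event_fired) for a sequence of actions."""
    total, next_event_at = 0, EVENT_THRESHOLD
    for action in actions:
        total += ACTION_COSTS[action]
        fired = total >= next_event_at
        if fired:
            next_event_at += EVENT_THRESHOLD
        yield action, total, fired

for step in replay_costs(["contact_source", "metadata_analysis",
                          "biometric_analysis", "deepfake_detection_tool"]):
    print(step)
```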

Analysis
Our data sources included Zoom recordings from online sessions, audio recordings from in-person group meetings, and handwritten notes from both. Zoom recordings were automatically transcribed, while in-person recordings were transcribed using Rev.ai (https://rev.ai). Encouraging participants to think aloud enriched our insights into their decision-making processes.
Our primary objective was to understand participants' approaches to verifying video-based information and their interactions with deepfake detection tools. With this in mind, we conducted a thematic analysis of the recorded data. First, the two authors independently went through the transcripts from the online and in-person interviews and developed their own initial codebooks using open coding. Then, the authors reconvened to share their codebooks and derive higher-level themes using axial coding [94]. During the collaboration phase, we merged conceptually similar themes and arrived at our final codebook. To enhance the accuracy of our analysis, we reviewed the original recordings while developing the codebook to avoid discrepancies arising from automated transcription errors. The multiphase coding process allowed us to discern nuanced behavioral shifts and expressed opinions. We particularly observed how detection tool results influenced participants' content validity judgments and their reliance on these tools. To visually capture the variations in participants' decision-making processes - specifically, the number of steps taken before and after employing the detection tools and the resulting shifts in their verdicts - we constructed step graphs, as shown in Figures 2 and 3, using the Deepfake Detection Tool action as a reference point.
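The alignment underlying the step graphs can be sketched as follows; the participant codes and action sequences shown are illustrative, not our actual data.

```python
# Align each participant's action sequence on the detection tool action
# to count steps taken before and after it.

def align_on_tool(sequences, tool="deepfake_detection_tool"):
    """Map participant -> (steps before tool, steps after tool),
    or None if the participant never used the tool."""
    aligned = {}
    for pid, actions in sequences.items():
        if tool in actions:
            i = actions.index(tool)
            aligned[pid] = (i, len(actions) - i - 1)
        else:
            aligned[pid] = None  # e.g., a participant who avoided the tool
    return aligned

example = {
    "P1": ["contact_source", "deepfake_detection_tool", "publish"],
    "P2": ["contact_source", "metadata_analysis", "reverse_search", "dismiss"],
}
print(align_on_tool(example))  # {'P1': (1, 1), 'P2': None}
```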

RESULTS
This work aims to gain insight into how journalists, as a specialized group of users, would approach news verification in the age of deepfake videos and detectors. We emphasize that the focus of this work is on assessing near-term effects, as access to deepfake detection tools is heavily democratized and developers rush to release their novel detection tools to the public. At the same time, we evaluate how overconfident claims and the lack of disclaimers about detector shortcomings affect users without much prior experience with the tools.

RQ1: Perception of journalists towards deepfake detection tools
At the time of the studies, most of the participants were unaware of the existence of deepfake detection tools, and some were unfamiliar with the InVID tool. This is not surprising, as Khan et al. [53] suggested that journalists often avoid tools due to a lack of technical knowledge. Participants 4N and 5N, however, were more accustomed to dealing with modern digital multimedia verification and were familiar with the various verification tools available to them. Participant 5N went as far as naming various other deepfake detection tools that were available at the time.
Participants had positive views on deepfake detection tools. All the participants indicated that having these detectors is vital and looked forward to using them. Echoing the findings from previous work with journalists [102], journalists expressed gratitude during the debriefing sessions for the research that has been going into deepfake detection and mentioned the necessity of these tool sets in their verification arsenal. One caveat that would often come up in the debriefing regarding these tools was the lack of explanations to substantiate the results.
Trust in the app may depend on the developer's reputation. Generally, the participants who used the tool felt comfortable receiving an answer from it. This may be attributed to the connection of the application to a neutral academic institution. When discussions about trust emerged, the participants mentioned that they would trust a tool developed by a neutral entity. Group 2GL mentioned that they would be more comfortable if they knew that the tool was developed by a university source with transparent development and developer details that they could verify. They wanted to ensure that the tool was not "part of an agenda" that could influence their opinions.
Positive experience with the tools may contribute to an increase in trust. If the tool provided users with correct responses in the past, there would be a higher chance of them trusting it. The participants from 1GL, in scenario NA5, chose to follow the output of the detection tool and made a publication decision based on it, because it had given them the correct answer in scenario INT1. On the other hand, more tech-savvy users may perform their own benchmark of the tool, something that participant 5N alluded to during the debriefing.
Time spent on the deepfake detection tool varies. The participants were presented with an interface for the deepfake detection tool [102], which displayed an aggregate score for the probability of the video being fake, as well as a more detailed view of individual scores for four detection methods per second of the video. Each participant spent a varying amount of time reviewing the results. Although every participant received a brief overview of the detection interface, experience with verification technology (4N, 5N) and group settings (1GL, 2GL) made participants pay more attention to the individual detection results. Most of the participants did not raise any questions about the results unless prompted by the interviewer.
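To illustrate the shape of such output, the following sketch shows one plausible result structure: an aggregate fakeness probability over per-second scores from four methods. The method names and the simple-mean aggregation are assumptions for illustration only, not the tool's actual internals.

```python
# Illustrative result payload: one fakeness score (0-1) per second of
# video for each of four assumed detection methods.
per_second_scores = {
    "method_a": [0.91, 0.95, 0.97],
    "method_b": [0.88, 0.90, 0.93],
    "method_c": [0.97, 0.98, 0.99],
    "method_d": [0.85, 0.92, 0.96],
}

def aggregate(scores):
    """Collapse per-method, per-second scores into one overall probability
    (a simple mean is assumed here purely for illustration)."""
    all_values = [v for series in scores.values() for v in series]
    return sum(all_values) / len(all_values)

print(f"Aggregate fakeness: {aggregate(per_second_scores):.0%}")  # ~93%
```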
Opinions regarding the inclusion of the tool's results in publications varied. When asked whether they would include the results of the deepfake detection tool in their publication, participants offered a wide spectrum of opinions. In many sessions, the participants mentioned the importance of the reputation of their own organization in deciding on the contents of their publication, as mistakes tend to tarnish trust. Most participants were keen on adding the results to their publication for transparency of their methods. 3N and 3GL, however, specifically pointed out that they would not, remarking that the tool only served their own organization's needs for verification. A participant from 2GL gave a different reason for not including the results: "I feel like audiences will get bogged down. They don't really understand what it is." As cautious as many journalists can be, another 2GL participant said: "If we feel like we have a reliable tool that we're basing it off, I would reference that," which captured the essence of many participants' thoughts on this issue. The public's access to the tool also came into question, with a few of the participants mentioning that it would be unproductive to include the results if the public could not test the tool for themselves.

RQ2: When and why deepfake tools are used
The study allowed us to take a closer look at the workflows of journalists in a simulated newsroom scenario and their verification behavior when encountering video material. Although there are various guidelines for newsroom content verification [31,32,43], studies like the one by Lewis et al. [63] suggest that, in practice, the verification process is less defined. Figures 2 and 3 show the disparity between participants in the number of steps used to verify content and in their order. The figures are organized using the Deepfake tool action as a reference, allowing us to observe the number of actions taken before and after it.
Journalists start with verification of context through traditional means. Unsurprisingly, most of the time participants chose to start with the traditional journalistic actions of contacting characters in the scenarios. Participants from larger national newsrooms would extend this further, using more steps before turning to the deepfake detection tools. They attributed this to having to adhere to more well-defined verification requirements than local newsrooms. In group settings, each step allowed three concurrent actions to be assigned. Even though the Deepfake tool action was usually included in the first step, the traditional actions were picked first. "Typically we would reach out to all parties first," said the leader of 2GL. Traditional approaches were preferred due to a mix of comfort through experience and the need to put together the context for the events associated with the video. As noted by participant 3L, "I'd want to minimize the value of the video."
Participants who were more comfortable using technology for verification exhibited goal-oriented behavior when using these tools. Their goal was to verify various pieces of context to form a more complete picture, thinking, "I want to verify whether X is actually in this location/on this day." As an example, 2GL, 4N, and 5N used metadata analysis to identify date and location, while the latter two participants and 4GL used image reverse search to find related news and videos.
Deepfake detection is used when context verification is difficult. The deepfake tool was often picked when contextual verification could not yield confident conclusions. However, in contrast to the goal-oriented behavior, the participants showed discovery-oriented behavior when opting to use deepfake detection tools. A majority of the participants demonstrated a 'let us see what this tool says about the video' mindset when deciding to use the deepfake tool. This inclination may be caused by their limited familiarity with the tool's functionality and performance. For instance, in scenario INT1, participants from group 2GL initially hesitated about the utility of using the deepfake detection tool, but ultimately decided to try it as their final action in step 2.
Figures 2 and 3 show that in both individual and group studies, the detection tool was used earlier in scenario NA5. The primary subject in scenario NA5, Dr. Anthony Fauci, makes a very bold and unusual statement in a video interview. "Honestly, for like a story of this magnitude, wouldn't you not just like do anything possible before publishing it?" said a participant from 2GL while discussing the possible actions, pointing to the magnitude of this news. Participant 2N was less interested in the results of deepfake detection tools in the first scenario they faced, saying that they would want to dissociate the video from the news mentioned in it. However, for the Fauci video, they used the tool early, stating that the odd behavior and the fact that the subject was alone in the video made manipulation more likely. Deepfakes most commonly take the form of single talking-head videos, so it is sensible that participants may be concerned when the behavior of the subjects is out of the ordinary.
It is worth noting that participant 4N completely avoided using deepfake detection tools in all of the scenarios. In the debriefing, however, they alluded to the fact that they would definitely have considered using the tools if the various contextual steps had proved fruitless. The participant talked about needing to develop various layers of evidence in order to come to a conclusion, part of which would have been results from not one but several detection tools, to get diverse opinions on the content.
Time was also a defining factor in choosing to use the verification tool. "No tool is perfect and I am trying to do everything I can to add more context allowing the reader to see the bigger picture," said 5N, admitting that technology-assisted solutions often lead to faster albeit incomplete results. In contrast, traditional methods were more tried and tested, but would often take longer to yield results, like a response from characters they tried to contact directly. Time is scarce in modern breaking news scenarios. As a member of 1GL stated during one of the scenarios, "We only got a little bit of time, so DeFake, we're gonna trust that."

RQ3: Potential for overreliance on tools
Overreliance due to cognitive biases on automation in human-AI collaboration has been the topic of many recent studies [37,90] and will continue to be as our interactions with AI-driven tools evolve. Similarly to these works, we touch on the potential effects of cognitive biases on the journalistic verification process in a hectic newsroom scenario. Overall, it was reassuring to find that three individual participants and three groups displayed a healthy amount of skepticism toward the performance of the deepfake detection tools. A nice example was a discussion amongst the participants in 3GL during scenario INT2 regarding the possible lack of effectiveness of the detection tool on a video from a surveillance camera feed where the faces were less clear.
Overreliance. The novelty and optimistic description of the deepfake detection tools in the pre-briefing may have introduced an automation bias [62], wherein participants display overreliance on the tool due to blind trust. While it is hard to pinpoint overreliance, we marked the instances (△) in Figures 2 and 3 where participants used very few contextual verification actions and decided on a final verdict within two actions of using the tool. In the session with 4GL, one participant showed extreme interest in the detection tool and said: "I didn't know that technology existed before just now. If I had access to it, I'd be running so much stuff through it."

Figure 2: Actions start at the top (white box) and move down, keeping the location of the deepfake verification tool action as a reference point. The action boxes contain the verdict Likert response (1: "Real" to 5: "Fake") after using the action. Publication boxes are marked with ✗ for a wrong verdict, ★ for reaffirmation, and △ for overreliance. The top box in each scenario denotes the initial verdict before verification. Please refer to Table 1 for the participant codes, and Tables 2 & 3 for the scenario codes.
Although it was nice to see a high acceptance rate of the technology among the participants, developers must be aware of users' high dependency on the technology. One example came from the session with group 1GL. Having seen that the detector provided an accurate result in the previous scenario, they then trusted the tool too heavily in scenario NA5, in which the tool actually provided the wrong result. Participants 1L, 2L, 1N, and 1GL showed signs of possible overreliance on the tool through their activity patterns and their reasoning as they talked about their actions. Right after using the detection tool in scenario NA1, participant 1L said, "That's looking pretty fake and with my hackles already up on this thing," and after one further step that happened to support their suspicion, they chose to publish the article. During the debriefing, participants suggested that visual cues indicating causes that may affect the robustness of the results would have been helpful.
Reaffirmation. While overreliance may stem from naivety and lack of experience, reaffirmation concerns confirmation bias and the "Primacy Effect" [13], a tendency to give more weight to information gathered early in the verification process. Given the participants' prior knowledge of the characters in the scenarios and the initial verdicts given in step 0, we can observe what happens when the initial verdict and the results of the tool are in agreement. If the detection result leads participants to revert to their initial step 0 verdict, allowing for the possibility that their judgment had changed during the process due to other verification actions, we interpret this as reaffirmation. We also assume reaffirmation if participants decide to publish as soon as the result of the tool agrees with their step 0 verdict, without further analysis. The Myanmar incident may exemplify this behavior, since journalists went in thinking that the video was manipulated, only to have their suspicions reinforced by the tool's results. Given our definition, the effects of overreliance and reaffirmation are not mutually exclusive and can co-occur. However, we thought it important to pay attention to reaffirmation, as it may cause participants to pay less attention to detail in the detection results. We identified reaffirmation in sessions 2L-NA3 and 1GL-NA5, denoted by ★ in Figures 2 and 3.
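Because the overreliance and reaffirmation criteria above are operational, a short sketch may help make them concrete. The Python below is a minimal illustration under our own assumptions: the Step representation, the action names, and the "very few contextual actions" threshold are all hypothetical, and the study's actual coding was done manually by the authors.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    action: str   # e.g. "contact_source", "detection_tool", "publish"
    verdict: int  # Likert response after the step (1: "Real" ... 5: "Fake")

def _first(steps: list[Step], action: str) -> Optional[int]:
    """Index of the first occurrence of an action, or None."""
    return next((i for i, s in enumerate(steps) if s.action == action), None)

def flags_overreliance(steps: list[Step]) -> bool:
    """Very few contextual actions, and a publish decision within two
    actions of using the detection tool (threshold is an assumption)."""
    tool, publish = _first(steps, "detection_tool"), _first(steps, "publish")
    if tool is None or publish is None:
        return False
    contextual_before = tool  # every step before the first tool use is contextual
    return contextual_before <= 1 and 0 < publish - tool <= 2

def flags_reaffirmation(steps: list[Step], initial_verdict: int) -> bool:
    """Tool agreement with the step-0 verdict followed immediately by
    publication, or a post-tool reversion to the step-0 verdict after
    the verdict had drifted away from it."""
    tool, publish = _first(steps, "detection_tool"), _first(steps, "publish")
    if tool is None:
        return False
    # Both verdicts lean the same way on the 1-5 scale (our simplification).
    agree = (steps[tool].verdict - 3) * (initial_verdict - 3) > 0
    drifted = any(s.verdict != initial_verdict for s in steps[:tool])
    reverted = drifted and steps[tool].verdict == initial_verdict
    published_immediately = publish == tool + 1 and agree
    return reverted or published_immediately

The sketch only makes the stated criteria explicit; it does not reproduce the manual, context-sensitive judgment used to place the △ and ★ marks in Figures 2 and 3.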
Uncertainty. Deepfakes seem to add uncertainty to the verification process. In many scenarios, participants opted to invent an action in which they published an article with informational content about the events that transpired and their verification progress, instead of a definitive verdict. The participants chose this action after extensive verification strategies did not yield the full picture they were hoping for, with the result of the tool sometimes contributing to more confusion. This action lets them report on an important video to inform the public while noting that its validity is still uncertain. For example, participant 2N in scenario NA5 maintained their suspicion that the video of Dr. Fauci was manipulated, even though several other verification steps pointed towards it being real. They chose to publish an article with their findings, stating that they had not yet been able to reach Dr. Fauci for a statement and so could not confirm the video's validity.
We observed that the results of the detection tool often added to participants' uncertainty when they contradicted the other verification steps. Sometimes a response from the detector could affect their perception of subsequent and even previous actions, and they could become suspicious of other characters in the scenario. For example, participants 5N and 6N hesitated to publish after several verification steps in which the deepfake tool contradicted other actions, preferring to send the video to expert groups. While this slows down publication, we believe it is a safe decision that preserves the integrity of the news.

Impact of the group verification setting.
The group studies gave the participants the advantage of multiple opinions and may have created a more natural newsroom environment. However, these interactions are a double-edged sword, as they may lead to the "Conformity Effect" [85], where the weight peers give to the detection tool's results becomes inflated.
All groups used the detector within the first two steps of their verification process. Team discussions before and after each step allowed the journalists to make mutual decisions about the actions. Also, since participants needed to select three actions per step, there would often be room left for the detection tool. Thus, we can assume that in team environments these tools would see greater use in parallel with traditional journalistic methods.
We note that when carrying out the group studies, we allowed each group to elect its own leader, who would be in charge of assigning actions and making final decisions. This works well when there is a comfortable dynamic between the participants. However, in some sessions, the preexisting hierarchy within the newsroom played a part. For example, in 3GL, the participant elected as leader was a junior employee, which affected their confidence when making final decisions.

Study Evaluation and Training
This study used a novel qualitative research methodology in which we tried to simulate the natural workspace of professional journalists by placing them in fictional scenarios that mirrored our world and then asking them to think out loud while verifying the content.
For an hour-long study, we hoped to make the procedures engaging enough to elicit interesting and unique discussions, and memorable enough for useful lessons to be learned.
Evaluation. While we draw inspiration from D&D, our study falls under the category of research games. Following Geerts et al. [38], we used the SGDA framework [72] to evaluate the study, compiling a diagram of the coherence and cohesiveness of our study, shown in Figure 4. The diagram shows to what extent different aspects of the game were strongly or weakly consistent and aligned with the other aspects, based on both the responses from the participants and our own analysis.
When analyzing the coherence and cohesiveness between all the SGDA elements, it was clear that the mechanics of the game, alongside the rest of the pieces, are strongly coherent with our aims. Throughout the study, we received positive feedback from the participants. They were consistently engaged, even more so in group sessions, where they could discuss the process among themselves. The only critical feedback came from one participant, who noted that the study could be tiring due to its intensity. The mechanics of picking actions, facing events, and dealing with consequences may sometimes be overwhelming given the realistic setting. However, this was a design decision we made to elicit more realistic responses.
The SGDA components of fiction/narrative and framing, together with content/information, worked well for the most part whenever the participants were in a more relatable scenario within the US. However, when a scenario placed the participants outside the US, the DM had to thoroughly describe the limitations, journalistic freedom, and government tendencies of the unfamiliar context. Additional context and clarifications often needed to be provided and repeated throughout the study.
Another pair of components with weaker cohesion is aesthetics/graphics and content/information, as we could not provide the participants with live videos to examine, but rather text-based descriptions of the content. Although this did not seem to limit engagement, future scenarios would benefit from either low-fidelity visual outlines of the content or high-quality generated videos.
Training. The groups were all local newsrooms, and all felt that deepfakes would have less impact on them, as they often record their own videos and seldom source them from social media. However, one of the groups had faced a recent incident in which miscontextualized information from a social media source made it into their news broadcast. Hence, the danger from a targeted deepfake video may not be far-fetched. Thus, there is a need for deepfake awareness training for local newsrooms, a statement echoed in the study of journalists by McClure Haughey et al. [69].
What.Hack [110] showed that gamified role-playing training systems can improve office Internet-safety behavior by improving participants' resilience to phishing emails. Although our study requires an active DM to provide live responses, making it harder to use as a drop-in training module for an organization, we were curious what our participants thought about its potential as a training simulation for journalists. During our post-study debriefing sessions, all participants agreed that newsrooms would benefit from a gamified version of the news verification practice used in our work. Participant 2N mentioned: "This is very fun and I could see this being used in my newsroom." These positive responses suggest a direction for future work in developing training methods for investigative journalists.

DISCUSSION
The Dungeons & Deepfakes study primarily follows a qualitative approach, with minimal incorporation of quantitative elements. Through this study, we engaged participants in an exploration of their thought processes while performing a familiar verification task, offering valuable insights into our research questions.
Journalists want deepfake detection solutions. The participating journalists displayed a high level of receptivity towards the tool and expressed a keen interest in incorporating it into their verification toolkit. It became apparent that participants with greater experience in digital multimedia verification, particularly with other technologies, exhibited greater comfort and discernment when using the deepfake detection tool. Some participants even discussed conducting their own assessments of the tool to gain a more comprehensive understanding of its capabilities.
Experience and explanations reduce overreliance. While the results were not catastrophic by any means, some participants displayed a considerable degree of trust in the outcomes produced by deepfake detection tools. Those with higher technological proficiency had a smaller chance of overrelying on the tools' output. When verifying digital media with technology, it is vital to spend sufficient time analyzing the results to overcome biases [90]. Hectic breaking-news scenarios may limit the time available, so it is imperative that developers provide explainable findings that journalists can use to verify the results.
Reliance on tools is inversely proportional to the ability to independently verify a video. The findings in Section 4.2 highlight that seasoned journalists initiate verification through contextual measures. They reach out to characters in the story and try to construct a comprehensive narrative. Often, compiling information from diverse sources renders the video itself inconsequential, as the news it conveys can be independently verified. However, when the trustworthiness of responses from characters within the scenarios diminishes, reliance on verification tools correspondingly increases. In difficult scenarios such as organized, state-funded disinformation campaigns and personal blackmail videos, where assembling a coherent narrative proves exceptionally challenging, deepfake detection tools will be increasingly required.
Training will help empower newsrooms while developers catch up. While researchers and developers tackle the lack of robustness, usability, and explainability in forensic tools, it is necessary to alleviate the dangerous impact of deepfakes and unreliable detection tools on contemporary news. To that end, it is essential to train newsrooms to understand the utility, strengths, and weaknesses of verification tools. They must be able to use these tools with an awareness of their potential pitfalls and without having to learn on the fly under deadline stress. Given the participants' lack of exposure to deepfake detection, it is not surprising that a substantial number of them chose either not to publish anything or to release articles reporting only on verified facts while continuing their verification processes.
Scenario-based role-play is a viable, though complex, user study methodology. The strategic implementation of a game-like role-play experience injected enthusiasm into the study while efficiently extracting valuable insights. It is not simple to design and execute, however. Role-play studies, known for their complexity, demand meticulous curation of roles and scenarios. Our design, inspired by D&D, further necessitates the creation of thoughtful stories, actions, responses, and events, along with a DM possessing sufficient domain knowledge to respond effectively to participants' inventive actions. Nevertheless, our results affirm the effectiveness of our methodology, and the positive feedback received from participants hints at the potential extension of this approach into a training module to address the issues illuminated in our findings.
Broader Impacts on the Public. Assuming that reliable and efficient deepfake detection, though potentially achievable, is still several years away, platforms hosting potentially manipulated information may not yet have the necessary solutions in place to filter or tag such content. Furthermore, considering the fragility of detection tools, releasing them for public use could pose dangers, as any errors, coupled with confirmation bias, may lead to significant controversies, as evidenced by incidents in Myanmar [48] and Gabon [14]. The responsibility for informing the public about the truthfulness of content in a reasoned manner largely falls on journalists and fact-checkers. To that end, our work points to the importance of ensuring that they are well informed of the strengths and limitations of the evolving deepfake detection tools and are equipped with best practices to enhance their verification efforts. While human error is inevitable, it is essential to keep mistakes made by the automated technologies designed to assist them from making matters worse.
Limitations. The study recruited only participants based in the US, meaning that the findings might not reflect the practices and opinions of other journalists, especially those from different journalistic cultures. However, the participants come from several media organizations of different scopes and sizes, providing us with varying thought processes and practices. For our study samples, we focused only on the scope of the organization (national versus local) to separate behaviors. Other attributes like gender, age, experience, and geographic location could yield different results in future work.
The study procedures are intentionally designed to bias participants and thwart some of their typical practices. This was necessary to construct meaningful and challenging scenarios in which the deepfake detection tool could be seen as helpful, even though it was mostly new to the participants. It suggests, however, that our scenarios are more challenging than those journalists might typically face, even when deepfakes are involved. In that sense, the design of a deepfake detection tool should not rely too heavily on the results of this study and should also consider related work [101].
Section 4.3 relies on observations of the verification process to evaluate the effects of the deepfake detection tool on participants' thought processes, where we classified less desirable behavior as potential overreliance or reaffirmation with respect to the tool. While we believe our conclusions are reasonable, future studies would benefit from having the DM identify these behaviors in real time and ask participants to elaborate on their decisions during debriefing.
Finally, as alluded to in Section 3.2, we did not use actual videos in our studies. This prevented participants from applying their expertise in visual verification to examine the videos. Despite the challenge of visually identifying manipulations in high-quality deepfakes, where most people are not reliable [12,41,56,96], our studies specifically involved journalists, a more seasoned group of users. Journalists and fact-checkers often leverage their experience to interpret body language and detect anomalies. On the other hand, awareness of deepfakes, particularly in the context of our study, might lead participants to suspect that a genuine video is fake due to compression artifacts or lighting issues. While our study does not directly address this phenomenon, exploring experts' ability to discern deepfakes and generated media, akin to the work conducted by Ask et al. [7], could extend our methodology and mitigate this limitation.
Safeguards. While the focus of this study was not the development of safeguards but rather a creative assessment of the dangers, our findings in Sections 4.2, 4.3, and 4.4 showed that journalists can misinterpret the results of deepfake detection tools and overrely on them. Thus, we argue that AI-based media forensics tools must help journalists understand how the tools may be inaccurate. We propose the following to safeguard journalists and other users of AI-based forensics tools:
(1) Add clear warning messages alongside detection results, alerting users to possible inaccuracies and the reasons for them.
(2) Add tooltip-driven onboarding for first-time visitors to attune them to the important pieces of the application, including warnings. Mets Kiritsis [70] found tooltip-driven onboarding to be more effective than other popular onboarding methods.
(3) Provide training materials or sessions to organizations and verification groups that could use the tools, including both contextualized warnings and examples of correct and incorrect results.
While further development to improve robustness and domain-specific explainability continues, the above improvements should reduce the unintended effects of this assistive technology.
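To make safeguard (1) concrete, the following is a minimal sketch of how a result screen could attach contextual warnings to a raw detector score. The field names, thresholds, and wording are our own illustrative assumptions, not those of any deployed tool.

def render_result(fake_confidence: float, video_meta: dict) -> str:
    """Pair a raw detector score with warnings about conditions known to
    degrade reliability (fields and thresholds are illustrative)."""
    warnings = ["Confidence scores are estimates, not verdicts; corroborate "
                "them with contextual verification."]
    if video_meta.get("heavily_compressed"):
        warnings.append("Heavy compression can cause false positives; seek "
                        "the original upload if possible.")
    if video_meta.get("resolution_px", 1080) < 480:
        warnings.append("Low resolution reduces detector reliability.")
    lines = [f"Detector output: {fake_confidence:.0%} likelihood of manipulation."]
    lines += [f"WARNING: {w}" for w in warnings]
    return "\n".join(lines)

# Example: a compressed social media re-upload scoring 98% "fake"
# would surface the compression caveat next to the score.
print(render_result(0.98, {"heavily_compressed": True, "resolution_px": 360}))

Surfacing the caveats in the same view as the score, rather than in separate documentation, is intended to interrupt the blind-trust pattern we observed in Section 4.3.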
Future Work. This study has the potential for various extensions. It would be interesting to see how journalists from different regions of the world behave in this study; as Humprecht [45] showed, levels of journalistic professionalism vary between countries. It would also be worthwhile to observe the influence of hierarchy and experience on verification and publication decisions in a group setting. Another direction is to examine different ways to offer domain-specific explainability and transparency in a deepfake detection tool for journalists, and to understand how they might use these when constructing news pieces. However, explainability would need to be handled and studied carefully so that it does not exacerbate failures [34,90]. Finally, the study design itself may benefit from automation, with recent advances in Large Language Models (LLMs) helping to design various aspects of the study and even take on the role of the DM.

CONCLUSION
Journalists are the gatekeepers of truth in the published media. This work provides novel insights into their behavior and perceptions of deepfake detection tools through an engaging scenario-based role-play methodology. Our key findings reveal that while journalists have positive views of these new AI assistance tools and look forward to incorporating them into their workflows, most tend to rely first on traditional verification methods to establish context. The tools see more use when contextual verification is difficult or time is limited. We observe that urgent breaking-news scenarios sometimes lead journalists to skip verification steps, while high-impact stories result in more diligent checking.
Additionally, we noticed risky overreliance by a few participants due to potential cognitive biases, particularly when tool results confirmed their initial impressions. This signals the need for caution around deepfake detector deployment and highlights the importance of improving explainability. Given the participants' interest, our scenario methodology shows promise as training to improve verification skills.
Given the current surge of interest in Large Language Model (LLM) chatbots [91], our findings may extrapolate to suggest that users might overrely on their outputs due to a lack of experience and the prevailing social media hype regarding their efficacy. Although recent applications warn users about result instability, a scenario-based role-play methodology can help assess changes in workflows across various industries, gauge user perceptions of these tools, and determine the degree to which users heed the provided warnings.
In conclusion, as deepfake detection tools see wider adoption, it is vital that we understand their impacts on professional workflows and provide adequate training. This will empower journalists to harness AI assistance for quality reporting while mitigating unintended harms from its shortcomings. Further interdisciplinary efforts between journalists, developers, and researchers in this space are crucial.

Figure 1: Left: Dungeons & Deepfakes study flow showing the various steps taken over the course of each session. Top Right: Introduction slide with an avatar generated using a face generator. Bottom Right: A cropped presentation slide corresponding to the participant's selection of the action to use a deepfake detection tool. The response reveals identified manipulation in the visual frames using two distinct methods, collectively indicating 98% certainty, alongside predominantly pristine audio content. In this particular instance, the tool does not detect any inconsistencies in the synchronization between the audio and video signals.

Scenario excerpts (from Tables 2 & 3):
NA3 | October 18, 2020 | A team member from your news gathering team points out a viral Tweet from Kanye West. [Tweet] "I am pulling out on my election plans. Me and Kim refuse to work with a party that supports the vision of a terrible person like Trump. All the best wishes to Biden!" Kanye West sitting on a couch at what looks like his home; video shot in a conference-call fashion.
NA5 (entry truncated) | Dr. Fauci sitting in his home office talking to the webcam. | Tool Verdict: Fake with real audio. | Real Verdict: Real
INT1 | January 25, 2022 | A coworker at your news organization points out a video of the secretary to Prime Minister Sheikh Hasina, Md. Tofazzel Hossain Miah, admitting to bribery, that is going viral on TikTok. The news could put a lot of pressure on both the secretary and the bribing party, East West Properties Development Ltd., the same party that was accused of fraud and embezzlement about 10 years ago. Md. Tofazzel Hossain Miah is seen talking about receiving Tk. 75 lac from Ahmed Akbar Sobhan Shah Alam with the aim of facilitating a takeover of 20 acres of land in the Ketun region on the outskirts of Dhaka city. | Tool Verdict: Fake with real audio. | Real Verdict: Fake
INT2 | January 25, 2021 | Myanmar government-controlled national news publishes an article with Maung Zarni, a Burmese activist for human rights, taking a payment from two other men in what appears to be a shipping dock. The two men are identified as part of a recent round of arrests related to human trafficking. (entry truncated)

Figure 2: Individual Study. Analysis of the verification actions taken by the participants per scenario. Actions start at the top (white box) and move down, keeping the location of the deepfake verification tool action as a reference point. The action boxes contain the verdict Likert response (1: "Real" to 5: "Fake") after using the action. Publication boxes are marked with ✗ for a wrong verdict, ★ for reaffirmation, and △ for overreliance. The top box in each scenario denotes the initial verdict before verification. Please refer to Table 1 for the participant codes, and Tables 2 & 3 for the scenario codes.

Figure 3: Group Study. Analysis of the verification actions taken by the groups per scenario. The sequence begins on the left (white box) and moves right, keeping the location of the deepfake verification tool action fixed as a reference point. Each action box represents a set of three actions and contains the verdict Likert response (1: "Real" to 5: "Fake") after using the selected actions. Publication boxes are marked with ✗ for a wrong verdict, ★ for reaffirmation, and △ for overreliance. The leftmost box in each scenario denotes the initial verdict before verification. Please refer to Table 1 for the participant codes, and Tables 2 & 3 for the scenario codes.

Figure 4: Coherence and cohesion between SGDA elements for the study.
Scenario action list and events (excerpt):
Actions [contextual]: Contact the sender; Visit Greg Schultz's Twitter; Search Google for context; Look for other news reports; Analyze original poster's account; Contact the original author; Message Greg Schultz; Analyze retweeters' accounts; Search Twitter for context; Invent action.
Actions [technical]: Check metadata; Run through DeFake tool; Manually analyze video; Use InVID Project; Request biometric analysis (3 step cost).
Event 1: Video goes viral with 30,000+ retweets, Trump retweets.
Event 2: Greg Schultz responds denying involvement.