Closing the Knowledge Gap in Designing Data Annotation Interfaces for AI-powered Disaster Management Analytic Systems

Data annotation interfaces predominantly leverage ground truth labels to guide annotators toward accurate responses. With the growing adoption of Artificial Intelligence (AI) in domain-specific professional tasks, it has become increasingly important to help beginning annotators identify how their early-stage knowledge can lead to inaccurate answers, which, in turn, helps ensure quality annotations at scale. To investigate this issue, we conducted a formative study involving eight individuals from the field of disaster management, each possessing varying levels of expertise. The goal was to understand the prevalent factors contributing to disagreements among annotators when classifying Twitter messages related to disasters and to analyze their respective responses. Our analysis identified two primary causes of disagreement between expert and beginner annotators: 1) a lack of contextual knowledge or uncertainty about the situation, and 2) the absence of visual or supplementary cues. Based on these findings, we designed a Context interface, which generates aids that help beginners identify potential mistakes and reveal the hidden context of the presented tweet. The summative study compares the Context design with two widely used designs in data annotation UI, the Highlight and Reasoning-based interfaces. We found significant differences between these designs in terms of attitudinal and behavioral data. We conclude with implications for designing future interfaces that aim to close the knowledge gap among annotators.


INTRODUCTION
As the quality of training sets directly contributes to the quality of Machine Learning (ML) models, research in data annotation focusing on producing high-quality labels has attracted attention from several communities, including HCI [19,22,28], Natural Language Processing (NLP) [8], Computer Vision [40], and beyond. Several data annotation approaches have focused on providing information cues that can help human annotators complete their work more efficiently, for example, by offering relevant examples [20] or highlighting words or images [19]. Such efficiency-driven annotation designs generally seek to improve microtasks that require less domain expertise, such as restaurant review sentiment classification [30] or object detection [42].
As the goals of ML models become more contextualized and domain-specific [3,4,49,62], however, many annotation task types require advanced knowledge that is beyond that of beginner annotators. Examples of such domain-specific problems include damage assessment [14,33], risk analysis [51], information filtering for disaster management [44], medical image reading [41], and many more [29]. While less experienced annotators are more readily available than those with extensive domain experience, helping them annotate "like a pro" by relying solely on efficiency-driven designs can be challenging. Prior research has emphasized the significance of incorporating a diverse range of knowledge and perspectives [6,37,48], highlighting the need to understand the implications of the knowledge gap between beginners and experts in modern data annotation user interface design. However, few approaches have explored ways to close the knowledge gap in designing data annotation user interfaces.
In this work, we seek to understand how the knowledge gap can impact the annotation performance of beginners and characterize common reasons that lead beginners to make different decisions than experts. We further seek to explore how annotation design strategies can provide information that mitigates the knowledge gap. To understand and measure this knowledge gap, we conducted a formative study (S1) with a diverse group of eight participants (Experts and Beginners) in different sub-domains of disaster management. We provided distinct sets of one thousand tweets related to Hurricane Ian [53] to both groups and asked them to annotate whether each tweet pertained to an emergency event involving transportation means or damaged infrastructure, or was unrelated to the context. Next, we interviewed the annotators to understand the rationale behind annotation decisions that differed from those of expert annotators. Our analysis revealed two common reasons for these disparities: (1) confusing words, where messages contain words related to transportation and infrastructure, such as "cars" or "bridges", but the ground truth is negative; and (2) hidden context, where messages lack readily apparent related words but require an "institutional level of insights" to be interpreted as relevant.
In our summative stage (S2), we aimed to evaluate how an AI-assistive design built on S1 insights can impact data annotation task performance compared to widely used state-of-the-art designs. In S2, we built the Context interface, which provides cues to detect confusing words commonly misinterpreted by the annotators and reveals the hidden context derived from the annotation differences between domain experts and beginners that we identified in S1. We built two additional designs that represent commonly employed annotation user interface designs. One is the Highlight interface, which color-codes relevant words [19,25]. The other is the Reasoning interface, which uses Large Language Models (LLMs) to explain "why" the message can be interpreted as positive or negative [7,34]. Using the three conditions, we conducted an experimental study with 13 Community Emergency Response Team (CERT) volunteers who are beginners in their domains. Across the three conditions, we measured how their behavioral and attitudinal annotation task performance varied. We observed that the Highlight and Context interfaces exhibited nearly identical behavioral accuracy. The Highlight interface proved to be the fastest among the three designs. In terms of attitudinal performance metrics, the Context design outperformed the other two, leading annotators to perceive it as the most effective in reducing the knowledge gap.
This work offers the following contributions:
• Empirical understanding of the knowledge gap in data annotation: Through S1, we offer empirical insights into the knowledge gap that exists between data annotators, specifically within the disaster management domain. We accomplish this by eliciting annotations from two groups of beginner and experienced annotators. Through interviews with beginner annotators, we identify the two common reasons behind annotation differences between beginners and experts: confusing words and hidden context.
• Design for mitigating knowledge gap: Based on the S1 findings, we develop a novel interface design that leverages the two common reasons behind annotation differences between beginners and experts.
• Experimental study: To understand how the new design can help beginners mitigate the knowledge gap during annotation, we conducted S2 and report the results.
• Implications for design: Based on S1 and S2 findings, we discuss how annotation interfaces can leverage our insights to help beginners reduce the knowledge gap in the future.

RELATED WORK
In this review, we describe how research in group work has been applied to design annotation interfaces. We then examine existing data annotation interface studies. By synthesizing insights from both review directions, we conclude with the necessity of more deeply understanding knowledge gap-driven design in annotation interface research.
All forms of collaboration stem from a similar motivation; group work can be productive because it allows individuals to share and gain insights they might not have discovered independently [13]. By pooling the knowledge and expertise of all members, group work can be executed more efficiently and accurately [6,16,26,27]. Several studies in HCI, especially Computer-Supported Cooperative Work (CSCW), have built interactive systems and designs that apply in group work settings, such as collaborative searching, collaborative information seeking, and beyond [2,9,27,52,54]. In the research on data annotation interfaces, a line of studies applies the spirit of group work ("the whole is greater than the sum of its parts"), especially when the ground truth is not deterministic and individual viewpoints can matter. For instance, Chang et al. [17] introduced a tool, "Revolt", that facilitates group decision-making for collaborative crowdsourced workers. They leveraged the disagreement between the annotators to identify ambiguous concepts and provide more options for decision-making based on the disagreement feedback. Upchurch et al. [59] present an online game allowing annotators to refine the majority vote labels to build consensus among them while labeling image data.
Numerous studies have focused on group decision-making, theorizing how a group, whether multiple humans or humans and AI, can explain different perspectives and resolve potential disagreements. For example, Sutcliffe introduced a Small Group Theory for designing CSCW systems that employs model-based analysis of group members, technology support, and design principles derived from the theory [58]. Hong et al. investigated the impact of single-user designs on consensus-building processes [31] and, to foster group consensus, they introduced Collaborative Dynamic Queries [30]. This approach allows a group to filter decision criteria while sharing the choices made by others. Similarly, Kairam and Heer [36] illustrated how crowd-parting analysis at the intermediate level offers insights into sources of disagreement not readily apparent when examining individual annotation sets or aggregated results. Brachman et al. suggested a system where AI assists in identifying cases where the majority vote by labelers is incorrect, employing automation for conflict resolution tasks [11]. The outcome demonstrated that automated conflict resolution enhances user accuracy and efficiency.
Past studies have developed intuitive interfaces that simplify the task of labeling, incorporating features like drag-and-drop [55], batch labeling [5], highlighting [19], and language-based models [7]. Stureborg et al. grouped similar content together and used different kinds of pass logic to coordinate crowdsourced annotators in multi-labeling tasks [56]. Gooding et al. performed a comparative study on annotation interfaces for summarization tasks with trained in-house annotators of different proficiency levels [25]. The majority of tasks discussed in the earlier literature fall into the category of "microtasks," which can be performed by almost anyone. In contrast, we define our target annotation task as one that demands advanced knowledge or experience for making accurate decisions, distinguishing it from "microtasks." Our task is particularly concerned with the knowledge gap between beginners and experts, aligning with the focus presented by Wilkins et al. [61]. In their qualitative study, Wilkins et al. investigated how 38 knowledge workers (21 freelancers and 17 employees) apply knowledge, demonstrate skill in their work, and mobilize resources to address knowledge needs, a phenomenon referred to as the "knowledge gap." When designing technology to alleviate this knowledge gap, it is crucial to address the social-technical gap, as this represents a central challenge for CSCW [1]. Several studies have highlighted how this knowledge gap can erode productivity. For instance, Kapania et al. reported that when annotators lack access to essential information, clear guidance, and opportunities for collaboration, a knowledge gap can emerge among them [37]. Expert annotators possess a deep understanding of the domain and context, enabling them to provide more accurate and consistent annotations, whereas beginners may lack this critical expertise, leading to errors and misinterpretations in their annotations [46,48,60]. Annotation tasks are designed to establish a definitive "ground truth" label for training machine learning models to minimize the influence of annotators' knowledge gap [37].
Through the literature review, we identified that the insights from group work have been mostly applied to subjective tasks with no deterministic ground truth rather than objective tasks. For objective tasks, meanwhile, we also identified that research in data annotation design has put more emphasis on supporting microtasks where domain-specific expertise is less important. Based on the review results, we were motivated to further understand how the knowledge gap can manifest in objective tasks, what the common reasons are, and how new data annotation user interface designs can effectively support "beginner" annotators.

FORMATIVE STUDY (S1)
Our formative study's objective is to analyze the knowledge gap between experienced and beginning annotators as they identify transportation-related events on social media during a disaster. S1's Research Questions (RQs) are as follows:
• RQ1. How does the variance in expertise between Beginners and Experts introduce differences in conducting annotations?
• RQ2. What are the typical reasons an annotator's knowledge gap leads to disagreement in annotation tasks?

Recruitment
For S1, we recruited eight participants in different sub-domains of disaster management with varying levels of expertise. One of the authors with expertise in the field reached out to the participants via email to inquire about their interest in participating in the study.
From the interested participants, we selected four Experts that included an EMT specialist and transportation manager from the county-level government, along with two emergency managers and a cybersecurity analyst from the federal-level government. The four remaining participants were community members affiliated with a Community Emergency Response Team (CERT) at the county level of government, categorized as Beginners for the purposes of this study. CERTs in the United States offer standardized training and organization, serving as reliable resources for disaster response entities like transportation agencies [21]. When the operation is activated, CERT volunteers can assist formal humanitarian organizations across different disaster response and management tasks, expanding their roles to virtual support, such as social media analysis. Each participant was compensated with a gift card for their participation in the study.

Method
To gain a deeper understanding of current approaches to extracting relevant information during disaster events, the common challenges practitioners encounter, and how the knowledge gap can impact their workflow, we conducted open-ended interviews with Expert group participants (P4, P5, P7, P8). These interviews were conducted remotely through the Zoom platform and lasted approximately one hour each.
Prior to commencing the interviews, we obtained informed consent from all participants. Interview questions centered around the following key themes: 1) the expert's overall workflow for identifying significant events on online intelligence systems, 2) the procedures employed for training less experienced volunteers and the corresponding challenges, 3) the impact of the "knowledge gap" between less experienced personnel and experts when analyzing online intelligence, and 4) the desired features sought to speed Beginners' experience acquisition in this domain.
The next phase was the ground truth collection phase. Creating a gold standard is a fundamental requirement for performance evaluation and accuracy verification [15]. The purpose behind collecting ground truth data is to pinpoint crucial instances where annotators exhibit the most disagreement and determine their correct labels. These datasets would then be employed during the design evaluation phase, using various interface designs, to assess whether beginners can annotate them with improved accuracy. As part of this phase, we conducted interviews with four Beginner CERT volunteers (P1, P2, P3, P6) to gain deeper insights into their decision-making processes. We analyzed the disagreement data with the help of a transportation expert (separate from our recruited Expert participants) to prepare our test datasets. This analysis also guided us in identifying design considerations for an annotation interface that might reduce the knowledge gap and enable Beginners to make more accurate decisions. To summarize, our participants comprised Experts (P4, P5, P7, P8), Beginners (P1, P2, P3, P6), and one transportation expert analyst who verified the ground truth data.

Understanding Problem Scope
This stage of the study addresses our first research question, RQ1. The recruited Expert participants have field experience in disaster management and have witnessed first-hand the difference between experienced and inexperienced personnel. Their practical insights are instrumental and relevant in qualitatively exploring knowledge gaps within this domain. The Experts also shared their experiences with existing supervised machine learning systems used in their practice and deliberated on applicable considerations for these systems. These systems have been trained to recognize disaster-related information aligned with specific disaster mission objectives. P5 emphasized the importance of considering various sources in these systems, such as social media platforms, television, or the Internet, for gathering information. All of these sources can provide different kinds of information from different populations that together can form a more complete picture of emergency situational awareness. However, one prominent challenge that emerged from our discussions pertained to the contemporary deluge of data and the difficulty of assessing its accuracy. P5 elaborates, "The volume of data and the lack of awareness as to its legitimacy. Certainly, misinformation can lead us down a path to being challenging to have to validate and then ultimately throw out the information as far as being relevant". Incorrect labeling and misinformation compound this challenge, further complicating efforts to verify the credibility of information within these systems.
Annotation agreement, and feedback on it, are also important in this context. Drawing on his past annotation task experience, P5 remarked that the difference between experts and beginners lay less in the number of mistakes than in the reasons behind their decisions: "There were not that many errors or mistakes made by the less experienced versus experienced annotators. I just think the difference between potentially one of the reasons there". Another concern is the challenge of accurately assessing the severity of a situation based solely on media, particularly when it comes to events like natural disasters. For example, P4 mentioned, "People post things about a situation, it may sound terrible, but it may be very localized. There may be some flooding in the county or in a specific location. It is not like there is flooding all over the county". P7 underscored the importance of considering context and perspective when assessing the severity of a situation: "My car is in high water, and I can not cross the road. Well, it is bad for you, but I do not know if that is an emergency for others". This illustrates the subjectivity of emergencies: a situation that is urgent for one person may not be urgent for others.
P7 emphasized that people's perspectives on whether something is good or bad can vary significantly based on their position and experience. As P7 says, "Your reason for thinking this is bad or good is very different, based on kind of the position you are in. If a junior person says to me, it is really bad for them because of this reason. And then I say, I do not think it is so bad, because, from a business perspective, this is not what we cover". The statement shows how one's professional background, experience, and responsibilities can shape their perception of a situation. Additionally, it suggests that more experienced individuals may tend to have a broader perspective and not view every issue as critically as less experienced individuals might.

Ground Truth Collection
Ground truth collection was conducted in three steps: 1) collecting samples from social media, 2) performing the annotation task, and 3) disagreement analysis.

3.4.1 Collecting samples from social media: In the process of data collection, our primary focus was on the recent disaster event Hurricane Ian [53]. We gathered data (publicly posted status and text messages) from the Twitter social media platform [35] specifically within the timeframe of September 23, 2022, to October 2, 2022, covering both the event itself and the aftermath.
For the annotation task, two classes were defined: Transportation Means (TM) and Damaged Infrastructure (DI). TM was defined as the means used to move people and/or goods from one place to another that have operational value to the public safety mission. For example, transportation officials may call for a debris management team to remove an inoperable vehicle from the roadway during a disaster or emergency event. DI was defined as foundational structures and systems for transporting people and goods that have been partially or completely damaged.
We applied an existing methodological framework [45] that describes the process of collecting relevant data for disaster management agencies. The process included the use of domain expert-provided keywords to search/filter transportation-specific messages from Twitter. ChatGPT [34] was used to enhance the preliminary set of keywords. Given that Hurricane Ian made landfall in the Southwest District of Florida, we queried the tool based on common transportation means found in that geographical region. Additionally, we queried the names of bridges, causeways, ports, highways, bus and rail services, and county-level transportation agencies in Lee, Charlotte, etc. A total of 475 keywords were identified as relevant to the context and verified by a domain expert. Using these keywords, we filtered and curated tweets, resulting in a final sample of 4,000 data points for our initial annotation task.
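The keyword-based filtering step can be illustrated with a minimal sketch. The keywords, the tweet texts, and the function name `filter_by_keywords` are hypothetical placeholders, not the 475 expert-verified keywords or the pipeline actually used; case-insensitive whole-word matching is likewise an assumption made for illustration.

```python
import re

def filter_by_keywords(tweets, keywords):
    """Return the tweets whose text contains at least one keyword.

    Matching is case-insensitive and on whole words or phrases
    (an assumption; the source does not specify the matching rule).
    """
    # One alternation pattern; re.escape guards against special characters in keywords.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [t for t in tweets if pattern.search(t)]

# Hypothetical keywords and tweets, for illustration only.
keywords = ["bridge", "causeway", "highway", "bus"]
tweets = [
    "The causeway is closed after the storm surge",
    "Stay safe everyone, thinking of Florida",
    "Highway traffic is backed up for miles",
]
relevant = filter_by_keywords(tweets, keywords)
print(relevant)
```

A filter of this kind favors recall over precision: it keeps any tweet that mentions a keyword, whether or not the tweet concerns the disaster.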

3.4.2 Performing annotation task: From the pool of 4,000 tweets gathered in Section 3.4.1, we divided them into four distinct datasets and assigned each dataset to an individual annotator. Each annotator was tasked with labeling a set of 1,000 tweets that did not overlap with the other annotators' sets. The annotation involved categorizing each tweet text as either TM, DI, or both. Participants also had the option to designate tweets as IR (Irrelevant) to the context. The primary aim of this task was to gather annotations from beginners, which were later cross-verified by experts to identify any disagreements (labels differing from the expert's decision). An interface similar to [50] was used to execute this task. The data collection process occurred asynchronously, spanning a total duration of one week.
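As a rough illustration of the split described above, the sketch below deals 4,000 items into four non-overlapping sets of 1,000. The function name `split_into_disjoint_sets`, the integer stand-in IDs, and the seeded random shuffle are assumptions; the source does not state how tweets were assigned to datasets.

```python
import random

def split_into_disjoint_sets(items, n_sets, seed=0):
    """Shuffle items and deal them into n_sets equally sized, non-overlapping lists."""
    if len(items) % n_sets != 0:
        raise ValueError("items must divide evenly across sets")
    shuffled = list(items)                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)     # seeded for reproducibility (assumption)
    size = len(shuffled) // n_sets
    return [shuffled[i * size:(i + 1) * size] for i in range(n_sets)]

tweet_ids = list(range(4000))                 # stand-in IDs for the 4,000 sampled tweets
datasets = split_into_disjoint_sets(tweet_ids, n_sets=4)
print([len(d) for d in datasets])
```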

3.4.3 Disagreement Analysis: This stage of the study addresses our second research question, RQ2. In this step, we analyzed the annotations collected in Section 3.4.2. A domain expert with decades of experience in the emergency management profession conducted a comprehensive analysis of these tweets. The Expert reviewed the instances and determined which label was correct for each tweet. A disagreement occurs when the Expert's decision differs from the annotator's.
The results revealed that out of 909 tweets where disagreements occurred, approximately 51% (459 cases) were accurately labeled by the Expert, approximately 43% (392 cases) were accurately labeled by Beginners, and approximately 6% (58 cases) remained inconclusive, with neither party making the correct decision. Furthermore, these disagreements were categorized into distinct themes based on their types, such as classification errors, institutional insights, lack of visual cues, language barriers, and matters of opinion.
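The disagreement rule and the reported breakdown can be checked with a short sketch. The function names and the toy label dictionaries are hypothetical; only the three counts (459, 392, 58) come from the text above.

```python
def find_disagreements(annotator_labels, expert_labels):
    """Return the tweet IDs whose annotator label differs from the expert label."""
    return [tid for tid, label in annotator_labels.items()
            if expert_labels[tid] != label]

def percentage_shares(counts):
    """Convert raw counts into percentages of their total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Toy labels, for illustration only.
annotator = {"t1": "TM", "t2": "IR", "t3": "DI"}
expert = {"t1": "TM", "t2": "DI", "t3": "DI"}
disagreements = find_disagreements(annotator, expert)

# Counts reported in the disagreement analysis.
counts = {"expert_correct": 459, "beginner_correct": 392, "inconclusive": 58}
shares = percentage_shares(counts)
print(disagreements, sum(counts.values()), shares)
```

The three counts sum to the 909 disagreement cases, and the computed shares agree with the reported approximate percentages after rounding to whole numbers.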
We organized online sessions with Beginners to collect insights about their decision-making processes. Each session incorporated two distinct types of exercises. Both of these approaches collect natural user behavior in a relatively unobtrusive manner over an extended period, providing some insights into individuals' thought processes as they engage in activities. They are particularly valuable for comprehending the underlying reasons behind tasks that require focused attention [47].
(1) Data annotation session: In this task, we reviewed 50 samples drawn from the pair-wise annotation project, specifically targeting instances where Beginners made incorrect decisions. Our selection process prioritized cases falling under the "Institutional insights" category due to hidden contextual indicators suggesting operational relevance, requiring annotators with disaster management expertise. We also included samples from other disagreement categories. Our primary focus was uncovering two key insights for each tweet: the rationale behind the prior class selection (TM, DI, or IR) in annotations and potential alternative interpretations. We recorded participants' responses, contributing to a comprehensive analysis.
(2) Follow-up retrospective interview: This session encouraged annotators to group and articulate the reasons behind their decisions. We asked each participant to list 5 common reasons for grouping their annotation decisions. Among the most frequently identified reasons were: a lack of contextual knowledge or uncertainty about the situation (mentioned by 6 participants), the absence of visual and supplementary cues (mentioned by 5 participants), consideration beyond mere keywords, the presence of low-quality tweets, relevance to transportation or infrastructure, clear indicators of events, and the inclusion of topics unrelated to the emergency situations.

Design Considerations
After analyzing feedback from both expert and beginner annotators, it became evident that annotating social media data within the context of disaster management presents a multifaceted challenge [32]. Annotators frequently encounter difficulties in making accurate decisions due to the nuanced nature of interpreting messages in this domain, a process heavily influenced by their individual levels of expertise. Individuals with more experience tend to possess a deeper understanding of both knowledge and context within a given field, as P7 mentioned, "Experience tends to know more about the knowledge and of the context and it is because of their experience. They can like to take those events from their own experience and it is easy for them". We have pinpointed key challenges and explored potential design strategies aimed at bridging the knowledge gap, enabling beginners to make decisions equivalent to those of experts in this complex task.

3.5.1 Revealing hidden message context. The most frequently cited factor influencing annotation decisions is an awareness of the message's context. Less experienced annotators often struggle with grasping the correct context of tweets, particularly when it relates to the field of disaster management. Having more context or information about a situation can simplify the process of deciding what class or category to assign, as P1 mentioned, "If you see more context, it makes it easier to make a decision as to what the label would be". Providing users with easily accessible definitions and customized resources could enhance their understanding of unfamiliar terms and contexts, ultimately improving their ability to complete tasks or missions effectively. For example, P5 said, "It could be beneficial for less experienced users to understand and comprehend terms and contexts a little bit better where they could have sort of like a dictionary that gives the definitions of what their mission is or a more customized resources pointing people to getting more information about the tweet". P6 also discussed a similar issue of insufficient context for a particular topic, caused by a lack of visual cues or inaccessible links. If annotators had access to these cues, they might recover the missing context.

3.5.3 Emphasizing class-relevance hints. The participants also discussed the advantages of emphasizing significant keywords or phrases associated with a specific class, a practice that has been utilized in previous research as well [19,25]. By providing a visual aid that draws attention to the most critical elements within the text or content being annotated, such as highlighting key terms or phrases, annotators can quickly locate and focus on the information that is directly relevant to the task. As P1 remarked, "Trucks, commercials or keywords related to transportation means, or any kind of damaged infrastructure will be actually helpful to you to make these decisions". Emphasizing the important information can help annotators focus on a particular situation where they see significant damage resulting from an event or incident. For example, P2 said he was specifically focused on extreme cases, "I guess I was looking at it as the extreme damage was done and where our emergency resources are needed". While highlighting keywords can be a valuable starting point for annotators, it is essential to consider the broader context surrounding those words to grasp their intended meaning fully. For instance, P4 said, "Highlighting the relevant keywords and if someone is instructed paying attention to anything but where it mentions driving or your car, then some of those words might get picked which really have nothing to do with the disaster event".

3.5.4 Providing clues for potential errors. Making informed choices during the annotation task is crucial in the context of disaster management. We observed that some annotators prioritized a set of keywords over the contextual information because these keywords appeared significantly pertinent to a specific class, even when the tweet itself did not pertain to an emergency situation. This can occur due to the complexity of certain tweets, which makes it challenging to correctly categorize them, as well as differences in annotators' experience levels. For example, P3 mentioned, "It mentioned car, that's why I chose TM, but this one probably should have been marked irrelevant because it doesn't mention anything about the actual storm. At 70% time, I was probably selecting classes based on keywords, and probably for 30%, filtering it of what could be most useful for them". While false positive cases may not significantly impact operations, false negatives for the transportation sector in emergency scenarios can have a substantial adverse effect on the situation. For instance, P8 said, "If we make a mistake in the wrong place at the wrong time, we could actually get somebody killed".
Providing hints for potential errors or ambiguous classifications can help annotators by offering guidance and clarification in situations where there might be uncertainty or complexity. When annotators encounter ambiguous cases, such hints can help them make more informed decisions.

3.5.5 Elaborating message content depending on class. Explaining possible reasons why a message belongs to a particular class can also help annotators, as they can better understand why a message fits into a specific category or why it does not. This enhances their decision-making process. In support of this, P5 remarked, "If people were trying to understand the relevancy of transportation means and damaged infrastructure, having a system where that could provide some local knowledge within it would be tremendously beneficial". This highlights the importance of local insights and expertise in understanding the practical implications and significance of transportation and infrastructure issues.

ANNOTATION INTERFACE DESIGN
Based on the design considerations, we developed three annotation interfaces for providing aids to the annotators: 1) Highlight, 2) Reasoning, and 3) Context. The Highlight and Reasoning interfaces are designed based on state-of-the-art techniques used in data annotation [7,19,25]. Since our S1 revealed significant challenges in the decision-making process, we introduced the Context interface, which incorporates two types of hints: 1) highlighting potentially confusing words and 2) hidden context within the presented tweet.

Common features
All interfaces have been created to facilitate text annotation [50] by presenting individual Twitter messages one at a time and inquiring about their relevance to specific classes. Each question is accompanied by two radio button options ("yes" and "no") inquiring whether the displayed tweet is associated with that specific class, as shown in Figure 1. The "Confirm and submit" button is disabled by default when no option is selected. After users choose an option, they can proceed to the subsequent question by clicking the "Confirm and submit" button. To maintain the fidelity of our experimental conditions, questions for each tweet are initiated per the tweet's ground truth label (considering both false positive and false negative cases). Further details on the sampling method can be found in Section 5.1.2. Users can also track the number of completed and remaining tweets from a progress status bar. The annotation task is executed for two classes (TM and DI). If both class options are labeled as "no," we categorize the tweet as irrelevant (IR) for the given context.

Highlight
The Highlight interface employs color-coded schemes to highlight specific keywords within a single tweet. We highlight relevant and irrelevant words (tokens) of the text to help the annotator pay more attention to the tokens that can potentially be indicative of the correct label. Moreover, we consider different intensities in highlighting the tokens to represent how much a token is relevant or irrelevant to the label. For example, the token "Drive" with darker shades indicates a stronger connection (TM), while the lightest shaded tokens "right" and "HUGE" represent the weakest association (Not TM), as shown in Figure 1 (a).
In our formative study, we used 4000 samples to find disagreement cases. We use samples from the agreement part to extract relevant and irrelevant tokens for each label and calculate their relevancy measures, as follows:
(1) For each sample, we create a list of candidate tokens. These tokens include nouns, verbs, and named entities that exist in the sample. Here, we use spaCy (an open-source library for NLP) to extract these tokens.
(2) To calculate the relevancy of each token $t$ to a class $c$, we employ the normalized Pointwise Mutual Information (nPMI) measure [10]. First, we calculate $\mathrm{PMI}(t, c) = \log_2\big(P(t, c) / (P(t)\,P(c))\big)$, where $P(t, c)$ is the probability of a sample containing token $t$ and being annotated as class $c$, $P(t)$ is the probability of a message containing token $t$, and $P(c)$ is the probability of a message being annotated as class $c$.
Then, we use $\mathrm{nPMI}(t, c) = \mathrm{PMI}(t, c) / (-\log_2 P(t, c))$ to normalize the PMI measure and map it to the range $[-1, 1]$. This measure captures the relevancy of token $t$ to class $c$ as follows:
• An nPMI value of zero indicates no association between token $t$ and class $c$.
• A positive nPMI indicates that samples from class $c$ are more likely to use the token $t$.
• A negative nPMI indicates that samples from class $c$ are less likely to use the token $t$.
Furthermore, since we need to find both relevant and irrelevant tokens for the target class $c$, we divide the dataset into samples that have been annotated as relevant to the target class $c$ and those with an irrelevant label. We then generate two lists, one of extracted relevant tokens and one of extracted irrelevant tokens for the class $c$, by calculating the nPMI measure for each token in these two categories. In this way, a higher $\mathrm{nPMI}(t, c)$ on the relevant samples of class $c$ implies that the token $t$ is more relevant to class $c$, and a higher $\mathrm{nPMI}(t, c)$ on the irrelevant samples of class $c$ indicates that the token $t$ is less relevant to class $c$.
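The nPMI computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's implementation: the function name `npmi_scores`, the input layout (a list of `(tokens, label)` pairs), the base-2 logarithm, and the handling of the two edge cases are all assumptions made for the example.

```python
import math
from collections import Counter

def npmi_scores(samples, target_class):
    """Return {token: nPMI(token, target_class)}.

    `samples` is a list of (tokens, label) pairs (hypothetical layout).
    Probabilities are estimated as document frequencies over all samples.
    """
    n = len(samples)
    p_c = sum(1 for _, label in samples if label == target_class) / n
    token_count, joint_count = Counter(), Counter()
    for tokens, label in samples:
        for tok in set(tokens):          # count each token once per sample
            token_count[tok] += 1
            if label == target_class:
                joint_count[tok] += 1

    scores = {}
    for tok, n_t in token_count.items():
        p_t = n_t / n
        p_tc = joint_count[tok] / n
        if p_tc == 0.0:
            scores[tok] = -1.0           # never co-occurs: limiting value of nPMI
        elif p_tc == 1.0:
            scores[tok] = 0.0            # 0/0 case; treated as "no association" here
        else:
            pmi = math.log2(p_tc / (p_t * p_c))
            scores[tok] = pmi / (-math.log2(p_tc))
    return scores

# Toy data, invented for illustration only.
toy = [
    (["road", "flood"], "TM"),
    (["road", "car"], "TM"),
    (["storm"], "IR"),
    (["storm", "car"], "IR"),
]
toy_scores = npmi_scores(toy, "TM")
```

In the toy data, "road" appears only in TM samples and scores at the top of the range, "storm" never appears in a TM sample and scores at the bottom, and "car" is split evenly and scores zero.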
In the testing phase, we highlight two tokens that are likely more relevant and two tokens that are likely less relevant to the class $c$. For selecting the top two relevant tokens, we use a list of expert-provided relevant tokens in addition to the list of extracted relevant tokens, with the following procedure:
(1) If the candidate tokens in the test sample exist in the expert-provided relevant tokens, they are selected and assigned a relevancy measure of 1.0.
(2) We select the top-k tokens from the extracted relevant tokens with the highest nPMI measures.
(3) We merge these two lists and select the top two tokens with the highest nPMI measures from the merged list.
For selecting the top two less-relevant tokens, we use the list of extracted irrelevant tokens and select the top-k tokens with the highest nPMI measures. Since the same token may be interpreted as both less-relevant and more-relevant, we remove such tokens from the less-relevant list to avoid confusion and then select the top two tokens.
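The selection procedure can be sketched as follows. The function name `select_highlights`, the dictionary inputs, and the default `k` are hypothetical. The sketch also assumes that the top-k lists are restricted to tokens present in the test sample and that overlap is checked against the merged relevant list; the text does not specify either detail.

```python
def select_highlights(candidates, expert_relevant, extracted_relevant,
                      extracted_irrelevant, k=5, n=2):
    """Pick up to `n` more-relevant and `n` less-relevant tokens for one sample.

    candidates           : tokens of the test sample
    expert_relevant      : set of expert-provided relevant tokens
    extracted_relevant   : {token: nPMI} computed on relevant samples
    extracted_irrelevant : {token: nPMI} computed on irrelevant samples
    """
    unique = list(dict.fromkeys(candidates))     # de-duplicate, keep order

    def top_k(scores):
        present = [(t, scores[t]) for t in unique if t in scores]
        return sorted(present, key=lambda pair: -pair[1])[:k]

    # Steps (1)-(3): expert tokens get score 1.0 and are merged with the top-k list.
    merged = dict(top_k(extracted_relevant))
    merged.update({t: 1.0 for t in unique if t in expert_relevant})
    more_relevant = sorted(merged, key=lambda t: -merged[t])[:n]

    # Less-relevant tokens: drop anything that also appears in the relevant list.
    less_relevant = [t for t, _ in top_k(extracted_irrelevant) if t not in merged][:n]
    return more_relevant, less_relevant

# Invented scores, for illustration only.
more, less = select_highlights(
    candidates=["drive", "bridge", "right", "huge", "storm"],
    expert_relevant={"drive"},
    extracted_relevant={"bridge": 0.6, "storm": 0.3},
    extracted_irrelevant={"right": 0.5, "huge": 0.4, "storm": 0.7},
)
```

Here "storm" scores highest on the irrelevant side but is removed because it also appears in the relevant list, which mirrors the conflict-resolution rule in the text.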

Reasoning
The Reasoning interface presents explanations for why a single tweet message could be classified to a particular class (Why), as well as reasons for why it might not belong to that class (Why Not).
Annotators can review these rationales to inform their decision-making process. They can evaluate the tweet from both perspectives and then select the class label for that tweet. An example of this interface is shown in Figure 1 (b).
To generate the reasoning, we explore the capabilities of LLMs by prompting an LLM with the message and proper instructions. Since LLMs have been trained on a massive corpus, they have good knowledge of subjects like Transportation Means, but for the annotation task, we need to provide a definition of the subject that represents the task's requirements, such as institutional insights. Moreover, LLMs usually generate long reasoning that can overwhelm and confuse users with details, so we need to provide short and informative reasoning. The procedure for generating reasoning by LLM is as follows:
(1) We prompt the LLM with the given tweet and the expert-provided definition of class $c$, and ask the LLM to generate the reasoning about why the tweet is relevant or irrelevant to the class $c$.
(2) We extract all sentences from the generated reasoning.
(3) Again, we prompt the LLM by providing these extracted sentences as the options and ask the LLM to select the reasoning from these options.
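The three steps above can be sketched as a small two-pass routine. Everything specific here is an assumption for illustration: `llm` is a hypothetical callable that maps a prompt string to a response string, the prompt wording is invented (the prompts used in the study are not reproduced), and sentence splitting uses a simple regular expression.

```python
import re
from typing import Callable

def generate_short_reasoning(tweet: str, class_name: str, definition: str,
                             relevant: bool, llm: Callable[[str], str]) -> str:
    """Two-pass reasoning: generate a long rationale, then pick one sentence."""
    stance = "relevant" if relevant else "irrelevant"

    # Step (1): ask for a rationale, grounded in the expert-provided definition.
    first_prompt = (
        f"Definition of {class_name}: {definition}\n"
        f"Tweet: {tweet}\n"
        f"Explain why this tweet is {stance} to {class_name}."
    )
    long_reasoning = llm(first_prompt)

    # Step (2): split the rationale into sentences.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", long_reasoning) if s.strip()]
    if not sentences:
        return ""

    # Step (3): offer the sentences as numbered options and ask for one of them.
    options = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))
    second_prompt = (
        f"Tweet: {tweet}\n"
        f"Options:\n{options}\n"
        f"Reply with the number of the option that best explains why "
        f"the tweet is {stance} to {class_name}."
    )
    match = re.search(r"\d+", llm(second_prompt))
    index = int(match.group()) if match else 1
    if not 1 <= index <= len(sentences):
        index = 1                        # fall back to the first sentence
    return sentences[index - 1]

def _fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if "Options:" in prompt:
        return "2"
    return "The tweet mentions a car. The road is blocked by debris."

example = generate_short_reasoning(
    "Tree down on Main St, cars can't get through",
    "Transportation Means",
    "Messages about the state of roads and vehicles (placeholder definition).",
    True,
    _fake_llm,
)
```

Constraining the second pass to a fixed set of options keeps the displayed rationale to one sentence, which matches the stated goal of short and informative reasoning.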

Context
The Context interface leverages insights from disagreements between Experts and Beginners, assisting Beginners in identifying potential errors and uncovering hidden contextual information within the displayed tweet. This interface incorporates two key elements: first, it provides cues in the form of hints for keywords that have confused annotators in the past and led to incorrect decisions. Second, it reveals the accurate context of the presented tweet, which might not be explicitly stated, by utilizing ground truth data with the assistance of LLM-based summarization techniques, as shown in Figure 1 (c).
For the first hint, we specify the tokens in the given sample that may cause ambiguity for the user in selecting the correct label. For the second element, we utilize feedback and reasoning provided by both Experts and Beginners, gathered during our formative study, with a specific focus on areas where they disagreed.
To extract ambiguous tokens for a given label, we identify the tokens in our dataset that are distributed almost equally across the relevant and irrelevant classes. However, these tokens need to occur at least min_freq (3 in our experiments) times in each class. For measuring the ambiguity of a token $t_i$ in a class $c_j \in \{\text{relevant}, \text{irrelevant}\}$, we employ the ambiguity measure $A(t_i, c_j)$ as described in [43]: $A(t_i, c_j) = \mathrm{tf}(t_i, c_j) / \mathrm{tf}(t_i)$, where $\mathrm{tf}(t_i, c_j)$ is the frequency of token $t_i$ in class $c_j$, and $\mathrm{tf}(t_i)$ is the frequency of token $t_i$ in all classes. This measure represents the relative frequency of a token $t_i$ in each class, so to measure the ambiguity of token $t_i$ for the given label, we calculate $A(t_i) = \max\big(A(t_i, \text{relevant}), A(t_i, \text{irrelevant})\big)$, which is the maximum of the ambiguity measure for a token over the relevant and irrelevant classes. If a token $t_i$ occurs equally often in both classes, then $A(t_i) = 0.5$, which means the token has the highest ambiguity. If the token occurs in only one class, the ambiguity measure is equal to 1.0, which implies that the token is indicative of a particular class and is not ambiguous at all. Therefore, the range of the ambiguity measure is [0.5, 1.0]. We set a threshold max_amb (0.7 in our experiments), and tokens with $A(t_i) \le$ max_amb are considered ambiguous tokens. In our experiment, we use the samples from the disagreement part of the dataset as the training set to identify ambiguous tokens for each label (TM or DI), since these samples can represent the knowledge gap between Experts and Beginners. First, we calculate the ambiguity measure for all tokens in the training set and then select the top-k (k=3 in our experiment) tokens with the lowest $A$ measure (highest ambiguity).
To provide reasoning about the knowledge gap between users based on their feedback, we prompt an LLM with the test sample (see appendix), the labels selected by the users for that sample, and their feedback regarding their choice, and ask the model to generate reasoning. We generate the reasoning for a given sample through the following steps:
(1) For each annotator, we prompt the LLM with the text, the definition of the label, and the annotator's feedback and ask the LLM to generate the reasoning behind the annotator's prediction. Providing the definition of the label helps the LLM to consider institutional insights in generating the reasoning.
(2) We prompt the LLM with the text, the definition of the label, and the two reasonings generated in step (1) and ask the model to generate the reason behind the users' disagreement. We change the order of providing annotators' reasoning to eliminate any bias that may be caused by the order.
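The ambiguity measure and the thresholding step can be sketched as below. The function names and the input layout are hypothetical, and the sketch assumes that the top-k selection is applied to the tokens of each displayed sample; the text leaves open whether the selection is per sample or global.

```python
from collections import Counter

def ambiguity_scores(samples, min_freq=3):
    """Return {token: A(token)} for tokens seen at least `min_freq` times per class.

    `samples` is a list of (tokens, label) pairs with label in
    {"relevant", "irrelevant"} (hypothetical layout).
    """
    tf = {"relevant": Counter(), "irrelevant": Counter()}
    for tokens, label in samples:
        tf[label].update(tokens)         # term frequency per class

    scores = {}
    for tok in tf["relevant"].keys() & tf["irrelevant"].keys():
        rel, irr = tf["relevant"][tok], tf["irrelevant"][tok]
        if rel < min_freq or irr < min_freq:
            continue
        # A(t) = max over classes of tf(t, class) / tf(t); lies in [0.5, 1.0]
        scores[tok] = max(rel, irr) / (rel + irr)
    return scores

def ambiguous_tokens(scores, sample_tokens, max_amb=0.7, k=3):
    """Return up to `k` tokens of the sample with A(t) <= max_amb, most ambiguous first."""
    hits = [t for t in dict.fromkeys(sample_tokens) if scores.get(t, 1.0) <= max_amb]
    return sorted(hits, key=lambda t: scores[t])[:k]

# Toy training set, invented for illustration only.
train = (
    [(["car", "road", "road", "storm", "bridge"], "relevant")] * 3
    + [(["car", "road", "storm", "storm", "storm"], "irrelevant")] * 3
    + [(["bridge"], "irrelevant")]
)
amb = ambiguity_scores(train)
hints = ambiguous_tokens(amb, ["storm", "road", "car", "bridge"])
```

In the toy data, "car" is split evenly between the classes and is the most ambiguous token, "storm" exceeds the 0.7 threshold, and "bridge" is dropped because it occurs fewer than `min_freq` times in the irrelevant class.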

SUMMATIVE STUDY (S2)
In this study, we aim to understand how different interfaces can make a difference in reducing the knowledge gap and increasing annotation performance for the less experienced annotators in the disaster management context. S2's Research Questions (RQs) are as follows:
• RQ1. How do different annotation interface designs impact annotators' behavioral performance in terms of accuracy and efficiency?
• RQ2. Which design is perceived as the most effective in addressing the knowledge gap among annotators?

Method
5.1.1 Recruitment. For recruiting participants, we followed the same approach as our formative study. One of the authors with expertise in the field contacted the CERT volunteers both in person and via email, inquiring if they were interested in taking part in the study. Those who expressed interest were then selected for the annotation task, resulting in a total of 13 participants. For this round, we specifically enlisted Beginners, and the participants are entirely distinct from those involved in the previous study. On average, the participants have 3.5 years of experience in the field. Each participant was compensated with a gift card.
5.1.2 Data sampling and environment setup. In the final phase of our study, we selected 459 data points from the earlier disagreement analysis (section 3.4.3). These data points, indicators of disagreements between the experts and novices in disaster-related cases, serve as the basis for our investigation. Our primary aim is to validate whether our proposed designs can narrow the knowledge gap and foster consensus among annotators. To ensure impartiality, we enlisted entirely new participants who had no prior exposure to the dataset.
The study is structured to assess three distinct design conditions, employing a Latin square design for counterbalancing [12].To mitigate the influence of a potential learning effect in a within-subject design, we curated three separate datasets, namely D1, D2, and D3, using a stratified sampling approach.Within each interface, we included 40 tweet samples, evenly distributed between Transportation Means (TM) and Damaged Infrastructure (DI) classes.In both classes, 10 samples represented False Positives (FP), while the other 10 represented False Negatives (FN), based on the previous annotation task outcomes.In sum, each participant undertook the annotation of 120 tweets, including three distinct design variations, each comprising 40 tweets.
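Counterbalancing with a Latin square can be sketched as follows. This is a generic cyclic construction, not necessarily the assignment used in the study: the function names are hypothetical, participants are assigned to rows in rotation, and the pairing of datasets D1 to D3 with interfaces is left out because the text does not specify it.

```python
def latin_square_orders(conditions):
    """Cyclic Latin square: each condition appears once per row and once per column."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

def assign_orders(participant_ids, conditions):
    """Assign each participant one row of the square, cycling through the rows."""
    square = latin_square_orders(conditions)
    return {pid: square[i % len(square)] for i, pid in enumerate(participant_ids)}

interfaces = ["Highlight", "Reasoning", "Context"]
orders = assign_orders(list(range(1, 14)), interfaces)   # 13 participants
```

With 13 participants and three conditions, the rows cannot be used equally often; in this rotation the first row is used five times and the other two rows four times each.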

Training and annotation exercise.
Before commencing the actual annotation task, a training session was conducted to acquaint our participants with the task's objectives and the various system interfaces.We began by introducing the three interface designs and providing operational instructions.Interactive training ensued, featuring example tweets from their respective class or label.Active participation and decision-sharing were encouraged during this phase.Participants were also trained on system features, such as task tracking, interface navigation, and survey completion.
Participants received individual access to the application through a unique URL and user credentials for the primary annotation task.Approximately 3 hours were allocated for task completion.Each user accessed one dataset at a time, transitioning to a different interface upon completion.The sequence of dataset presentation was individualized and determined via the Latin square technique [12].
After annotating each interface, participants completed a mandatory survey assessing efficiency, effectiveness, and knowledge gap reduction, using a Likert scale from 1 (lowest satisfaction) to 7 (highest satisfaction).

Results
Since the summative study relied on ground truth-driven data and was conducted within subjects, achieving consensus among annotators was not our primary focus. A common way to measure performance in HCI research is to examine dependent variables that reveal the impact of a design [39]. In experimental research, two such variables are Efficiency (how fast a user can finish a task) and Accuracy (how error-free or precise users are in completing a task) [39]. We established five metrics to gauge the performance of the annotation task and determine which design was most effective in enhancing annotation accuracy and efficiency. The first two metrics assess users' behavioral accuracy and efficiency by examining the outcomes of the annotation exercise. The remaining three metrics focus on user perceptions, measuring attitudes regarding accuracy, efficiency, and knowledge gap reduction, as determined through survey responses.
5.2.1 Behavioral Accuracy. Behavioral accuracy per user evaluates the correct annotation of samples against the ground truth for a specific interface. To compare accuracy across the three designs, we used the Kruskal-Wallis test and post hoc Dunn analysis. In scenario S1 with all samples considered, the mean accuracy for the Highlight design was 0.57, the Reasoning interface 0.54, and the Context design 0.57, with no significant differences. Figure 2 (a)'s box plot illustrates this, where the white circle indicates the mean. In S2, we excluded samples with less than 5 seconds spent per question due to accidental clicks without reading the tweet or cues. For example, a participant mentioned, "Answers require clicking on very small (at least on my screen) radio buttons. I could not select an option by clicking on the text next to the radio buttons or vicinity. This led me to make a mistake on one of the questions". Another point of consideration is that the average time taken per question by all users exceeds 5 seconds, supporting the exclusion of those samples. In scenario S2, the mean accuracy values shifted for the Highlight interface to 0.56 and for the Context interface to 0.58, while no change was noted for the Reasoning interface. We examined another scenario, S3, in which we excluded one participant's data from the evaluation due to their significantly shorter time spent compared to all other annotators. For S3, the mean accuracy scores updated to 0.55 for the Highlight interface, 0.52 for the Reasoning interface, and 0.58 for the Context interface. However, in all these scenarios, no statistically significant differences were observed among the three design conditions or in the pair-wise comparisons.
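The per-user accuracy computation and the 5-second exclusion rule can be sketched as below. The record layout (`answer`, `truth`, `seconds`) and the function name are hypothetical, and the data are invented; the sketch only illustrates how the filtering changes the denominator.

```python
def mean_accuracy(responses, min_seconds=0.0):
    """Share of correct answers among responses that took at least `min_seconds`.

    Each response is a dict with keys "answer", "truth", and "seconds"
    (hypothetical layout). Returns None if every response is excluded.
    """
    kept = [r for r in responses if r["seconds"] >= min_seconds]
    if not kept:
        return None
    return sum(r["answer"] == r["truth"] for r in kept) / len(kept)

# Invented responses from one annotator on one interface.
log = [
    {"answer": "yes", "truth": "yes", "seconds": 12.0},
    {"answer": "no",  "truth": "yes", "seconds": 2.0},   # likely accidental click
    {"answer": "no",  "truth": "no",  "seconds": 8.0},
    {"answer": "yes", "truth": "no",  "seconds": 6.0},
]
acc_all = mean_accuracy(log)                      # all samples
acc_filtered = mean_accuracy(log, min_seconds=5)  # responses under 5 s excluded
```

The resulting per-user scores for each interface would then be the inputs to a rank-based comparison such as the Kruskal-Wallis test mentioned above.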
From the box plot, it is clear that the median value for the Highlight interface surpasses the mean, suggesting that some participants achieved exceptionally high performance with this interface.However, there is a notable difference between the maximum and minimum range of performance outcomes for Highlight and Reasoning interfaces, which indicates some participants achieved noticeably lower performance using these interfaces as well.Both min and max accuracy scores for the Context interface are better than those of the other two designs.
We further explored using the Chi-square test (contingency table) and conducted a question-wise assessment for all participants across the three designs in scenario S1. Since each interface comprised 40 tweets, we designated the initial tweet in all interfaces as question 1, the second as question 2, and so forth. We computed the accuracy score for each question across all participants. In our findings, the Context interface outperformed both the Highlight and Reasoning interfaces for 13 questions and outperformed one interface for 3 questions. Among these questions, 7 cases exhibited significant differences compared to the Highlight and Reasoning interfaces (p < 0.04). Similarly, the Highlight interface outperformed two designs for 11 questions and outperformed one interface for one question. Six of these cases showed significant differences (p < 0.03). Lastly, the Reasoning interface demonstrated better accuracy in 10 questions compared to the other two designs and outperformed one design for 4 questions. We observed 3 cases being significantly better (p < 0.03). For two questions, all three designs exhibited similar accuracy scores, with no design outperforming the others.
5.2.2 Behavioral Efficiency. Behavioral efficiency measures the speed and completion time of annotators using different interface types. The aim is to measure which interface is efficient in providing information and allowing annotators to quickly annotate or label data with minimal time and effort. When considering data from all annotators (S1), the mean completion time for annotating 40 tweets was 16.59 minutes for the Highlight design, 25.23 minutes for the Reasoning interface, and 23.05 minutes for the Context design. Notably, the Highlight interface appeared to be the fastest in terms of completion time. Although there was no statistical significance observed when comparing the speed across all three design conditions, we detected significant differences in pair-wise comparisons. Specifically, the Highlight interface proved to be significantly faster than both the Reasoning interface (p < 0.05) and the Context interface (p < 0.03).
In scenario S2, where we excluded samples with less than 5 seconds spent per tweet, the mean completion times remained relatively stable: 25.21 minutes for the Highlight design, 25.21 minutes for Reasoning, and no change for the Context design. In scenario S3, the average completion times increased across all interfaces. Specifically, for the Highlight design, the mean completion time extended to 17.35 minutes, while for Reasoning, it reached 26.70 minutes, and for the Context interface, it was 23.93 minutes. The Reasoning interface took up the largest share of users' time. In both S2 and S3, the Highlight interface stood out as significantly faster (p < 0.03) than the other two designs. This occurred mainly because, with the Highlight interface, annotators did not need to invest additional time in reading AI-generated explanations, allowing them to complete the task more swiftly.

Attitudinal Accuracy.
Attitudinal accuracy assesses the perceived accuracy of annotators based on their survey responses. We posed a specific question for this metric: "I found the way the current interface provides information enables accurate annotation decisions with less error", aiming to assess the annotators' subjective perception of the interface's ability to support them in making correct and error-free annotation decisions. Respondents used a Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree) to rate their agreement. The mean attitudinal accuracy score for the Highlight interface was 3.08, for the Reasoning interface it was 3.77, and for the Context interface, it reached 4.54 out of 7. It is noteworthy that annotators believed the Context design facilitated more accurate decision-making compared to the other two interfaces, even though there appeared to be no significant behavioral accuracy differences between the Highlight and Context designs. The agreement scores for the Highlight interface exhibit significant variability, with some participants assigning a score of 1 while others rated it as 7. A similar pattern of variability is observed for the Reasoning interface. In contrast, for the Context design, the range of agreement scores falls between 3 and 6. From the Kruskal-Wallis analysis, we found a statistically significant difference between these three conditions, where the Context design achieved a better attitudinal accuracy score than the other designs.

5.2.4 Attitudinal Efficiency. To observe our annotators' perceptual efficiency, we asked a question in the survey, "I found the way the current interface provides information, enables fast annotation using less time". The goal is to understand whether the participants found the interface to be time-saving and efficient in its information presentation and annotation process. The mean attitudinal efficiency scores were as follows: 3.54 for the Highlight interface, 3.15 for the Reasoning interface, and 4.69 for the Context interface, on a scale of 1 to 7. The Reasoning interface achieved the lowest score, aligning with our observations from behavioral efficiency, where annotators took the longest time to complete tasks using this interface due to the additional time spent reading AI-generated explanations. Surprisingly, while the Highlight design was found to be the fastest, annotators tended to perceive the Context design as the most efficient for facilitating effective decision-making. This demonstrates that users have a strong preference for the Context interface (p < 0.04).

5.2.5 Attitudinal Knowledge Gap Perception. Our last survey question, "I found the way the current interface provides information, helps me learn the aspect that I could overlook otherwise", was intended to determine whether the way the interface presents information assists annotators in learning and understanding aspects they might otherwise have missed or overlooked during decision-making. The mean survey scores were as follows: 3.23 for the Highlight interface, 4.31 for the Reasoning interface, and 5.54 for the Context interface, on a scale of 1 to 7. Statistically significant differences were observed among these conditions (p < 0.03), suggesting that annotators perceived the Context design as offering more valuable support compared to the other designs. As one of the annotators mentioned, "I found both, the context and reasoning options more useful than the highlighting. On one occasion I answered a question without looking at the helping text, then I read the helping text and changed my response for better, I hope. However, the design where I had both, the keywords on the tweet and helping text options, may have helped me speed up my responses". Interestingly, the Highlight design received the lowest score in terms of attitudinal metrics, despite its behavioral metrics showing the opposite trend.

DISCUSSION & IMPLICATIONS FOR DESIGN
This section provides insights we learned from S1 and S2 that can motivate future research and annotation user interface designs.We will discuss how the annotation user interface design can be improved by leveraging the two notions of confusing words and hidden context.In discussing the possible expansion, we will first provide how the two can be applied depending on the data type.Next, we will discuss how the two can be applied to different domains in disaster management.Finally, we discuss how an advanced technical pipeline can be considered to advance design for closing the knowledge gap using confusing words and hidden context.

Design directions
Advanced annotation designs can be adopted based on the data type and modality, focusing on addressing ambiguous elements and hidden contexts.Several potential directions in this regard are outlined below.
6.1.1 Adopting in multi-step annotation workflows. Unlike microtasks, advanced tasks require comprehensive reasoning based on domain knowledge. In that sense, the two notions can be applied to the interface to help annotators apply a "divide-and-conquer" approach in annotation. This line of designs might consider implementing a step-wise workflow [56] that guides annotators through a sequence of stages. For instance, in the first step, annotators can focus on identifying the hidden context and highlighting ambiguous elements. In subsequent steps, they can provide annotations based on the clarified context, leading to more accurate and comprehensive annotations. While such designs can help annotators to be comprehensive in checking multifaceted aspects of the annotation, the threat of this approach is a possible increase in time spent. Annotators might be guided through additional explanations, links to external resources, or suggestions for seeking additional information. After the detailed clarification step, there can be a resolution step where the system offers guidance by suggesting strategies for disambiguation, such as considering the surrounding context or consulting domain-specific knowledge.
6.1.2 Applying the notions to images and videos. Our study findings suggest that confusing words and hidden context can benefit annotation research in images and videos. Computer vision models are increasingly becoming contextualized and applied in professional tasks. Confusing "patterns", for example, can be applied in detecting falsely correlated objects in classification or object detection models, where an object has a strong correlation with a particular class or object type but does not by itself mean that the data point should be classified as such (e.g., tennis racquets or baseball bats in a gender classifier [24]). Hidden context can also help provide underrepresented knowledge in varying domains, such as damage assessment or medical imaging.
6.1.3 Adding feedback loops for improving context and ambiguous word lists. Introducing feedback loops from either domain experts or trained AI models into the annotation process can offer annotators personalized feedback tailored to their annotation performance [23,57]. These suggestions may originate from domain experts who can provide insights into beginners' decisions or be delivered through AI agents. Such feedback can pinpoint annotators' strengths and areas requiring improvement, including their ability to handle hidden contexts and ambiguities. This feedback can take the form of comments, ratings, or suggested improvements directly within the annotation interface. The system collects and aggregates feedback from multiple annotators or analyzes it to identify common themes. Feedback integration can be done in real time or on an iteration basis, so that annotators can contribute to an evolving set of hints.
6.1.4 Multimodal contextualization and ambiguity detection. In a multimodal interface, the concept of explaining hidden context can be extended to include not only textual but also visual or auditory context. For example, if annotating an image with text, the interface could provide explanations for why certain text elements were chosen or why specific regions of the image are relevant. For instance, one of our participants mentioned, "There were links that we didn't get to see or open in the tweet. So if I could have seen the link then that might have given me more context". The system should be capable of identifying and highlighting ambiguous elements not only in text but also in visual or auditory forms. Offering alternative word choices, explaining ambiguous visual elements, transcribing the speech, or clarifying confusing words could be useful to the context. This might involve using a combination of NLP techniques for textual content and computer vision or audio processing techniques for other modalities.

Application in disaster management
The scope, magnitude, and complexity of disasters influence social media use and value [32].On the ground decision-making is driven by these disaster characteristics and the relevance of social media content to support such decision-making processes for response across different sectors (e.g., transportation) rapidly changes.Thus, annotation tasks on such social media content for disaster data analytics systems could face varying relevance of words and context in messages for response sectors and lose value as operations transition from response to recovery.As a consequence of these dynamics, combined with early indications in our S1 study that annotation tasks may have been complex and overwhelming for beginning annotators, the future design could focus on annotation aids for microtasks with binary classifications.

Advancing technical pipeline
Our experimental interfaces highlighted both technical challenges and opportunities for future designs. The size of the dataset used for extracting relevant and ambiguous words in the Highlight and Context interfaces can impact accuracy. Expanding the dataset size could enhance method performance. Additionally, LLMs have shown powerful capabilities in NLP tasks such as keyword extraction and generation. Thus, utilizing LLMs with appropriate prompts that capture the context of the task can be an alternative approach to generate a list of relevant and ambiguous words for a target label (e.g., Transportation Means) based on a given message. Future designs should consider the potential bias in the generation process of LLMs. There is also a concern that explanations generated by LLMs may be overly verbose or detailed. If explanations are too extensive, users might find it challenging to digest the information quickly, which can lead to cognitive overload and hinder the decision-making process rather than aid it. Utilizing prompt templates that enhance LLM faithfulness to account for contextual knowledge can improve reasoning quality in future tasks [63]. Including contextual information such as label definitions in the prompt helps the LLM understand the context and criteria for different labels, enabling it to make more informed decisions during reasoning tasks. However, while including label definitions aids the LLM in reasoning, it does not guarantee that the model will consistently adhere to these definitions. Deviation from the provided definitions may occur due to various factors, such as biases in the training data or the model's inherent tendencies. Furthermore, the lessons of ambiguous words and hidden context in the proposed interface designs, which focus on a single modality of data (i.e., text), could inspire the technical implementation of annotation aids for multimodal annotation tasks. The multimodality of messages including images and videos would require the detection of relevant and ambiguous objects to support annotation aids on the interfaces. Advanced models using transformers [38] have shown remarkable capabilities for computer vision tasks that could be leveraged for such annotation aids to support interfaces that aim to reduce the knowledge gap on multimodal annotation tasks.

LIMITATION & CONCLUSION
Regarding the limitations of this work, we utilized the same dataset for both testing the interface and generating context to address disagreements between Experts and Beginners in the Context interface. While this may not align with real-world scenarios, we adopted this approach to create a suitable context for our experimental condition. In practical applications, an accurate model trained on historical annotated data can be employed to generate a context explanation for resolving user disagreements. Additionally, S1 focused on only one event type due to time and cost limitations and employed a single-step expert verification for ground truth preparation. Classifying our participants as experts and beginners could leave variability among the Beginner annotators unaccounted for, which could affect generalizability. Incorporating multivariate data and preparing ground truth with additional feedback from multiple domain experts could broaden applicability.
Our study findings underscore the critical nature of annotation tasks in disaster management, which can introduce ambiguity for annotators across varying levels of expertise. We investigated the potential of AI-assisted interface designs to mitigate the knowledge gap among less experienced annotators. The empirical study reveals that the most prevalent reasons for disagreements in textual annotation tasks are confusing words and hidden context. In response, we introduced a novel interface that provides cues to address these challenges. In future work, we plan to expand this research to accommodate multimodal data annotation tasks and to enhance the predictive capabilities of the implemented technical pipeline so that it generates aids that further improve annotation performance.
Instruction: Two annotators generated the following reasoning for the following task and tweet. What is the reason for their disagreement?
Task: Based on the following definition, is the following tweet relevant to <disagreement-label>?
Definition: <given-definition>
Tweet: <given-tweet>
Annotator 1: <annotator1-reasoning>
Annotator 2: <annotator2-reasoning>
Answer: The reason for their disagreement is that
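The template above can be filled programmatically before it is sent to an LLM. The following is a minimal sketch in Python. Only the template wording comes from the prompt shown above; the helper name `build_disagreement_prompt`, its parameter names, and the input validation are hypothetical and are not taken from the study's implementation.

```python
# Template wording follows the prompt shown above; the placeholder names
# in braces are illustrative stand-ins for the <...> slots.
DISAGREEMENT_TEMPLATE = (
    "Instruction: Two annotators generated the following reasoning for the "
    "following task and tweet. What is the reason for their disagreement?\n"
    "Task: Based on the following definition, is the following tweet relevant "
    "to {label}?\n"
    "Definition: {definition}\n"
    "Tweet: {tweet}\n"
    "Annotator 1: {reasoning_1}\n"
    "Annotator 2: {reasoning_2}\n"
    "Answer: The reason for their disagreement is that"
)


def build_disagreement_prompt(label, definition, tweet, reasoning_1, reasoning_2):
    """Fill the template (hypothetical helper, not the authors' code)."""
    fields = {
        "label": label,
        "definition": definition,
        "tweet": tweet,
        "reasoning_1": reasoning_1,
        "reasoning_2": reasoning_2,
    }
    for name, value in fields.items():
        # An empty slot would yield a prompt the model cannot answer sensibly.
        if not value or not value.strip():
            raise ValueError(f"missing value for '{name}'")
    cleaned = {name: value.strip() for name, value in fields.items()}
    return DISAGREEMENT_TEMPLATE.format(**cleaned)
```

The prompt deliberately ends with an unfinished sentence, so the model's completion reads as a direct statement of the reason for the disagreement.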

Figure 1: Providing assistance and inquiring if the tweet belongs to TM class in (a) Highlight interface, (b) Reasoning interface, (c) Context interface

Figure 2: Summative study results under the S1 situation. (a) Behavioral Accuracy: box plots of accuracy scores of all users for the three interfaces, (b) Behavioral Efficiency: box plots of total completion time (in minutes) of all users for the three interfaces, (c) Attitudinal Accuracy: box plots of users' perceived accuracy ratings on a scale of 1 to 7, (d) Attitudinal Efficiency: box plots of users' perceived efficiency ratings on a scale of 1 to 7, (e) Attitudinal Knowledge Gap Perception: box plots of users' knowledge gap perception ratings on a scale of 1 to 7

Figure 4: Prompts for generating reasoning behind the users' disagreement

Providing insights from past annotation decisions. Exploring different strategies for better grasping the importance of elements in the annotation task, possibly by relying on the collective wisdom or opinions of others, has been useful in past literature [17,18]. P4 mentioned that sharing what other people thought or getting insights from others' perspectives might be more helpful in determining the significance of certain aspects of the task, as he remarked, "This was relevant because of this aspect or something like that. So maybe if you pointed out what other people thought, would be helpful as well". Annotation decisions by experts also provide clear and well-illustrated ideas of how the annotation task should be performed. Beginners can learn from these examples by observing how experts approach complex or ambiguous cases.
Some participants also noted that, in a learning context, example-based guidance and role-playing exercises can be helpful for teaching, as they provide practical, real-life scenarios for learners to engage with. For instance, P7 said, "Example-based guidance, such as what can be the good things, what can be the bad things that you have to kind of role-playing exercise would be super beneficial".