Towards Designing a Question-Answering Chatbot for Online News: Understanding Questions and Perspectives

Large Language Models (LLMs) have created opportunities for designing chatbots that can support complex question-answering (QA) scenarios and improve news audience engagement. However, we still lack an understanding of what roles journalists and readers deem fit for such a chatbot in newsrooms. To address this gap, we first interviewed six journalists to understand how they answer questions from readers currently and how they want to use a QA chatbot for this purpose. To understand how readers want to interact with a QA chatbot, we then conducted an online experiment (N=124) where we asked each participant to read three news articles and ask questions to either the author(s) of the articles or a chatbot. By combining results from the studies, we present alignments and discrepancies between how journalists and readers want to use QA chatbots and propose a framework for designing effective QA chatbots in newsrooms.

Nonetheless, prior research shows that readers are often left with questions for journalists/authors after reading news [11,83]. Readers use various means such as news comment sections, email, and social media direct messages to ask their questions [18,51]. However, for reasons such as the abundance of questions, lack of time, and absence of direct incentives, journalists cannot respond to the queries [15,55]. On the other hand, both journalists and newsrooms have an interest in engaging with the audience/readers, to learn what aspects of their news raised curiosity and questions in readers' minds [53,56], and to understand if readers are questioning the newsroom's journalistic integrity [52]. This knowledge is critical for learning what attracts readers to news content, targeting the right audience for news recommendations, building trust with the audience, and so on [42,86]. According to Meier et al. [53], "a paradigm shift away from a 'lecturing' approach to a 'dialogue' approach is a key factor for journalism in a post-truth era". Thus, there is a clear interest from both readers' and newsrooms' perspectives to engage with each other in a dialogue fashion through question-answering [53,60].
With the rise of Large Language Models (LLMs) [9], the news media industry, like many other industries, is investigating how Artificial Intelligence (AI) and LLMs can be integrated with news production, distribution, audience engagement, and other related processes to improve efficiency, increase trust and accountability, and open new possibilities [27,78,79]. For instance, the BBC has started experimenting with how to leverage bot technologies to reach new audiences on messaging platforms and social media [3,4]. They have developed a prototype of an in-article chatbot to help less engaged audiences understand big news stories [39]. Other media outlets have started experimenting with Meta (previously known as Facebook) chatbots [92]. While these experiments are ongoing individually in different newsrooms, facilitating question-answering (QA), one of the primary modes of interaction and dialogue in chatbots [44], remains a challenge for newsrooms [74,76]. While researchers are working towards mitigating technological limitations such as the lack of truthfulness and the existence of biased and formulaic text in LLM responses [19,84], we argue that it is also important to take a systematic approach to understand how journalists and readers want to use a chatbot for question-answering and what roles they deem fit for such a chatbot. For instance, are journalists comfortable with a chatbot answering readers' questions about their written news article? What types of questions do readers expect the chatbot to answer? And what types of questions do readers expect the authors to answer? Without answering these questions, we may end up designing QA chatbots that are detached from readers' and newsrooms' needs.
This paper takes a formative step towards designing LLM-powered chatbots that will be able to answer open-ended questions from readers, reduce journalists' burden of answering a large number of queries, and at the same time elevate communication between readers and journalists. We call such chatbots QA chatbots. For the rest of the paper, we use the terms QA chatbots and chatbots interchangeably to refer to chatbots that are able to answer questions from news readers, unless specified otherwise. While we only focus on question-answering in the context of LLMs, in a real-world setting, like many LLM-powered chatbots (e.g., ChatGPT), a QA chatbot (or simply a chatbot) will likely have other forms of interaction (e.g., recommending news). To achieve our research agenda, we conducted two studies in this paper. First, we interviewed six professional journalists to understand how journalists currently answer questions from readers and, more broadly, interact with readers. The interviews revealed that most journalists now obtain feedback and answer questions from readers through one-to-one conversations over email, direct messages, or in-person meetups, except for a few member gatherings organized by the news outlets. All participants were enthusiastic about using a QA chatbot as a mediator between themselves and readers. They want the chatbot to answer factual and redundant questions from readers, while directing questions that seek subjective interpretation to them.
We subsequently conducted a between-subject experiment on Amazon Mechanical Turk (MTurk) with 124 participants with two goals: (1) understanding to what extent readers' perspectives and questioning patterns match how journalists see the functionalities of a QA chatbot; and (2) understanding how a QA chatbot might modulate readers' questioning patterns. The second goal is motivated by prior research showing that chatbots can significantly modulate user behaviour [14,43,48]. In the study, participants were instructed to read three articles from three different domains (health, politics, and environment) and ask questions to the authors of the articles or a chatbot. We analyzed their responses using a grounded theory approach [22] and a set of quantitative measures. Our findings suggest that readers mostly asked questions in two broad categories: Information, questions that seek short-form facts and long-form details; and Interpretation, questions that seek explanation and opinion. We found that readers' questioning patterns differ significantly depending on the receiver: authors vs. chatbot. Participants asked questions seeking facts more frequently to the chatbot than to the authors. In contrast, they asked questions seeking explanation and opinion with multiple facets more frequently to authors than to the chatbot.
These results indicate that readers' perspectives about the role of QA chatbots match journalists' views to some extent. Readers too predominantly want the chatbot to answer factual questions while seeking to engage in subjective conversations with journalists. Despite this alignment, we found evidence that a significant number of readers do not share the view that chatbots should only answer factual questions. For instance, while not as frequently as to the authors, readers did ask the chatbot a considerable number of questions seeking subjective interpretation. Similarly, a considerable number of readers wanted authors to answer factual as well as subjective questions. Finally, we found that chatbots can modulate readers' behavior significantly, which researchers and news organizations should be aware of. For example, articles with lower perceived quality motivated readers to be critical and question journalistic integrity when the authors were present in the loop. However, when the chatbot was present, readers did not engage on the same level, even when the perceived quality of the news article was low.
Overall, these results indicate that researchers and news organizations need to understand journalists' and readers' perspectives and expectations thoroughly before determining the functionalities of a QA chatbot. The results also highlight the need to devise policies around chatbots and communicate the policies to journalists and readers effectively. We conclude this paper by proposing a framework for devising a policy for designing QA chatbots for online news. In summary, our contributions are as follows: (1) An interview study with six journalists to understand their current practices and challenges for answering questions from readers and how they perceive the role of an intelligent chatbot in this regard; (2) An online experiment with 124 participants to understand what questions readers typically ask about a news article and how the questions change if a chatbot is present at the receiving end, instead of the authors; and (3) A design framework for developing a chatbot policy, informed by the interviews, online experiment, and literature review.

BACKGROUND AND RELATED WORK
In this section, we discuss prior work on understanding communication between journalists and readers. We also review Question-Answering (QA) based chatbot technology.

Audience Engagement and Chatbots
Audience or reader engagement plays an important role in news production, dissemination, and consumption [88]. Readers often want to connect with authors to provide compliments and encouragement, engage in intellectual discussion, ask follow-up questions, or point out shortcomings [16,65,70]. Journalists want to connect with readers for occupational duties, community engagement, obtaining new directions, or self-improvement [2,81,82,85]. As most news sites moved online in the early 2000s, communication between journalists and readers evolved from one-to-one interaction to various forms of online activities. For example, the comment sections of news sites showed promise as a way for readers to provide feedback to journalists and incite community discussion [18,66]. However, several concerning issues such as the use of abusive language, spam, and a lack of manpower for moderation have contributed to comment sections becoming virtually non-existent on major news sites [57].
Social media (e.g., Twitter) has drastically changed the paradigm of audience engagement and news consumption [87,88]. About half of the U.S. population now accesses news from social media [59]. Thus, news media are no longer the authoritative sources of news [6]. With this shift in dynamics, news organizations are exploring different ways to adapt to social media environments and engage audiences [28]. One of the promising ways is designing chatbots for social media [28,46]. For example, several large international news organizations such as ABC News, NBC News, and BBC News have chatbots in Facebook Messenger [92]. These chatbots, along with others in this area [46,49,92], largely focus on supporting fact-checking, news dissemination, and recommendation.
However, similar to how technology has shaped audience engagement throughout history, it is likely that LLMs will influence the design of chatbots for news outlets [58]. LLMs can generate free-form text for a query, opening up opportunities for designing chatbots for complex QA tasks. At the same time, LLMs can hallucinate and provide wrong or even non-existent information to users [84]. Thus, it is imperative that we study what questions journalists and audiences want a chatbot to answer and align the design of the LLM-powered chatbot with that expectation. We aim to address this gap.

How Do End-users Perceive Chatbots?
Chatbots are now ubiquitous in many domains, from Facebook Messenger to most online services, customer-facing platforms [7,93], educational platforms [72], and health-related services [21,91]. Chatbots have become a popular choice for solving many user-centric problems in HCI [7,25,30,33,69,75,90,91]. Beyond developing chatbots, HCI research has a long history of studying how chatbots impact user behavior, interactions, and expectations. For example, Luger et al. [48] found user expectations to be dramatically different from the capabilities of the chatbot. Liao et al. [43] found that preference for humanized social features varies from user to user based on the underlying task requirements. Several other studies reported that user preference and satisfaction vary largely among individuals [31,44].
Given the prior evidence of chatbots mediating user expectations and behavior in other domains, we believe it is essential to study how news readers perceive the capabilities of a chatbot and how that impacts their expectations of the chatbot. To study that, we explore what questions readers would ask a chatbot in the context of question-answering, the primary form of interaction with chatbots [44].

Question-Answering in NLP
Question-answering (QA) is a core NLP task [63,64]. Extractive QA, or Reading Comprehension, is the simplest form of QA task, where a model answers a question from a context [63]. The Reading Comprehension task poses many challenges. For example, a question can be unanswerable because of an unverifiable presupposition [38]. QA systems that answer questions containing unverifiable presuppositions may produce inappropriate answers [32]. More complex QA tasks include generating free-form answers from a context or generating free-form answers from an open domain. Another way to categorize QA systems is to consider whether the system is rule-based, ML-based, or hybrid [90]. Rule-based systems can provide exact answers to pre-defined questions but do not scale to a large volume of questions. ML-based systems have more language understanding but require a large amount of data and model training. Hybrid systems combine rule-based answers and ML to answer questions [89].

STUDY WITH JOURNALISTS
We conducted IRB-approved semi-structured interviews with six journalists to understand how journalists interact with readers and answer their questions, what challenges they face, and what opportunities exist for a QA chatbot in the process. The interviews took place between January and February 2023.

Participants
We contacted professional journalists who also work as faculty at our university's journalism school. We also reached out to journalists at local and national news outlets. Six journalists agreed to be interviewed. Their information is given in Table 1. Participation was voluntary with no compensation.

Method
We conducted two interviews in person and the others over Zoom. One author of this paper conducted the interviews while another author took notes. Each interview was divided into three parts. First, after gathering informed consent, we asked participants to describe how they currently interact with readers, answer questions from readers, and manage conversations with readers. After that, we asked participants about any challenges they faced in the process. Finally, participants brainstormed with the authors about how a chatbot can enrich journalist-reader communication. Each session lasted around 1 hour. The semi-structured questionnaire is available in our OSF repository.

Data Collection and Analysis
We recorded audio for all interviews and made anonymized transcripts for each. Two authors of this paper analyzed the interview data, following a thematic analysis process [8]. Throughout the analysis, we refined the themes and the relevant passages required to support them. We present the findings in the next section.

How do journalists communicate with readers?
Participants mentioned several ways to interact with readers. Most interactions happen asynchronously, over email and social media (P1-P6). All participants valued communication with readers (P1-P6). However, they often do not have enough time to interact with readers (P1, P4, P6).
Participants mentioned that comment sections are non-existent in their newsrooms, except for a few special reports (P3-P5). Newsrooms do not have enough workforce to moderate comment sections (P5). Some newsrooms periodically invite a subset of readers (often subscribers) to virtual or in-person gatherings for feedback and discussion (P1). When asked about chatbots, P2-P4 mentioned that they are aware of several chatbots deployed by news organizations. However, these chatbots are not used for answering questions, but rather for recommending articles to readers.
3.4.2 Challenges for journalists. Journalists mentioned that maintaining fruitful discussions with readers at scale is their main challenge (P1-P6). They are often rushed from one story to the next. As a result, it is difficult for them to engage with readers on a previous article. Even when they interact with readers, they fear abuse and threats, especially if the article's topic is polarized (P1, P5).

3.4.3 Opportunities and requirements for a chatbot. We asked participants to comment on the opportunities for a chatbot in this domain. All our participants were aware of the recent bloom in LLMs. P3 and P4 mentioned that some newsrooms already have internal talks on how to use these technologies responsibly. P2 referred to the BBC's recent efforts in designing chatbots [4]. However, participants mentioned several challenges for AI-based chatbots. First, participants think it is essential to decide the purpose of the chatbot and its extent of engagement with readers. For example, P3 and P1 said:
"I wouldn't mind AI answering some factual questions from readers. I sometimes get redundant questions from readers. AI could be helpful there. However, I do not think I will use it if the reader is seeking a deep conversation or asking critical questions." (P3)
"A chatbot can be helpful to both audience and journalists. It can work as a learning medium for readers. However, I do not want to be completely separated from my audience. I would also use it cautiously. I wouldn't be comfortable knowing that an AI might misinterpret my writing and propagate that to the readers." (P1)
Finally, according to P4, it is a "three-way street," and it is important to find the "balance" and "mechanisms" for journalists, readers, and AI to interact with each other.

STUDY WITH AUDIENCE/READERS
The interviews highlighted the challenges journalists face in maintaining conversations with readers. They often do not have enough time and opportunities to hold fruitful conversations or answer questions from readers. All journalists were enthusiastic about using a chatbot as a mediator between them and readers. They wanted to stay connected with readers and answer important questions while being able to use chatbots for factual and redundant questions.
While the perspective of journalists is clear from the interviews, to design an effective QA chatbot, we still need to know how readers, the other significant stakeholder, perceive chatbots. More importantly, since journalists want to use the chatbot as a mediator, we wanted to know to what extent readers' questioning patterns match this view. We determined that conducting a study to understand how readers want to route questions between the authors and the chatbot would be essential to achieve this. We also wanted to understand how a chatbot might influence the questioning patterns of readers. Prior research suggests chatbots can significantly modulate user behaviour [14,43,48]. Thus, we seek to answer the following two research questions:
RQ1: What are the types of questions readers would ask about the news?
RQ2: How do the questions differ when readers ask the questions to a chatbot in comparison to the actual authors?
In addition, we wanted to study the effect of confounding factors, such as readers' perceived quality of the article [24] and preference for news outlets with a political alignment [41], on the process. Thus, our second set of RQs is as follows:
RQ3: How would the perceived quality of the articles influence the questions?
RQ4: How would the readers' personal preferences for news outlets influence the questions?
To answer the RQs, we decided to conduct a crowdsourced experiment. While an interview study with readers could also be useful, we decided that a crowdsourced experiment is better suited here since it would enable us to understand the perspectives of a larger population, engage readers in the task of concern (asking questions), and help us answer the RQs. This section describes the study design and protocol. All study materials, including news articles and the source code for the study interface, are available in our OSF repository. The study took place in June 2023.

Study Conditions
We conducted a between-subject study where we asked participants to read news articles and ask questions to one of the following two entities (conditions):
A. Authors: Participants were prompted to ask questions to the authors of an article. Participants received the following prompt: "Consider you have the option to directly ask questions to the authors of this article about the article and get answers from them. Please list any questions you would ask the authors if you had the opportunity to communicate with them online. You can ask for clarifications, followup information, opinions, or any other questions that you think are relevant to the article."
C. Chatbot: Participants were prompted to ask questions to a chatbot. Participants received the following prompt: "Consider there is an automated chatbot that can answer your questions about this article. Please list any questions you would ask the chatbot if you had the opportunity to communicate with it online. You can ask for clarifications, followup information, opinions, or any other questions that you think are relevant to the article."
While many chatbots can converse with humans (e.g., ChatGPT), the chatbot in our study (C) did not feature any responses or interactions. There are three reasons for that. First, we see this study as a formative step to design future chatbots. We wanted to understand user needs and perceptions before developing the chatbot. Thus, while we acknowledge that an actual chatbot will likely influence readers' perceptions, we believe we would be able to study that influence once we design a chatbot informed by this study. Second, current LLMs have several limitations such as providing information that is wrong or even non-existent, biased text generation, and so on [19]. These problems can significantly impact participants' experience in the study, which would be difficult for us to control in an open-ended setting such as ours. Finally, it is unlikely that we would be able to recruit journalists who could find the time to answer questions on the fly for the Authors condition (A). Thus, an actual chatbot capable of replying would bias the comparison against the Authors (A) condition.
Half of the participants asked questions in condition A while the other half asked questions in condition C. Our goal was to collect 375 questions for each condition. This is motivated by prior work [65], which used 1500 comments in online news to derive a taxonomy of engaging comments. However, only 3 out of 12 classes in that taxonomy are for questions, which amounts to around 375 comments (1500 x 3/12) for deriving the taxonomy for questions. We decided that around 750 questions in total (750/2 = 375 questions per condition) would enable us to capture the differences between A and C, if any.

Participants
We recruited 126 participants from Amazon Mechanical Turk (MTurk). We excluded 2 participants due to incomplete responses, resulting in 124 participants. We decided on the number of participants by considering the estimated number of responses required for answering the RQs (around 375 per condition). We considered that crowd workers typically have a short attention span and perform well on short repetitive tasks [35]. Thus, we decided to ask participants to ask only two questions per article. Participants repeated this task for 3 separate news articles. We determined the number of articles to read, along with other study variables, by conducting a pilot study with 10 undergraduate and graduate students at our university. Overall, participants found the process of reading 3 articles to be easy and quick. This also allowed us to assign one article from each of three different domains (Health, Politics, and Environment) to each participant (see subsection 4.3) and curate a wide range of questions. Finally, we anticipated that we would be able to validate the responses from crowd workers confidently if we could measure responses across three articles. Each session resulted in 3 x 2 = 6 questions. Thus, we recruited 375/6 = 62.5, rounded up to 63, participants per condition and 126 participants in total. We conducted a Monte Carlo simulation to ensure that the Structural Equation Model (SEM) proposed in the results section (subsection 5.3) has a power of over 0.8 (0.82 to be exact).
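The recruitment target can be reproduced with a few lines of arithmetic. The following is a minimal sketch that only restates the numbers given in the text; the variable names are ours and are chosen for illustration.

```python
import math

# Numbers taken from the text; variable names are illustrative.
prior_comments = 1500          # comments used by prior work [65]
question_classes = 3           # classes in that taxonomy that are questions
total_classes = 12             # all classes in that taxonomy

# Target number of questions per condition (1500 x 3/12).
target_per_condition = prior_comments * question_classes // total_classes

articles_per_participant = 3
questions_per_article = 2
questions_per_session = articles_per_participant * questions_per_article

# Round up, since a fractional participant cannot be recruited.
participants_per_condition = math.ceil(target_per_condition / questions_per_session)
total_participants = 2 * participants_per_condition

print(target_per_condition, questions_per_session,
      participants_per_condition, total_participants)
```

Rounding up rather than to the nearest integer guarantees that each condition reaches at least the target number of questions.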
The details of the participants are provided in Figure 7 (Appendix A). MTurk has several embedded functionalities to control the quality of participants. We required participants to be located in the United States (US), have at least 50 approved tasks, and be Master workers, a qualification assigned to workers who have demonstrated a high rate of success in completing a wide range of tasks. On average, it took 29 minutes for participants to complete the study. Considering the minimum wage standards in our US state, we compensated each participant $8 for their time.

News Articles and Collection Method
We curated 15 news articles from three separate domains (five articles per domain): Health, Politics, and Environment. We chose Health as there has been an increased public interest in health-related reports due to COVID-19. There are reports of increased misinformation and pseudoscience in this domain. Similarly, politics and environmental issues such as climate change are contentious areas. People tend to have divergent opinions about political and environmental issues. We anticipated these domains would result in a wide range of questions, helping us answer the RQs.
To curate the articles, we first identified the news outlets we intended to source articles from. Our objective was to select outlets that spanned the political spectrum, ensuring that the articles encompassed diverse biases. To achieve this, we consulted the AllSides media bias website [1] to identify organizations that could be categorized into five distinct groups: far left, leaning left, center, leaning right, and far right. Subsequently, we performed a Google search using a query that combined the domain with one of the aforementioned categories (e.g., "Environment articles, far left-leaning"). The first page of search results yielded various articles from multiple outlets. We then examined the first article and cross-referenced the outlet it originated from with AllSides to confirm that the outlet aligned with the intended category. We selected that article if the outlet's classification was consistent with the desired category. If not, we proceeded to evaluate the next article in the search results. All articles chosen for this study were sourced from the first page of the Google search results. We also intentionally avoided long articles, so that crowd workers could quickly read them. The average length of the 15 selected articles was around 650 words.
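The selection rule described above amounts to taking the first search result whose outlet rating matches the intended category. The sketch below illustrates that rule only; the function name, the data layout, and the placeholder outlets and ratings are hypothetical and do not reflect actual AllSides ratings or the search results used in the study.

```python
from typing import Optional


def pick_first_matching_article(
    results: list[dict],
    outlet_ratings: dict[str, str],
    target_category: str,
) -> Optional[dict]:
    """Return the first result whose outlet is rated as target_category.

    `results` is assumed to be ordered as on the first page of search results.
    """
    for article in results:
        if outlet_ratings.get(article["outlet"]) == target_category:
            return article
    return None  # no article on the first page matched the category


# Hypothetical placeholder data, for illustration only.
search_results = [
    {"title": "Article 1", "outlet": "Outlet A"},
    {"title": "Article 2", "outlet": "Outlet B"},
    {"title": "Article 3", "outlet": "Outlet C"},
]
ratings = {"Outlet A": "center", "Outlet B": "far left", "Outlet C": "far left"}

selected = pick_first_matching_article(search_results, ratings, "far left")
print(selected)
```

In this toy example the first result is skipped because its outlet is rated differently, and the second result is selected.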
By employing this systematic approach to article selection, we ensured that our study incorporated a wide variety of perspectives and minimized potential biases in the articles analyzed. The final counts for the source news outlets are: Reuters: 1, BBC News: 2, NBC: 1, NPR: 1, AP News: 1, CNN: 1, NY Times: 2, Washington Times: 1, NY Post: 2, Fox News: 2, Daily Mail: 1. To avoid potential bias towards a news organization, we removed all identifiable information such as authors' names from the articles and replaced references to the news organization's name with generic names. For each participant in the study, we randomly chose one article from each domain. Figure 8 (Appendix A) shows the number of times the articles were read by the participants.
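The per-participant assignment (one randomly chosen article from each domain) could be implemented along the following lines. This is a sketch under our own assumptions: the article identifiers are placeholders, uniform sampling is assumed, and the actual study interface may have done this differently.

```python
import random

DOMAINS = ("Health", "Politics", "Environment")


def assign_articles(articles_by_domain: dict, rng: random.Random) -> dict:
    """Pick one article uniformly at random from each domain."""
    return {domain: rng.choice(articles_by_domain[domain]) for domain in DOMAINS}


# Placeholder pool: five articles per domain, 15 in total.
pool = {domain: [f"{domain}-{i}" for i in range(1, 6)] for domain in DOMAINS}

# A seeded generator keeps the example reproducible.
assignment = assign_articles(pool, random.Random(0))
print(assignment)
```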

Study Interface
We developed a web interface for conducting the study (Figure 1). Participants used the interface to provide consent, read instructions and three news articles, and provide two questions for each article. We used Python in the backend for web services and JavaScript in the front end for supporting user interaction. We used MongoDB to save responses from participants.

Protocol
We used the "external HIT" functionality on MTurk, where the interface hosted on our server was accessible to the workers. Participants followed the following sequential workflow in the study: (1) Read instructions and provide consent: Participants read an overview of the study, including what they will be doing in the study and the requirements for completing the study. After reading the instructions, participants provided consent for the study. We notified participants that all responses will be validated by the research team and any suspected fraud or abuse will result in forfeiture of compensation and removal of data.
Table 2: A taxonomy of questions.We identified 3 high-level categories (Information, Interpretation, Others) and 7 subcategories/types from our analysis of the questions.The numbers represent counts for each category.
(d) Rate the quality of articles and provide expertise: To answer RQ3, we wanted participants to rate the quality of the articles. However, quality is a complex construct, composed of many factors [68]. Graefe et al. [24] used credibility, readability, and journalistic expertise to measure quality. These three dimensions are calculated from 16 factors (e.g., accurate, trustworthy, entertaining, coherent, etc.). We determined that 16 factors would be too many for participants to rate. Thus, we asked participants to rate the credibility, readability, and journalistic expertise of the article on a scale from 1 to 5 (higher is better). We asked participants to consider the 16 contributing factors while rating the 3 higher-level dimensions. Finally, participants self-reported their knowledge of the broader topic of the article on a 4-point scale (1 = not at all, 4 = very knowledgeable).

Quality Control
We ran the study in batches, 10 users at a time. After each batch, we manually examined each response to evaluate the quality of the questions. We discarded 2 responses (participants) due to incomplete submissions. This indicates that more than 98% of participants (124 of 126) were able to successfully complete the study without compromising the quality of the responses.

Analysis
We used Grounded Theory [22] to analyze the questions. All manual coding was conducted by two authors. At first, we selected 20% of the questions randomly and two authors open-coded this subset independently. The authors then met to finalize the codebook, discussing different dimensions and observations. The authors then coded this set again independently by following the codebook. The inter-annotator agreement after this stage was 0.80 (Jaccard's Similarity). The disagreements at this point were resolved through discussion between the two coders, with input from the full research team. The rest of the questions were divided equally between the coders. Throughout this process, the authors met regularly to discuss the codes.
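For readers unfamiliar with Jaccard similarity as an agreement measure for multi-label codes, the sketch below shows one common way to compute it: the per-question Jaccard index between the two coders' label sets, averaged over questions. The text does not specify the exact aggregation used, so the averaging step, the function names, and the toy labels are assumptions for illustration.

```python
def jaccard(labels_a: set, labels_b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two label sets."""
    if not labels_a and not labels_b:
        return 1.0  # both coders assigned no label: treat as full agreement
    return len(labels_a & labels_b) / len(labels_a | labels_b)


def mean_jaccard(coder_1: list, coder_2: list) -> float:
    """Average the per-question Jaccard index over all coded questions."""
    if len(coder_1) != len(coder_2):
        raise ValueError("Both coders must code the same questions.")
    scores = [jaccard(a, b) for a, b in zip(coder_1, coder_2)]
    return sum(scores) / len(scores)


# Toy example with three questions (labels are illustrative).
coder_1 = [{"Fact"}, {"Details", "Opinion"}, {"Explanation"}]
coder_2 = [{"Fact"}, {"Details"}, {"Opinion"}]

agreement = mean_jaccard(coder_1, coder_2)
print(agreement)  # per-question scores are 1.0, 0.5, and 0.0
```

A set-based measure fits this setting because a question can carry more than one code, so partial overlap between coders should count as partial agreement.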
Beyond grounded theory, we coded the complexity of each question based on Bloom's Taxonomy. We followed a similar protocol as above for coding this dimension. The inter-annotator agreement was 0.82 (Jaccard's Similarity) for this dimension. We also coded the questions based on whether they indicate a violation of journalistic integrity (Truth and Accuracy, Independence, Fairness and Impartiality, Humanity, and Accountability), according to the Ethical Journalism Network [20]. The inter-annotator agreement was 0.92 (Jaccard's Similarity). The disagreements for both dimensions were resolved through discussion. Finally, we also conducted several automated analyses (e.g., sentiment and readability index) on the questions. We provide details of each automated analysis in the relevant sections.

RESULTS FROM THE STUDY WITH AUDIENCE
We collected 752 questions from MTurk (3 articles × 2 questions × 124 participants = 744, plus 8 extra). This section presents the results of our study. We structure the results around our RQs.

Figure 2 categories (x-axis): Fact, Details, Evidence, Close-ended, Opinion, Explanation, Other.

A Taxonomy of Questions (RQ1)
Table 2 shows different types of questions we identified from the study. We found three high-level categories: Information, Interpretation, and Others. The high-level categories are deconstructed into seven sub-categories. Among them, three categories (Factual, Opinion, and Explanation) align with the taxonomy identified by prior research on news comment sections [65,66], while others are new. Unlike previous taxonomies, we found that the categories are not mutually exclusive; a question can fall under multiple categories. For example, consider the following question: "How would you compare the level of energy and excitement for Trump at this event to that of his announcement for his presidential bid for 2016?" This question expects more details about a current event and a previous event from 2016. It also seeks the author's opinion about the energy and excitement levels in the two events. Similarly, the question "What are the Northern Triangle countries and what makes them so important?" expects both factual information about the Northern Triangle countries and an explanation of why they are so important. Overall, 83.24% (626/752) of the questions belonged to only one category, while 14.63% (110/752) and 2.13% (16/752) belonged to two and three categories, respectively.

Authors vs Chatbot (RQ2)
5.2.1 Taxonomy Distribution. Figure 2 shows different types of questions that were asked to the authors (A) and chatbot (C) by the participants. Participants asked significantly more factual questions to the chatbot compared to the authors (A = 75, C = 136). In comparison, participants asked more questions seeking long-form details (A = 174, C = 128), evidence (A = 36, C = 20), opinion (A = 51, C = 18), and explanation (A = 114, C = 83) to authors. Finally, participants asked a similar number of close-ended questions (A = 25, C = 25) and a small number of questions that did not fall into any of the categories (A = 2, C = 7) in both conditions.
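Since a question can belong to more than one category, the per-category counts add up to more than 752. The short check below uses only the numbers reported in Section 5.1 and in this subsection to confirm that the two tallies agree; the variable names and the shortened category labels are ours.

```python
# Per-category counts reported for the authors (A) and chatbot (C) conditions.
authors = {"Fact": 75, "Details": 174, "Evidence": 36, "Opinion": 51,
           "Explanation": 114, "Close-ended": 25, "Other": 2}
chatbot = {"Fact": 136, "Details": 128, "Evidence": 20, "Opinion": 18,
           "Explanation": 83, "Close-ended": 25, "Other": 7}

# Number of questions that fall under exactly one, two, or three categories.
total_questions = 752
questions_by_category_count = {1: 626, 2: 110, 3: 16}

# Every question contributes one label per category it belongs to.
labels_from_conditions = sum(authors.values()) + sum(chatbot.values())
labels_from_multiplicity = sum(k * n for k, n in questions_by_category_count.items())

# Share of questions with one, two, or three categories, in percent.
shares = {k: round(100 * n / total_questions, 2)
          for k, n in questions_by_category_count.items()}
```

Both tallies come to 894 category labels over 752 questions, so the condition-level counts are consistent with the multi-category breakdown.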

Complexity.
The distribution in taxonomy already shows that participants asked questions that require subjective thought, opinion, and knowledge more frequently to the authors than the chatbot.We further labeled the questions based on Bloom's taxonomy [37] to measure the complexity and depth of knowledge required to answer the questions.Bloom's taxonomy is a six-point hierarchical scale (1= Knowledge, 6= Evaluation) where higher levels indicate a higher knowledge, learning, and cognitive skills requirement to answer the questions than previous levels.
Figure 3a shows the distribution of the questions based on Bloom's taxonomy. Given that most questions in our data seek facts and details, it was expected that levels 1 and 2 would be the most common according to Bloom's taxonomy. However, on average, participants asked more complex questions to authors (2.58) than the chatbot (1.86). The Mann-Whitney U test showed that the difference was statistically significant (p < 0.0001). We further computed the length of the questions (on average, authors = 102.90, chatbot = 86.26). The difference was statistically significant with p < 0.007 (Figure 3b). Finally, we counted questions that fall under two or three categories in our taxonomy at the same time (Section 5.1), as that is a measure of questions having multiple facets. Out of 110 questions that had two categories or facets, readers asked 71 to the authors (64.55%). All 16 questions that belonged to three categories were asked to the authors.
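For readers unfamiliar with the test used above, the following is a minimal, standard-library sketch of a two-sided Mann-Whitney U test based on the normal approximation with a tie correction and no continuity correction. It is not the analysis code used in the study, the function name is ours, and the Bloom levels in the example are hypothetical. In practice, a statistics package would normally be used instead.

```python
import math
from collections import Counter


def mann_whitney_u(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (U statistic for x, two-sided p-value) using the normal approximation."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("Both samples must be non-empty")
    counts = Counter(list(x) + list(y))

    # Assign each distinct value the average of the ranks it occupies (handles ties).
    ranks: dict[float, float] = {}
    position = 0
    for value in sorted(counts):
        c = counts[value]
        ranks[value] = position + (c + 1) / 2
        position += c

    rank_sum_x = sum(ranks[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Variance of U under the null hypothesis, corrected for ties.
    n = n1 + n2
    tie_term = sum(c ** 3 - c for c in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u_x, 1.0  # all observations identical

    z = (u_x - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_x, p_value


# Hypothetical Bloom levels (1 to 6) for questions in each condition.
author_levels = [3, 2, 4, 2, 5, 3, 1, 4]
chatbot_levels = [1, 2, 1, 2, 1, 3, 2, 1]
u_stat, p = mann_whitney_u(author_levels, chatbot_levels)
```

A rank-based test is a natural fit here because Bloom levels are ordinal and heavily tied, so comparing means with a parametric test would rest on weaker assumptions.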

Questioning journalistic integrity.
We noticed that a few questions focused on the integrity and ethical dimensions of reporting. For instance, consider the following two questions, both indicating a violation of Truth and Accuracy [20]. The first one is directed toward the authors of an article while the second one is directed toward the chatbot:

"Why would the author imagine this has something to do with the veracity of what's being said on the air when the entire article only details allegations that the people referenced engaged in inappropriate conduct behind the scenes?"

"Are you sure that low-income people were at higher risk of depression, they might have already had depression before that. How was it tested to get that result?"

Figure 3: Complexity of the questions. (a) Rating of the questions based on Bloom's Taxonomy. Participants asked questions higher on the taxonomy more frequently to the authors than the chatbot. (b) Length of the questions. On average, questions asked to the authors were longer than those asked to the chatbot.

In total, we found that 73 (9.71%) questions raised doubts related to at least one of the five principles [20]. Out of the 73 questions, 49 (67.12%) were directed toward the authors, while 24 (32.88%) were directed toward the chatbot.

Linguistic Features.
We measured several linguistic features of the questions. Figure 4a shows the frequency of different sentiment categories (Positive, Negative, Neutral) for questions directed toward the authors and the chatbot, based on the VADER sentiment analysis model [29]. The figure suggests readers had more emotionally charged questions for authors, especially with negative connotations, than for the chatbot. This may be because readers asked more critical questions (e.g., about journalistic integrity) to the authors. Readers directed more neutral questions toward the chatbot than toward authors. This may be because most questions asked to the chatbot seek simple facts and details without any particular sentiment.
A second-person reference (e.g., you, yours) in a question means that the question is directed at the authors or chatbot. Such references often create a more personal and engaging tone, as they involve the reader of the question directly in the topic being discussed [71]. We found that participants used such references more frequently to address the authors than the chatbot (Figure 4b).
Finally, we found that questions asked to the authors have a higher readability index according to the Flesch Kincaid Grade scale (Figure 4c). Although the effect size was small (Δ = 0.89), the difference was significant with p < 0.01.
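Two of the linguistic features above are simple enough to sketch directly. The snippet below detects second-person references and computes the Flesch-Kincaid grade level, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The pronoun list, the tokenization, and the vowel-group syllable heuristic are assumptions made for illustration; the study does not describe its implementation, and dedicated readability libraries count syllables more carefully.

```python
import re

# Assumed list of second-person forms; the text only gives "you, yours, etc." as examples.
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}


def has_second_person(question: str) -> bool:
    """True if the question contains a second-person pronoun as a separate token."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return any(token in SECOND_PERSON for token in tokens)


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups and drop a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level of a text, using the heuristic syllable counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("Text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


question = "Are you sure that low-income people were at higher risk of depression?"
uses_second_person = has_second_person(question)
grade = flesch_kincaid_grade(question)
```

Questions are short, single-sentence texts, so the grade is driven mostly by word count and word length; this is worth keeping in mind when reading differences of under one grade level.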

Effect of Perceived Quality and Personal Preferences (RQ3, RQ4)
To measure how perceived quality influences questions, we adopted a structural equation model (SEM) [77], shown in Figure 5. Similar to previous research [24], we consider perceived quality to be a latent variable and derive it from user-provided ratings on readability, credibility, and journalistic expertise. Figure 9 in Appendix A shows participants' ratings across the three higher-level dimensions that capture the perceived quality of the articles. Overall, participants rated the articles highly across the three dimensions. To answer RQ4, we also added participants' preferences for news outlets in the model. We manually checked the preferred news outlets provided by the participants (Figure 7e), matched the polarity of the outlets to AllSides ratings [1], and then categorized them as left, neutral, or right-leaning news outlets. We assigned neutral to participants who read both left and right-leaning news (27). We assigned a left-leaning tag if the participant consumed mostly left-leaning news (60) and a right-leaning tag if the participant consumed mostly right-leaning news (37). Finally, we conducted multigroup analysis [23] to capture differences between the authors and chatbot conditions. We used the piecewiseSEM [40] package in R to find which paths vary across the two conditions (i.e., groups). Table 3 presents the results from the SEM analysis. The model fit measures are: Comparative Fit Index (CFI): 0.96; Goodness of Fit (GFI): 0.90; Adjusted Goodness of Fit (AGFI): 0.91; χ²: 610.59; and RMSEA: 0.07. The measures indicate a good SEM model fit. The coefficients show the polarity and strength of the relations. For example, the coefficient for Journalistic Integrity and Perceived Quality was -0.54. This means low-quality articles correlated with a high number of questions about journalistic integrity.

When interacting with the authors, we found that high-quality articles encouraged participants to seek more details and opinions about the topic. Readers were more critical of authors when the perceived quality was low, seeking more facts and evidence, and questioning journalistic integrity. In contrast, when interacting with the chatbot, readers were less critical, even if the perceived quality was low. This implies that readers may not engage critically with a chatbot even when they think the quality of the article is poor and many critical questions could be raised. We did not observe any significant impact of readers' choice of news outlets on the questions.

Figure 6: Our framework includes five different facets informing the policy. Once the policy is formalized, it needs to be communicated through the feedback loop. The informing and feedback loop can be iterated multiple times. (Text recovered from the figure: "What functionalities authors want in a chatbot?")

A FRAMEWORK FOR DESIGNING CHATBOTS
Our findings from the study with the audience indicate that readers' questioning pattern varies significantly based on the receiver of the questions (authors vs. chatbot). We noticed this difference across various dimensions, from the type, complexity, and linguistic features of questions to the user behavior towards journalistic integrity and perceived quality. The logical conclusion is that readers perceive authors and chatbots (or machines in general) very differently. Readers' lack of trust in AI and skepticism about AI are likely the reasons behind this difference [36,47]. Increased scrutiny in popular media and legal steps from different domains, such as the introduction of the "Blueprint For An AI Bill Of Rights" from the White House [5], may have further contributed to such a polarized view.
While infrequent, we found evidence that readers may project human-like feelings and behaviors onto AI (i.e., anthropomorphizing AI [17]). For instance, a considerable number of readers want the chatbot to be able to provide answers to subjective questions, albeit not as frequently as to factual questions. Readers' prior experience with chatbots such as ChatGPT and narratives about Artificial General Intelligence (AGI) may have contributed to this perception [73].
Overall, the findings from the two studies suggest that there is a high-level agreement between journalists and readers about the role of a QA chatbot in newsrooms. Both parties predominantly wanted the chatbot to answer factual questions while they wanted to engage with each other directly for subjective questions. However, as mentioned above, our findings suggest readers do not see the role of such a chatbot as a zero-sum situation, i.e., one in which all factual questions are answered by the chatbot while journalists focus on answering questions requiring subjective interpretation. A considerable number of readers wanted the chatbot to answer factual as well as subjective questions. Similarly, we also noticed that readers want authors to answer a substantial number of factual questions, despite asking most of the factual questions to the chatbot.
Given this somewhat conflicting alignment between readers and journalists, it is clear that researchers and news organizations need a systemic approach to calibrate and communicate the needs and expectations of the two stakeholders. We propose a conceptual framework with five facets to facilitate this. For ease of explanation, we assume that a newsroom or news organization is developing the QA chatbot. However, other user groups (researchers, tech industry) can use our framework as well. Finally, while our discussion below is specific to QA chatbots, we believe the framework will be useful for designing a wide range of chatbots.

Facets
We briefly describe the five main facets or components of the framework below.
Authors. The first facet of the framework is the Authors. Since authors or journalists are typically responsible for answering questions or responding to readers, it is essential that the chatbot serves authors' needs and does not overstep into authors' creative space. News organizations may conduct interviews such as ours with journalists to gather such information.
Readers. The second facet is understanding readers' needs. The central question for this facet is "What functionalities readers want in a chatbot?" It is possible to design the chatbot just by collecting requirements from the journalists, but such a chatbot may not match readers' needs and improve audience engagement, which is the primary reason behind supporting question-answering [53,56].
News Organization. The third facet is the preferences of the news organization itself. Should each article have a chatbot option? Does the chatbot need to adapt to the domain of the article (e.g., health) and types of reporting (e.g., investigative vs. beat reporting)? These questions need to be answered in this facet.
Technology. The fourth facet is understanding the design space of the currently available technology. News organizations should conduct a review such as ours in Section 2, consult internal research and developer teams, or recruit external researchers or corporate tech companies for this purpose.
AI Regulatory Policies. The final facet is seeking legal counsel to understand current AI/technology regulatory policies such as the "General Data Protection Regulation (GDPR)" [80] from the European Union (EU) or the "Blueprint For An AI Bill Of Rights" [5] from the White House in the U.S. These policies may contain rules that news organizations should abide by while designing the chatbot.

Inform Stage
The five facets should inform the chatbot policy. Researchers should balance requirements from the facets at this stage. Here, based on the findings from our studies, we demonstrate three examples of how researchers can balance requirements at this stage.
6.2.1 Developing Chatbots for Answering Factual Questions. The two arrows from the authors and readers in Figure 6 (through the two studies) have informed us that any future QA chatbot should be able to answer factual questions. However, the arrow from technology indicates that the most powerful technology available today, LLMs, can hallucinate, provide information without proper references, or even provide wrong information [84]. Thus, newsrooms should consider how they can support factual QA during this stage. For instance, instead of relying solely on LLMs, researchers can consider a hybrid model that uses an expert user base for this purpose. Xiao et al. recently showed that coupling AI with an expert user base can support the health information needs of general users [89]. Researchers from Microsoft Research recently proposed LLM-Augmenter [61], a system that can provide answers to factual questions by using external knowledge bases. One good knowledge base for news articles could be the background research and references collected by the authors for writing the articles [12,13].
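To make the knowledge-base idea concrete, the sketch below grounds an answer in an author's reference notes and defers to a human expert when no note matches well enough. It is an illustrative toy, not LLM-Augmenter or any system evaluated in this paper: the `Passage` type, the word-overlap retrieval, the `min_overlap` threshold, and the `DEFER_TO_EXPERT` marker are all assumptions, and a real system would pass the retrieved passage to an LLM rather than return it verbatim.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Small assumed stopword list so that function words do not count as evidence.
STOPWORDS = {"what", "who", "are", "is", "the", "a", "of", "and", "to", "in"}
DEFER_TO_EXPERT = "DEFER_TO_EXPERT"  # hypothetical marker for the human fallback


@dataclass
class Passage:
    source: str  # where the author found the information
    text: str    # the note or excerpt itself


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, knowledge_base: list[Passage],
             min_overlap: int = 2) -> Optional[Passage]:
    """Return the passage sharing the most content words with the question, if any."""
    query = tokenize(question) - STOPWORDS
    best, best_score = None, 0
    for passage in knowledge_base:
        score = len(query & tokenize(passage.text))
        if score > best_score:
            best, best_score = passage, score
    return best if best_score >= min_overlap else None


def answer_or_defer(question: str, knowledge_base: list[Passage]) -> str:
    """Answer with a cited passage, or hand the question to a human expert."""
    passage = retrieve(question, knowledge_base)
    if passage is None:
        return DEFER_TO_EXPERT
    return f"{passage.text} [source: {passage.source}]"


# Illustrative knowledge base with a single note; the source label is hypothetical.
notes = [Passage(source="reporter notes, reference 1",
                 text="The Northern Triangle refers to El Salvador, Guatemala, and Honduras.")]

grounded = answer_or_defer("What are the Northern Triangle countries?", notes)
deferred = answer_or_defer("Who funded the study?", notes)
```

The design choice worth noting is the explicit fallback: when the knowledge base has nothing relevant, the question goes to a person instead of being answered from the model's own parametric memory.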

6.2.2 Routing Factual and Subjective Questions. Our findings indicate that researchers will need to devise a routing policy for factual and subjective questions in the future. One practical solution would be to adopt the predominant view between journalists and readers: the chatbot will only answer factual questions while re-routing all questions seeking opinions and explanations to the authors. While this policy may not cater to the requirements of every reader, it will likely guarantee legal compliance for the news organization. For example, the AI regulatory policy facet in the framework can inform the newsroom that asking a chatbot to provide human-like opinion may violate legislative principles penned in the "Blueprint For An AI Bill Of Rights" [5,17].
Newsrooms can adopt machine learning approaches to achieve this routing mechanism. For example, newsrooms can use soft prompting [62] with a pre-trained LLM to separate factual questions from subjective questions. Alternatively, newsrooms can collect a large corpus of questions similar to ours and then fine-tune a pre-trained LLM for this task. Finally, our findings suggest readers may expect answers to some factual questions directly from the authors, instead of the chatbot. To facilitate that, newsrooms can develop a scoring model that measures the severity of a factual question and, based on that score, routes the question to either the authors or the chatbot.
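The routing policy described above can be sketched as a two-step decision: subjective questions go to the authors, and factual questions go to the authors only when a severity score crosses a threshold. The snippet below is a hypothetical stand-in in which keyword cues replace the LLM classifier and the scoring model; the cue lists, the threshold, and the reading of "severity" as the number of integrity-related cues are all assumptions made for illustration.

```python
import re
from enum import Enum


class Route(Enum):
    CHATBOT = "chatbot"
    AUTHORS = "authors"


# Assumed cue phrases; a deployed system would use a trained or prompted classifier.
SUBJECTIVE_CUES = ("why", "opinion", "do you think", "would you", "should")
INTEGRITY_CUES = ("are you sure", "bias", "evidence", "accurate", "how was it tested")


def contains_cue(question: str, cue: str) -> bool:
    """Match a cue as a whole word or phrase, ignoring case."""
    return re.search(r"\b" + re.escape(cue) + r"\b", question.lower()) is not None


def severity(question: str) -> int:
    """Toy severity score: number of integrity-related cues in the question."""
    return sum(contains_cue(question, cue) for cue in INTEGRITY_CUES)


def route(question: str, severity_threshold: int = 1) -> Route:
    """Send subjective or high-severity questions to the authors, the rest to the chatbot."""
    if any(contains_cue(question, cue) for cue in SUBJECTIVE_CUES):
        return Route.AUTHORS
    if severity(question) >= severity_threshold:
        return Route.AUTHORS
    return Route.CHATBOT


factual = route("What are the Northern Triangle countries?")
subjective = route("How would you compare the level of energy and excitement "
                   "for Trump at this event to that of his announcement?")
critical = route("Are you sure that low-income people were at higher risk of depression?")
```

Keeping the severity threshold as an explicit parameter lets a newsroom tune how many factual questions reach the authors, which is the trade-off the inform stage is meant to settle.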

6.2.3 Collecting Critical Feedback from Readers. Our findings suggest that over-reliance on chatbots may deprive newsrooms of critical feedback from readers, such as the underlying causes for the low perceived quality of an article or what raised questions about journalistic integrity in readers' minds. According to our results, readers may not raise questions about journalistic integrity to a chatbot as frequently as they would to the authors, even if they think the quality of the article is low. Several previous works have reported similar negative impacts of chatbots in user experience design [43,48]. Considering this fact, researchers may recommend that news organizations plan alternative methods to collect critical feedback from readers. As mentioned by journalists in the interviews, many organizations now organize audience gathering and feedback sessions, which could be an excellent alternative to achieve this.

Feedback Stage
The final part of our framework is the feedback loop. All stakeholders from the five facets should know the policy determined after the inform stage. For example, based on the discussion in Section 6.2.2, if the news organization decides that the chatbot will not provide opinions and explanations, that decision should be conveyed to the readers. To encourage readers to provide critical feedback, news organizations can inform readers that questions about journalistic integrity will be immediately rerouted to authors. News organizations should also advertise other mediums for providing feedback (e.g., member gatherings) to readers. Finally, the news organization can iterate on the two stages to finalize the chatbot policy.

DISCUSSION, LIMITATION, AND FUTURE WORK
In this section, we discuss design implications, limitations, and future improvements of our work.

Design Implication
7.1.1 Future of Audience Engagement. As we look towards the future, the landscape of audience engagement is poised for a transformative shift. With the integration of LLMs, readers could have interactive platforms catering to their immediate factual queries while valuing human authors for in-depth analyses and opinions. This symbiotic relationship between AI and human expertise can redefine news consumption, making it more interactive, personalized, and engaging. We have presented several design prospects and implications for future AI and HCI research in Section 6.
7.1.2 Increase Trust and Accountability. News organizations are grappling with a crisis of trust among their audiences. Prior research has demonstrated that presenting information alongside explanations of how and why news stories were developed can significantly enhance the trust readers place in news outlets [52]. This study adopts a novel approach by directly engaging with readers and inviting them to share their inquiries about the news. Successfully answering these questions would satisfy readers' curiosity about the news production process (e.g., by elaborating on the rationale behind including certain sources while omitting others) and would help increase trust and accountability. For instance, the news industry can offer LLM-based chatbot services through an API to third-party applications such as Google Home, Alexa, and Siri. Also, readers' questions can be utilized, with permission, for data mining and extracting insights for better news recommendations and advertisements. Moreover, each newsroom has its own archive of historical news. Currently, the archives are mostly reserved for record purposes without being involved in any revenue process. LLMs can utilize this valuable news archive to answer questions and bring this resource into a revenue process.

Limitations and Future Work
Although we interviewed journalists and conducted an in-depth analysis of the readers' questions, our analysis is not free from limitations. For example, while we engaged a significant number of readers in the study, a small-scale follow-up interview study with readers might reveal readers' perspectives about chatbots more clearly. We consider such a study as our immediate next step.
Our dataset contains several other facets that we have not explored yet. For example, we have not looked into the effects of news topics. We also have not looked into types of news such as investigative or breaking news. These analyses are left for future work as they were less relevant to our current research questions.
The chatbot in our study did not feature any response or conversation, which likely would influence users' perception of AI and the chatbot [45].Our future work will focus on designing an actual chatbot, informed by the findings of this paper, and then conducting a study to understand readers' questioning patterns while using the chatbot.
The study might also involve hidden confounding variables.For instance, there are two differences between the author and chatbot conditions: human vs AI and authors vs non-authors.We did not study these facets separately as our goal was to understand how readers direct questions towards authors and chatbot, not how the questions differ between human vs AI or authors vs non-authors.In other words, we were interested in the complete constructs embodied by authors and chatbot, not their individual components.Nevertheless, we acknowledge that studying these facets could provide more insights into readers' questioning patterns.
Finally, we have not conducted any experiment to evaluate how different LLMs such as GPT-4 and LLaMa might perform on our question set.We left this for future work as we believe that this itself is a separate research question.We aim to conduct the experiment with various QA systems and tasks: Closed domain vs. Open Domain [38], answerable vs. unanswerable [32], and so on.

CONCLUSION
The news industry is exploring how to integrate Large Language Models (LLMs) and Generative AI into news production, distribution, audience engagement, and other related processes. In this paper, we took the first step towards designing an LLM-powered chatbot by understanding how journalists and readers want to use such a chatbot. To achieve that, we interviewed journalists and conducted controlled experiments with readers. From interviewing journalists, we observe that journalists prefer a human-in-the-loop process rather than delegating all questions to a chatbot, particularly for opinionated questions. We then examined what questions readers ask the authors and the chatbot and whether there are any differences. Our findings show that the questioning behavior of readers varies depending on who is receiving the question: the author or the chatbot. One of the major insights from our results is that readers ask factual questions to chatbots more often than to authors. Another significant observation is that readers direct questions about journalistic integrity to authors more often than to the chatbot. Our findings have implications for designing a chatbot for audience engagement, specifically for evaluating how correctly LLMs can answer the kinds of questions that readers normally ask. We argue that a careful and systematic approach to introducing LLM technology to a trust-sensitive news industry would help minimize risks and ethical concerns.

Figure 1: Web interface for the study. Participants used this interface to read three anonymous articles and ask questions to the authors or a chatbot. The screenshot shows the interface for an article. (a) A PDF reader for reading the article. (b) Instruction for completing the task. (c) Input boxes for writing the questions. (d) Buttons to write a new question (optional) and move forward (enabled only when a participant provides two questions).

(2) Read an article and ask questions (x3 articles): Participants completed the following four tasks for each article in the study. (a) Read the article: Participants read the article in a PDF reader. (b) Pass the reading comprehension test: As a quality control step, we asked participants to answer two multiple-choice questions relevant to the article. Participants had access to the PDF in this step too. The questions were designed to test participants' comprehension of the article and to ensure that they did not move to the next step without reading the article. Participants had 4 attempts to answer the questions correctly for the article. In the case of 4 wrong answers, participants would be logged out of the session and considered unsuitable for the task. No participants in the study were disqualified through this method. (c) Ask two or more questions relevant to the article: We asked participants to provide at least two questions relevant to the article. Participants could ask extra questions if they wanted. We prompted participants to ask about clarifications, follow-up information, or any other questions relevant to the article. Depending on the condition (A or C), we presented the scenario of asking the questions to the authors or a chatbot. Participants had access to the PDF in this step. We alerted participants again that the research team would manually validate each question and that any suspected fraud or abuse would result in forfeiture of compensation and removal of data.

(3) Complete demographic survey: Participants completed a demographic survey and received a unique code to indicate the completion of the study in MTurk.

Figure 2: Number of questions asked to the authors and chatbot.

Figure 4: Linguistic features of the questions. (a) Sentiment of the questions based on the VADER model [29]. (b) Second person (you, yours, etc.) reference in the questions. (c) Readability Index (Flesch Kincaid Grade).

Figure 5: Structural Equation Model (SEM) for measuring the effect of perceived quality and news outlet preference. Here, perceived quality is a latent variable, derived from readability, credibility, and expertise ratings provided by the participants. We determined participants' preference for news consumption (left, neutral, right) from their reading preference (Figure 7e).

Figure 7: Participant demographics. Our participants (MTurk workers) varied in terms of a) gender, b) age, c) level of education, d) how frequently they read news online, and e) preferred news sources (top 12).

Figure 8: Distribution of news articles. We randomly assigned 15 articles to the participants in the study. Overall, the distribution is uniform in nature.

Figure 9: Perceived Quality of the news articles. Participants rated each article on a scale of 1 to 5 across three dimensions: Readability, Credibility, and Journalistic Expertise [24]. Y-axis shows counts for each category. Higher values indicate better ratings.

Table 1: Participant demographics for the formative study.

Table 3: Results of the SEM model. Each cell represents the coefficient for the respective relation. Same coefficient values between the author and chatbot conditions indicate globally constrained paths in the SEM model (i.e., no differences between the groups) based on the multigroup analysis. p-value significance: * p < 0.05, ** p < 0.01, *** p < 0.001.
7.1.3 New Revenue Stream. The news industry, particularly local news, is suffering from financial troubles [26, 34]. A new LLM-based audience engagement could open a new revenue stream.