Paper Plain: Making Medical Research Papers Approachable to Healthcare Consumers with Natural Language Processing

When seeking information not covered in patient-friendly documents, like medical pamphlets, healthcare consumers may turn to the research literature. Reading medical papers, however, can be a challenging experience. To improve access to medical papers, we introduce a novel interactive interface, Paper Plain, with four features powered by natural language processing: definitions of unfamiliar terms, in-situ plain language section summaries, a collection of key questions that guide readers to answering passages, and plain language summaries of the answering passages. We evaluate Paper Plain, finding that participants who use Paper Plain have an easier time reading and understanding research papers without a loss in paper comprehension compared to those who use a typical PDF reader. Altogether, the study results suggest that guiding readers to relevant passages and providing plain language summaries, or "gists," alongside the original paper content can make reading medical papers easier and give readers more confidence to approach these papers.


INTRODUCTION
A strong public health system depends on the timely dissemination of medical findings to those who need them. Most often, healthcare consumers stay apprised of medical findings through communication with experts: conversation with their doctors, printed materials like pamphlets, and online resources like MedlinePlus or hospital websites [27,55,109].
However, these resources do not cover all medical conditions and treatments [9,90], especially those which are the focus of emerging research [17,86]. In these cases, healthcare consumers may have no choice but to go to the source of medical knowledge: the research literature [30,35,81,100,111]. In the words of one patient:

I had been studying CLL [Chronic Lymphocytic Leukemia] through free access articles on PubMed and Google Scholar. . . Reading these NIH papers enabled me to have an intelligent dialogue with a CLL specialist, ultimately leading me to the selection of a clinical trial.

Fig. 1. Paper Plain helps healthcare consumers consult medical research papers by making their texts more approachable. Shown is the Paper Plain interface and the assistive features it provides to readers. When a paper is opened in Paper Plain, a side pane opens with a reading guide (1), comprising curated key questions a reader might ask, previews of generated plain language answers, and pointers to where in the paper the reader can find more details. When a reader clicks a question (2), the paper jumps to the passage that provides that answer, accompanied by the plain language answer (answer gist) (3). Readers can also access in-situ plain language summaries for every section of the paper (section gists) by clicking labels next to section headers (4), and definitions of medical jargon sourced from external references by clicking those terms (5).
However, it is one thing for healthcare consumers to access the literature, and another thing entirely for them to comfortably navigate research papers. Healthcare consumers report that, unsurprisingly, medical papers are difficult to read [30,80]. This is in part because readers are overwhelmed by the amount of unfamiliar jargon. It is also because healthcare consumers are unaccustomed to the norms of how research is conducted and how reports of it are structured [14,30].
The result is that reading medical papers can be an experience that is challenging and at times demoralizing.
Given these difficulties, is it helpful and effective for healthcare consumers to read medical research papers? We believe that interacting with these papers gives patients an awareness of cutting edge medical findings and the complexities of underlying studies, even if they do not fully comprehend them. These papers also constitute the literature patients wish to share with their healthcare providers, should they discover information germane to treatment options [30,81,111].
In this paper, we ask how interactive information interfaces can make the research literature approachable to the non-expert healthcare consumers who need it (whom we refer to as "readers" in this paper). In particular, we study how the paper itself can be imbued with new affordances to help readers navigate and evaluate its contents. As the human-computer interaction literature shows, reading interfaces can offer novel affordances to assist readers in navigating documents in new ways [6], looking up the meaning of unfamiliar terminology [44], summarizing sections as they read [12], and searching for answers to their questions [110]. Drawing on this work as inspiration, we ask what combination of affordances would be necessary to help bridge the often enormous gap between a reader's current knowledge of biomedical research and the contents of a paper. Consider, for instance, this sentence from a recent article about systemic lupus erythematosus, linked to from a patient-facing MedlinePlus page [98]:

The most salient events include an impaired apoptosis of dying cells, a type I interferon (IFN) signature, the uncontrolled activation of T and B lymphocytes and the production of autoantibodies mainly directed against nucleic acids or ribonucleoproteins (RNP).
This sentence is difficult not only because it contains technical jargon, but also because in combination these words form a sentence so foreign that a reader has little chance of understanding it without learning a considerable amount of background knowledge from elsewhere. A reader not only needs to know what "autoantibodies" and "ribonucleoproteins" mean, but also how production of one implies condition progression and risks to their health. A medical paper contains not one but hundreds of such sentences, making it exceedingly difficult for readers to find, let alone understand, information important to them. We believe that future interactive aids will need to go beyond their typical capabilities to instead help readers understand where to find information of interest in a paper according to the language they already know.
The key insight of this paper is that medical papers can be made more approachable by judiciously incorporating plain language summaries to supplement original paper content. A reader can engage with the original text through plain language summaries (which we refer to as "gists") that contain simplified sentences and reduced jargon and are presented alongside passages in the paper. The reader can approach any content in the paper by first inspecting its gist, only committing attention to a dense passage after learning if it is likely to be relevant. In this way, the reader has the support to engage meaningfully with the original paper text: skipping passages of little relevance and spending time reading those of consequence.
This paper begins with a formative observational study of 12 non-expert readers to identify barriers in reading medical research papers. We observed that, in addition to the expected pervasive difficulties of understanding passages dense with unknown terminology, readers struggled to know what parts of a paper to read and often spent considerable effort making sense of sections with limited usefulness to them. These findings suggest that reading medical papers is uniquely challenging for our envisioned readers due to their lack of domain knowledge and understanding of how medical research is communicated. An augmented reading interface for these readers will need to go beyond the capabilities of prior interfaces (which define terminology [44], provide summaries [42], or allow readers to ask questions of a paper [110]) and provide a reading experience that guides readers to useful information and helps them understand this information in the context of the paper.
To improve access to medical papers, we develop a novel interactive system, Paper Plain, through an iterative design process. The system is designed to make medical papers accessible with four features (illustrated in Figure 1) that combine to provide support at multiple levels of granularity (e.g., term, paragraph, section) and throughout the reading process. Paper Plain helps a reader find information relevant to them in the paper by providing a list of key questions about medical studies and a preview of plain language answers ("key question index"). When a question is clicked, it takes readers to the paragraphs that answer the question along with full paired plain language answers ("answer gists"). Paper Plain helps a reader understand the essence of jargon-dense passages by allowing them to access in-situ plain language summaries of any section ("section gists"). Finally, Paper Plain provides assistance for understanding unfamiliar terms by making their definitions available. The first three of these features ("key question index," "answer gists," and "section gists") are novel in the context of reading applications for research papers, while the fourth is a known feature, though it is necessary to provide holistic reading support. The design of the system is described at length in §4.
We envision Paper Plain as a system that can one day be enabled for any medical research paper. The system draws on active research in natural language processing for biomedical question answering [108], plain language generation [42], and term identification [79]. One limitation of current text generation capabilities is the risk of generating factually incorrect or inconsistent text, often referred to as "hallucinations" [71]. Deploying any system in a medical context will require algorithmic advances or human oversight to detect factually incorrect generations [56,71]. In this project we assume such advances are possible (see [38,60] for examples of current advances in this space) and provide some manual filtering of incorrect, incoherent, or copied text (i.e., selecting the most fluent and correct generation out of five). This allows us to focus on developing interactions that would enable readers to meaningfully engage with medical research papers. §5 describes the implementation of Paper Plain and highlights the adaptations needed to make text generation model outputs useful for readers, while §8.3 discusses in more depth the limitations of text generation models for our application. While to date our implementation relies on some human curation, this project as a whole indicates the potential for reading experiences like Paper Plain to be deployed at scale over the scientific literature.
To assess how Paper Plain supports the reading experience, we conducted a within-participant usability study with 24 participants, who read papers with variants of Paper Plain or a typical PDF reader during a timed reading task.
The study showed that Paper Plain lowered participants' self-reported difficulty in reading the paper and increased their confidence that they had found all of the information of interest to them, without any observable degradation in paper comprehension. The clear favorite features were the key question index and answer gists. Participants also used, and appreciated, in-situ section gists and term definitions, though they tended not to use them when the aforementioned key question-based navigation was available. Altogether, this study suggests that reading interfaces that provide guidance and plain language summaries can make medical papers more approachable and offer readers more confidence than they would otherwise have when reading medical research papers.
In summary, this paper contributes:
(1) A characterization of the barriers readers face when they approach medical research papers. These findings support and deepen prior work on barriers in medical information [30,80,95] by illustrating the barriers healthcare consumers face in medical papers, such as uncertainty about where to find relevant information in a paper and an overabundance of jargon (§3).
(2) Paper Plain, an interactive reading interface for research papers that integrates existing affordances like term definition tooltips with novel affordances like in-situ plain language summaries of paper sections and a collection of key questions that guide readers to answering passages in the paper with paired plain language answers (§4).
(3) Evidence from our usability study that these new affordances helped readers quickly find places in a paper that were informative to them. Participants using Paper Plain's key question index and answer gists had a significantly easier time reading research papers and were more confident they got all relevant information from the papers while retaining a similar level of paper comprehension compared to the typical PDF reader baseline (§7).

Healthcare consumers reading medical research
Research on consumer health information seeking suggests that trustworthy online health information can empower healthcare consumers, improve clinician-patient interactions, and increase adherence to medical recommendations [16,27,49,99]. Tan and Goonawardene [99] reviewed consumer health seeking behavior and perceptions on using internet information in consultation with clinicians; they found that people did not feel like internet information adversely affected consultations, and that it helped them feel more confident in the consultations and in following clinicians' suggestions. Cartright et al. [22] distinguished two types of health information searching behaviors: evidence-based, which focused on details of symptoms, and hypothesis-based, which focused on understanding a particular diagnosis.
In a related setting, Cocco et al. [27] studied how people search for health information while in an emergency room, showing that many searched for information online on trusted sites like university or hospital websites. Kivits [55] explored why healthcare consumers search the internet for medical information, finding that the motivations for searching included helping oneself and filling in missing information from their clinician. Choudhury et al. [24] studied health searching and sharing behavior on search engines and social media, finding that search engines are often used for serious medical conditions, but social media can be used to share information about more benign symptoms or conditions. Work has also studied how medically concerned users search for health information online [84] and how online searching can lead to real-world healthcare utilization [107].
While the internet is a good source of consumer health information, there are also many barriers to interacting with this information [95,96]. White and Awadallah [106] analyzed top search results for common health information queries and found that top search results returned for health interventions skewed positively, meaning that more search results said that an intervention will help a condition than suggested by medical evidence. Sommerhalder et al. [95] found that healthcare consumers searching for information online also struggled with information overload.
Information overload can be caused by searches returning unrelated results (e.g., searching a particular symptom and getting results about different diagnoses or home remedies), complex text, or different trusted sites providing contradictory guidance [9,50,95,96]. Most people could not resolve these issues themselves, instead needing to discuss the information during consultations with their clinicians [95].
While many people start out on consumer-facing sites, medical literature is an important source of highly specific, up-to-date information for them [111]. In 2005, the NIH established an open access policy in part to encourage "individuals [to] become educated consumers about their healthcare and related research, and to consult with healthcare professionals for specific guidance" [81]. Subsequent research has shown the public benefit of this open access policy, such as improved access to new research findings for healthcare workers and consumers [100]. While the traditional debate over open-access journals has focused on wider dissemination within research communities, there is an increasing recognition that public stakeholders, including advocacy groups and healthcare consumers, can effectively make use of primary medical research findings [30,35]. Indeed, there is a movement in the medical community to involve patients more in the research process, including understanding lab reports [78], reviewing research papers [87], and leading research efforts [72,76].
At the same time, medical research, and scientific research more broadly, present unique barriers to readers without research expertise [74]. Britt et al. [14] argued that science literacy is the ability to evaluate scientific texts effectively, but that this is challenging due to complex arguments and unfamiliar text structures. Bromme and Goldman [15] highlighted hurdles that the general public faces when reading scientific information, including the ability to determine what is relevant and lack of domain expertise. Day et al. [30] outlined additional barriers specific to searching through medical research, such as lack of adequate scientific literacy, the potential to draw inaccurate conclusions from the findings, and fraudulent journals without sufficient peer review. Nunn and Pinfield [80] interviewed healthcare consumers on reasons for accessing medical literature and their response to lay summaries written for medical papers. They found that readers appreciated the lay summaries, but often wanted to read the article themselves anyway. At the same time, other work has found that lay summaries help improve reader comprehension compared to journal abstracts [52].
Our project illustrates how interactive reading interfaces can make medical research papers accessible to healthcare consumers through a novel interactive system, Paper Plain.

Interactive reading interfaces
Paper Plain draws inspiration from prior affordances in interactive reading systems that have used term definitions [44], question answering [23,110], and guided reading [34] to support reading medical text [20,69], dialogue [63], news [12], and search results [28]. Inquire Biology [23] is a biology textbook augmented with artificial intelligence (AI) features to support student learning. The textbook allows students to view concept definitions and ask open-ended questions about information in the textbook. If students are unsure of what questions to ask, the textbook also recommends possible questions based on highlighted passages. In another resource for students, Dzara and Frey-Vogel [34] introduced a new method for conducting reading groups that required no prior reading preparation through developing questions about a paper's methodology and findings. They found that these interactive discussions can help pediatric residents analyze medical papers effectively. Also in the clinical context, UpToDate [4] provides expert-written summaries of current research for healthcare providers.
In the context of reading research papers, Head et al. [44] introduced ScholarPhi, a PDF reader that surfaces position-aware definitions for terms defined in a paper (nonce words) and features for revealing these terms across a paper. In a usability study, researchers were able to read papers more easily using the interface. Zhao and Lee [110] introduced "Talk to Papers," a natural language question answering system for exploring research papers. "Talk to Papers" allows users to query papers with natural language questions and provides passages where answers are taken from. Other work has explored tools for adaptive summarization in news articles [12], evaluating research literature [62,69], navigating concepts within a paper [6,48], and providing reading guidance in textbooks [20,105]. There are also interactive systems for collaborative reading of research papers, such as Fermat's Library [1], which provides community annotations on popular research papers, and Hypothes.is [2], which allows users to annotate and share annotations on any webpage.
In contrast to previous reading interfaces for research papers that focus on clinicians, researchers, or students, this project focuses on interactions to make papers understandable to healthcare consumers. There are key ways in which previous designs would not support these envisioned readers. Medical research text is so laden with jargon that a reader has to invest considerable effort learning the background knowledge to understand it. Previous interfaces that assume readers know what important questions to ask [110], where to look for their answers [23], or how to make sense of definitions of terms within a paper [44,48] can make reading exceedingly difficult for these readers. Paper Plain goes beyond the typical capabilities of interactive readers to instead help readers understand where to find information of interest in a paper according to the language they already know. To do this, the system incorporates plain language alongside original paper content.

AI for scientific text processing
Paper Plain leverages recent gains in natural language processing (NLP) for making medical information more understandable to the public, specifically healthcare consumers [31,104]. The research areas most salient to Paper Plain are automated term definition or replacement [102], plain language summarization [32], and consumer biomedical question answering [5]. In addition, we discuss here writing tools to encourage plain language [41], as the underlying techniques for powering such systems are similar to those leveraged by Paper Plain (e.g., generating plain language).
Paper Plain integrates these advancements in its implementation to show the promise of such methods in supporting healthcare consumers in a user-facing interface and to indicate the potential of scaling this reading experience across the scientific literature.
Veyseh et al. [102] presented a web-based system for acronym identification that works in the biomedical, scientific, and general domains, and Murthy et al. [75] explored how to define scientific terminology with terms recognizable to a specified reader. Devaraj et al. [32] introduced a new dataset of healthcare consumer summaries for clinical topics and a trained model for simplifying medical text. Guo et al. [42] used plain language summaries to train a model for generating summaries of biomedical text. Abacha and Demner-Fushman [5] collected a dataset of consumer health questions from NIH websites and developed methods for automated answering of these questions. Mrini et al. [73] introduced methods to improve answer recall for long and complex consumer medical questions. Gero et al. [41] used generation models to help researchers author "Tweetorials," threaded tweets meant to inform a general audience about a scientific concept on Twitter [13]. Other work has introduced writing tools to help journalists [53] or clinicians write using simpler terms [61,85,101], simplify text by replacing jargon with more common terms [11,59,82], simplify e-prescription and medical instructions [21,64], and automatically classify the questions that healthcare consumers ask [89]. Paper Plain draws on this active research to improve access to medical papers. §5 discusses in depth the adaptations needed to make this research provide useful output for healthcare consumers reading medical research papers.

OBSERVATIONS OF NON-EXPERT READERS
Prior work on reader barriers has focused on consumer health information [95], scientific research in other domains [74], students [93], or searching through medical literature [30], but it is unclear how these barriers manifest for non-experts reading medical research papers. To gather more direct and comprehensive evidence of barriers for this population, we conducted a think-aloud reading study.
3.0.1 Participants & recruiting. We wanted to observe the barriers faced by healthcare consumers when reading medical research. However, the timing of these reading episodes was hard to predict, making it difficult to observe authentic reading experiences. As a compromise, we developed scenarios based on interviews with four healthcare consumers who had prior experience reading medical research and two healthcare providers who had discussed findings from medical papers with their patients. Healthcare consumers and providers were recruited through our personal and professional networks and by referral from other interviewees. More details on these interviews are in Appendix A. We then recruited participants without medical or research expertise to walk through these scenarios. We provided these participants with a primer about a medical condition and allowed them considerable agency in how they approached the reading task.
We recruited participants who had no experience in the medical profession or in undertaking research via Upwork, a crowd-work site for hiring freelancers. We listed our job under both "Editing & Proofreading" and "Customer Research" (i.e., workers partaking in user surveys) to attract a broad sample of workers with varied degrees of reading and writing experience. All participants were paid US$15 for the hour-long study. We discuss possible limitations to this recruiting strategy and the presence of a paid timed task in §8.4.
A total of 12 participants completed the study (T1-12). Of these participants, 11 had completed college and 5 had completed professional or graduate school. Eleven participants had taken 3 or fewer STEM courses since high school.
3.0.2 Procedure. In the study, participants were given a scenario about a fictional diagnosis representative of common but serious medical conditions (e.g., a herniated disc) with a goal for reading medical papers (e.g., finding new treatments). To ensure participants were equipped with some prior knowledge before approaching papers, they first read a consumer health webpage (MedlinePlus) about the medical condition. This MedlinePlus step was meant to more closely approximate realistic circumstances, in which a participant would have received some information from their doctor about their diagnosis.
We designed the scenarios such that participants would benefit from the additional information found in research papers. To uncover a comprehensive set of barriers, we created four scenarios varied across the following dimensions: diagnosis, demographics (i.e., common or uncommon for a diagnosis), relationship to patient (i.e., patient vs. caretaker), and motivation.
There were two possible diagnoses for each scenario: a herniated disc or systemic lupus erythematosus (SLE, also called lupus). These diagnoses were selected because they are relatively common and represent serious, long-term issues for a patient. Motivations were: learning background-specific information, becoming aware of emerging treatment options, and comparing treatment options. These scenarios were validated as realistic by a healthcare researcher familiar with consumer health. For more information on these motivations, see Appendix A.
Participants were randomly assigned to one of the four scenarios. Each scenario was assigned to the same number of participants. After reading a description of the scenario, participants read the MedlinePlus page on the diagnosis, then browsed a list of 11 research articles related to the diagnosis. To make these papers representative of the sort healthcare consumers would find in their own searches, we selected only from PubMed articles linked from the MedlinePlus page. We selected papers that were 1) review articles or randomized controlled trials and 2) relevant to the scenarios (e.g., covering possible new treatments). Papers varied in how relevant they were for a scenario (e.g., some papers covered treatments not clinically available), though all papers had some relevance to the scenarios. While in real-world health information seeking scenarios, readers would undoubtedly come across irrelevant information [95], the study's focus was on barriers in reading papers rather than searching through papers and determining their relevance. Participants chose which papers to consult, which permitted us to see how the contents of a paper affected a participant's choice to read it deeply. Most participants had enough time to read one or two papers (all were asked to read at least one).
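As an illustration only, the balanced random assignment described above (12 participants, four scenarios, equal group sizes) could be sketched as follows. This is not the procedure the authors report using; the function name `assign_balanced`, the scenario labels `S1` to `S4`, and the seed are all hypothetical.

```python
import random
from collections import Counter


def assign_balanced(participants, scenarios, seed=None):
    """Randomly assign participants to scenarios with equal group sizes."""
    if len(participants) % len(scenarios) != 0:
        raise ValueError("participants must divide evenly among scenarios")
    per_scenario = len(participants) // len(scenarios)
    # Build one slot per participant: each scenario label repeated equally often.
    slots = [s for s in scenarios for _ in range(per_scenario)]
    # Shuffle the slots so that who lands in which scenario is random.
    rng = random.Random(seed)
    rng.shuffle(slots)
    return dict(zip(participants, slots))


# Hypothetical labels: participant IDs T1-T12 and placeholder scenario names.
participants = [f"T{i}" for i in range(1, 13)]
scenarios = ["S1", "S2", "S3", "S4"]
assignment = assign_balanced(participants, scenarios, seed=0)
counts = Counter(assignment.values())
```

Shuffling a pre-built list of slots, rather than drawing a scenario independently for each participant, is what guarantees that every scenario ends up with the same number of participants.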
Participants were provided a total of 40 minutes of reading time, split between the MedlinePlus summary page and the papers they chose to read. Participants thought aloud or wrote down any barriers they had while reading. They were prompted for this information every 5 minutes if they had not already volunteered it. The researcher present would sometimes ask participants to elaborate on these barriers. Following the reading, the researcher interviewed participants on what was difficult about reading the research articles and what tools they wish they had to help. After the interview, participants filled out a questionnaire to report their medical literacy and prior research experience.
One author conducted a thematic analysis of the think-aloud and questionnaire data to identify barriers to reading.
In multiple meetings, this author discussed the themes and described evidence with the other authors, refining these themes with their input. In addition, the other authors confirmed themes by observing recordings of the sessions. We grouped these themes into a set of core challenges that, if resolved, would help readers make better sense of medical research papers.

[...] medical papers, confirming the presence of these difficulties and highlighting concrete instances of difficulties that inform opportunities for design.
Unfamiliar terminology. Nearly all (T1-3, 5-8, 10-12) participants mentioned struggling to make sense of the information in the papers because of medical terminology or acronyms that they did not know. These terms ranged from only appearing in some areas of biomedical research (e.g., "therapeutic peptides") to commonly used medical terms (e.g., "comorbidities," "meta-analysis"). The two participants who did not mention struggling with specific medical jargon (T4 & 9) often skimmed over these terms or were able to infer them from context. Interestingly, while others reported medical terminology as a barrier, they still made some sense of an article without knowing terms by making assumptions about the terms' meanings. At the same time, some terms had meanings that were integral to understanding an article.
Incorrect assumptions about these terms could mean misunderstanding the article (T6 & 10). For example, T10 did not know that "in vitro" referred to pre-clinical, non-human studies. They only realized this after reading the majority of the article, which dramatically changed their perception of its usefulness (i.e., that none of the studied drugs were in clinical trials).
While terminology is a common barrier in scholarly communication [70], past interactions to address it present additional issues for our reading context. Past work has addressed researchers not knowing terms in a paper by providing definitions of terms based on earlier references in a paper [44]. There are two issues with such an approach for our reading context: (1) the sheer number of terms could make it difficult for a reader to know which are important and (2) there is no guarantee a reader in our context would understand references drawn from the paper, considering that almost all text in medical papers has technical jargon. These issues suggest that a different approach to defining terminology for our envisioned readers is needed.
Dense text. While participants could ignore individual terms, as T4 & 9 did, sentences were so filled with these terms, and paragraphs were so filled with these sentences, that participants were overwhelmed by passages of dense text (T1-8, 11-12). This dense text included unfamiliar terminology, but also statistics and complex wording or arguments. Because of the amount of text in the articles and the high cost of reading any of it, participants were quickly overwhelmed.
As T8 put it, "Honestly reading that stuff it was. . .overwhelming just how much terminology I didn't know to start off with. . .It's not like I didn't understand it at all, it was just hard to follow because I had to keep going back, like 'Oh what does that acronym mean?'" T8 was reading a section containing multiple acronyms defined earlier in the paper, including 'QoL,' 'DORIS remission,' and 'SLEDAI.' The beginning of one paragraph reads as such: "In some cases, modifiable causes like anaemia or hypothyroidism may be found, but in most patients, fatigue is unexplained. . .In contrast, SLEDAI or BILAG do not correlate with fatigue." [51] T5 expressed a similar sentiment when describing a results passage they were reading: "I am not going to act like I understand what any of this means. . .I would have to take the time to understand what these terms mean." Continuously having to reference earlier sections of a paper, or searching for term definitions on the internet, can be a major distraction, especially when multiple terms appear in a single sentence. Multiplying this by every sentence in a medical paper creates a categorically different barrier than one term might present.
Dense text is a barrier that every reader has encountered when learning to read in a new language or domain and is a core motivation for text simplification research. The nuance to this barrier in the context of medical research papers is that readers might have little interest in, or capacity for, mastering the language and norms of a particular paper, given that other papers they might read could use different language, and that they may be pressed for time and emotionally and mentally drained from handling their diagnosis.
Knowing what to read. Of the 12 participants, 11 (T1-3, 5-12) had a difficult time knowing if a paper held relevant information and invested intense reading effort to determine this. They read papers exhaustively top-to-bottom, reading most of the text and spending time making sense of dense results sections and descriptions of statistical analyses that they later had no use for (T2-3, 5-8). Much of the dense text participants reported struggling with (discussed in the previous barrier) ended up being in sections that they later discovered were less important to read (e.g., a detailed statistical results section).
One clear example of this was T5, who reported struggling to read the entire first paper they selected because they wanted to do their due diligence by understanding the results. After getting to the discussion they realized that it provided an accessible overview of the results, so for future papers they ignored the technical results sections. As they explained, "the results, which in my mind would be the first place I would want to go to. . .are very technical and I am not going to know what that means. . .so a general discussion of the results will be more helpful."
While some used a paper's introduction to determine how useful a paper would be, many participants did not trust their ability to know what a paper would contain without exhaustively reading it (T3, 6-8). T6 and 8, for example, both suspected that certain papers would not be useful after reading the abstract or introduction, but continued reading the papers because they hoped they would still find something that was helpful. As we will discuss more in the next barrier, searching for answers, sometimes there was indeed information not surfaced in the introduction or abstract that participants wanted to know, such as low-level details on participant demographics. Participants could invest immense effort to determine if a paper contained this information. In the case of T6, they spent 40 minutes reading a single paper. In another case, T7 reported that they suspected there was useful information in a paper, but it would take them too much time to find it. T3 expressed a similar sentiment of wanting a way to know exactly what to read first in a paper: "I would love some sort of. . .thousand foot-view, which is kind of what I needed in the beginning. Make [the paper] less designed for doctors, and make it more patient friendly, where you are less overwhelmed by all the information all at once, where you can search it out in smaller bites."
When asked to elaborate, T3 explained that the smaller bites of information could provide high-level findings that they could follow up on for more details if they were interested. It is worth noting that some biomedical papers do structure abstracts with high-level summaries of all sections first or include article highlights at the beginning of the paper, which could help non-expert readers as well as scientists reading these papers.
Searching for answers. Participants in our study had specific information they tried to find in the paper, but struggled to do so (T2, 4, 6, 9-10, 12). In contrast to the previous barrier where participants struggled to know what to read in a paper, sometimes participants knew what they wanted to read, but couldn't find this in the paper. The two most salient examples of this barrier were searching for patient demographics and previous treatment options. T2 tried to find information on specific demographic groups in the study to see if they matched their scenario. They had to read through the entire article to find a table with patient demographics and a single sentence within the discussion section making reference to the patient group most relevant to them. Abstracts also did not talk about study demographics or current best practices for treating an illness. Introductions would often include useful information, but it was hidden in background paragraphs or quickly mentioned before moving on to the novel results. Participants therefore had to sift through headers and paper sections, making sense of unfamiliar terms and dense text (two previously discussed barriers) while trying to determine if each sentence was relevant to them.
Relating findings to personal circumstances. Some participants also wanted additional information from the papers that was personally relevant to them (T2, 5, 8-9, 11). T2 and 8 imagined a tool that could explain how a treatment would affect them, such as by providing patient testimonials for treatments in the paper or results for slices of patients based on demographics. For example, T2 read a paper that reported a 60% reduction in pain after a surgery, but they wanted to know whether patients regretted the surgery or would recommend it. They also wanted results for a slice of patients most similar to their hypothetical scenario, a 20 year-old male smoker, but the paper only presented average reductions across all patients. T5 found it helpful when an article made reference to the monetary cost of different treatments as a way of referencing patient experiences, though this only happened in one paper. While this personally relevant information was not the goal of the research papers, participants sought this information as a way of relating the information in the paper to their own lives.
These barriers are unique, or uniquely challenging, to our envisioned readers, necessitating a novel approach to ameliorating them. Past interactive reading systems for research papers have assumed readers have extensive domain knowledge, are able to make sense of paper text as a way of resolving unfamiliar terms [44], know the right questions to ask of a paper [110], and understand the basic structure of a paper [6]. In contrast, the barriers we identify illustrate that these assumptions do not hold for non-expert readers. Below we discuss how our system, Paper Plain, seeks to address these barriers using plain language (gists) and a collection of key questions as a reading guide, both novel techniques in the context of interactive reading systems for research papers.

PAPER PLAIN: READING SUPPORT FOR MEDICAL RESEARCH PAPERS
Paper Plain makes medical papers approachable to non-experts. Unlike other systems in the augmented reading space for research papers, Paper Plain focuses on the barriers of non-experts, such as knowing where to invest reading effort.
To address this reading context, Paper Plain integrates existing features like term definitions with novel navigational guidance and reading support through a Key Question Index and Answer Gists.
We focus on four of the five barriers discussed in §3: unfamiliar terminology, dense text, knowing what to read, and searching for answers, because these were the most common barriers we observed that hampered readers' ability to get useful information from the papers. In contrast, relating findings to personal circumstances reflected a desire for additional information and was less focused on understanding information in the paper itself.
We followed an iterative design process for developing Paper Plain. A total of 8 participants used 2 early prototypes of Paper Plain in qualitative usability evaluations. Participants were recruited from our institution, our professional networks, and Upwork. These evaluations lasted one hour each. The iterative design is described in more detail in Appendix B.
One finding from the iterative design we would like to highlight here is the need to supplement, rather than replace, original paper text. We observed participants double-checking generated plain language (the gists) against the original text. When asked their reasons for doing so, participants mentioned generated text being vague or wanting to confirm information with the original paper. NLP systems are imperfect (e.g., by generating inconsistent information [71]) and these observations highlighted the risk of relying solely on generated content. Because of this, in Paper Plain's design all gists were placed as close to the original text as possible without overlapping, and gist content was provided on-demand, rather than initially displayed along with the paper, to encourage readers to focus on the paper and only pull from the gists for supplemental information. We discuss future designs to further encourage reading original paper text in §8.2.

Paper Plain design
Based on feedback from the iterative design process, Paper Plain was designed with the following features:
(1) Term Definitions - Tooltips provide definitions of unfamiliar terminology from the open web.
(2) Section Gists - In-situ plain language section summaries support readers' understanding of dense paper text.
(3) Key Question Index - Key questions in the sidebar guide readers to relevant answering passages.
(4) Answer Gists - Plain language summaries of the answering passages help readers understand the important information contained in these passages.
Paper Plain supports healthcare consumers in making sense of medical research papers. To illustrate how Paper Plain is designed towards this goal, we describe how a fictional reader, Sarah, leverages Paper Plain to achieve their goal of finding more information about new treatment options. Sarah is a first-time reader of medical literature and therefore might differ from some readers who have become familiar with medical terminology because of prior efforts to understand a chronic condition. That being said, we believe that Paper Plain's features that highlight useful information in a paper can support first-time as well as regular, non-expert readers.
Sarah was recently diagnosed with Systemic Lupus Erythematosus (SLE, also called Lupus), an autoimmune disease.
Currently their symptoms are mild: some joint pain and tiredness, but symptoms can worsen and become debilitating over time. When Sarah discusses treatment options with their doctor, Sarah doesn't know if there are treatments the doctor does not mention that Sarah would be interested in. To be informed on available treatments, Sarah finds a research paper about possible new treatment options, titled: "Therapeutic peptides for the treatment of systemic lupus erythematosus: a place in therapy." [98] After reading the title, Sarah has many questions - What is the paper about? What are therapeutic peptides? Are they a possible new treatment for SLE? - and begins reading.
4.1.1 Sarah feels overwhelmed while using a default PDF reader. Sarah starts at the first paragraph of the introduction and immediately becomes stuck on sentences like:
"SLE is characterised by a multifactorial pathogenesis, in which the combination of a favourable genetics and the intervention of external agents may induce the chronic activation of the innate (neutrophils, macrophages, complement system) and the adaptive (T and B lymphocytes, plasma cells, auto-antibodies) immune system."
Even though Sarah is familiar with several technical terms like "innate immune system" and "chronic activation" from reading medical pamphlets and other patient-friendly SLE literature, Sarah does not know the meaning of many unfamiliar terms (e.g., "multifactorial", "neutrophils", "complement system"). Unable to gauge how critical these terms are for understanding the introduction, Sarah looks up every term one-by-one on the internet. The context switching makes it hard for Sarah to recover their place in the paper each time. After ten minutes, Sarah realizes that this first paragraph merely provides background on SLE. They haven't made it to the second half of the introduction.
[...] each section includes a description of how the peptide works and its clinical trial results. Sarah is motivated to get a high-level sense of each available peptide, but it will require reading 20 pages of dense text. From the introduction Sarah gathered that not every peptide has equally promising results and each might be used for different treatments of SLE (e.g., more moderate or more severe cases), so Sarah would prefer to only read in depth about the most promising peptides relevant to Sarah's mild case of SLE. Skimming through each section, Sarah believes some information might be relevant, but it is hard to tell without reading the section in depth. Sarah is disheartened that they can't get more details about promising clinical results without going through these walls of text.
Paper Plain makes it easy for Sarah to quickly determine what sections are interesting to them and understand the sections with in-situ plain language summaries (Section Gists). Sarah clicks on an angled flag next to the section title, and a tooltip appears adjacent to the section text (Figure 3). The tooltip contains a summary of the section stripped of jargon.
Rather than sentences like "SLE patients and animal models are characterized by the production of autoantibodies reacting against epitopes of the spliceosome.", the summary explains that "People with SLE have antibodies that attack parts of their own bodies." Sarah learns from the section gist that this particular peptide has had some good preliminary results, but that further studies have had less successful results. Sarah confirms these details by skimming the section and decides this section isn't so relevant to them. Sarah uses the Section Gists for the rest of the peptide sections, writing down a few peptides that they are interested in keeping track of, without having to parse all the dense, mostly irrelevant text. Sarah completes their reading of these sections in 15 minutes rather than spending hours going through each section in depth.
Fig. 4. Key Question Index guides readers to answering passages and their Answer Gists. When one of the questions is clicked (1), the interface will scroll (2) to the first answering passage (purple) and display a tooltip (orange) containing the Answer Gist. In (3), we show the simplified Answer Gist alongside the original paper text.

4.1.4 The Key Question Index and Answer Gists help Sarah focus on the most important questions and relevant passages.
Sarah gets to the end of the paper using the Section Gists to read only some sections in depth, but is worried they might miss important information in the paper because they didn't know to look for it. Sarah got a general sense of each section using the Section Gists but is curious if there is some information that the general summaries might not have surfaced, especially in larger sections containing lots of relevant information, such as the Discussion or Introduction.
As an alternative to assessing relevance with Section Gists, Paper Plain provides Sarah with key questions linked to answering passages in the paper along with plain language answers to point Sarah to important information. Sarah looks to Paper Plain's sidebar and sees questions about the paper that cover key information, such as "What did the paper do?" and "What did the paper find?" Sarah sees that the question "What did the paper find?" hyperlinks to multiple passages within the Discussion (see (1) in Figure 4). They click on the first link. Paper Plain scrolls through the pages and settles on a highlighted paragraph in the Discussion summarizing the most promising therapeutic peptides (see (2) in Figure 4). Unfortunately, the answering passage looks dense. As Sarah prepares to look up more terms, they notice a tooltip below the answering passage containing a plain language summary (an "Answer Gist"). This answer gist is a quarter the length of the original paragraph and contains none of the unfamiliar terms (see (3) in Figure 4). While the answer gist by itself might not contain all the information Sarah wants, they can read the original paragraph along with the answer gist, comparing the complex wording with plain language, and get a general understanding of the paragraph without being overwhelmed by technical jargon. Similar to the Section Gists, Sarah can then dive into the original passage with this understanding to get more details. Sarah clicks through the rest of the links for the same question, which scrolls them to individual paragraphs in the discussion that cover the most important findings and interpretations of the paper.
Paper Plain's key questions also guide readers to the questions they might not know to ask about a paper. Before finishing reading the paper, Sarah looks through the rest of the questions in Paper Plain's sidebar. Each question is accompanied by a one-to-two sentence plain language answer preview and hyperlinks to one or more paragraphs in the paper that answer the question. With only a handful of key questions and short answers, a majority of the questions can be displayed in the sidebar without scrolling so Sarah can quickly read all the questions and answers with minimal effort (see (1) in Figure 4). Sarah sees and clicks on one question they hadn't thought to look for in the paper: "What [...]
Sarah has spent only a few minutes to learn the most important information about the paper for them: these are not treatments they could ask their doctor to prescribe them, but there might be some promising clinical trials Sarah could look into. They also feel confident that for future papers they could use this key question sidebar to quickly get a high-level summary of the most important information in a paper.

IMPLEMENTATION
Paper Plain leverages active research in NLP for biomedical question answering [108] and plain language summarization [42] to address reader barriers. Below we discuss the implementations powering each feature of Paper Plain.
While additional algorithmic advances or human oversight, specifically for ensuring factuality [71], are necessary to make deploying such a system safe, our current implementation indicates the potential for Paper Plain to be deployed at scale over the medical literature.

Term Definitions
Paper Plain identifies medical terms in the paper using scispaCy Named Entity Recognition (NER) [79] and links these terms to definitions from the Unified Medical Language System (UMLS)5 or Wiktionary.6 The extraction and linking process led to many false positives (e.g., identifying terms like 'expert' or 'negative'), so we additionally filter terms based on word frequency and length. For both Wiktionary and UMLS, we preserve the bottom 80% of terms based on word frequency and remove all terms at or above 30 characters (terms over 30 characters were usually ill-formed, for example, containing a citation string or the beginning of the next sentence). We additionally filter all Wiktionary definitions to those containing at least one of the following tags: 'medicine', 'organism', 'pathology', 'biochemistry', 'autoantigen', 'genetics', 'cytology', 'physics', 'chemistry', 'organic chemistry', 'immunology', 'pharmacology', 'anatomy', or 'neuroanatomy.'
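The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the system's actual code: the `term_freqs` mapping (term to corpus word frequency) is a hypothetical input, and we assume that "bottom 80%" means the least frequent 80% of candidate terms, so that common words such as 'expert' are the ones dropped.

```python
from typing import Iterable

MAX_TERM_CHARS = 30   # terms at or above this length are removed
KEEP_FRACTION = 0.80  # assumption: keep the least frequent 80% of terms

# Wiktionary tags listed in the text above.
MEDICAL_TAGS = {
    "medicine", "organism", "pathology", "biochemistry", "autoantigen",
    "genetics", "cytology", "physics", "chemistry", "organic chemistry",
    "immunology", "pharmacology", "anatomy", "neuroanatomy",
}


def filter_terms(term_freqs: dict[str, float]) -> list[str]:
    """Keep the rarest 80% of candidate terms, then drop over-long ones."""
    # Sort rarest first; ties are broken alphabetically for determinism.
    ranked = sorted(term_freqs, key=lambda t: (term_freqs[t], t))
    keep_n = int(len(ranked) * KEEP_FRACTION)
    return [t for t in ranked[:keep_n] if len(t) < MAX_TERM_CHARS]


def has_medical_tag(definition_tags: Iterable[str]) -> bool:
    """True if a Wiktionary definition carries at least one allowed tag."""
    return any(tag in MEDICAL_TAGS for tag in definition_tags)
```

Applying the frequency cut before the length cut is a choice made for this sketch; the text does not state the order in which the two filters were applied.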

Section Gists
We define section gists for the lowest level subsections in the paper (e.g., 2.2.1). To generate section gists, we concatenate the first sentence of every paragraph in a section and generate a plain language summary of it using GPT-3 [18]. GPT-3 is a pretrained generative model released by OpenAI that has obtained state-of-the-art results on many language tasks using different prompts for generation [18] and is commonly used for many generative tasks (e.g., generating plain language). We engage in prompt engineering, a common practice for achieving fluent text for large generative models [66], to encourage fluent and specific plain language summaries. We use a GPT-3 prompt adapted from a preset example that OpenAI provides for simplifying text.7 We modified the prompt to suggest a fifth grade reading level rather than second grade. We also tested later grades, up to college, but found that the generated text using the fifth grade reading level prompt was the most coherent while still providing some details about the section. Sentences were extracted manually for our prototype system, but could be automatically extracted using PDF parsing methods [67,94].
Using the leading sentence of each paragraph is a common competitive baseline for summarization [36]; we choose this strategy rather than inputting the full section text because GPT-3 is prone to copying the text verbatim when given the full section.
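As a rough illustration of the lead-sentence strategy, the following Python sketch builds the text that would be passed to the generative model. It is not the prototype's code (sentences there were extracted manually), and the regular-expression sentence splitter is a deliberately naive assumption that will mis-split on abbreviations such as "e.g." followed by a capitalized word.

```python
import re

# Naive boundary: sentence-final punctuation, whitespace, then a capital letter.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def first_sentence(paragraph: str) -> str:
    """Return the leading sentence of a paragraph."""
    return _SENTENCE_BOUNDARY.split(paragraph.strip(), maxsplit=1)[0]


def lead_sentence_input(paragraphs: list[str]) -> str:
    """Concatenate the first sentence of every non-empty paragraph in a section."""
    leads = [first_sentence(p) for p in paragraphs if p.strip()]
    return " ".join(leads)
```

The returned string would then be inserted into the simplification prompt in place of the full section text.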
We observed variations in summary quality, such as hallucinated or incorrect information (e.g., calling peptides a surgery), repeated words or sentences, and copied text from the original passage. In these cases, we would regenerate the summaries up to five times and select the most fluent or correct generation. This usually provided a coherent and correct summary, but there were examples of text copied from the original passages that persisted. We discuss generation quality and accuracy further in §8.3. More details on the GPT-3 prompt are in Appendix C.
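The selection described above was done by hand. Purely as an illustration of how one of the listed failure modes, copied text, might be pre-screened automatically, the Python sketch below scores each candidate by the share of its word 5-grams that also occur in the source passage. The function names, the 5-gram window, and the idea of ranking by this score are all assumptions of this sketch; it does not assess fluency or factual correctness.

```python
def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Lower-cased word n-grams of a text, in order of appearance."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def copied_fraction(summary: str, source: str, n: int = 5) -> float:
    """Fraction of the summary's word n-grams that appear verbatim in the source."""
    summary_grams = word_ngrams(summary, n)
    if not summary_grams:
        return 0.0
    source_grams = set(word_ngrams(source, n))
    return sum(g in source_grams for g in summary_grams) / len(summary_grams)


def pick_least_copied(candidates: list[str], source: str, max_tries: int = 5) -> str:
    """Among at most `max_tries` generations, return the one copying the least."""
    return min(candidates[:max_tries], key=lambda c: copied_fraction(c, source))
```

A human check would still be needed afterwards, since a summary can avoid copying and still contain hallucinated or incorrect information.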

Key Question Index and Answer Gists
Key questions were drawn from two sources designed to translate medical findings applicable to patients: the PICO framework [88] for clinical questions and Cochrane's guide on writing plain language summaries [3]. Both sources focus on information in medical papers that is relevant to patients and caregivers. We curated 8 questions from the two sources for inclusion in Paper Plain; these are listed in Table 4 in the Appendix.
For each question, Paper Plain extracts relevant passages from the paper using an extractive question answering (QA) system trained on BioASQ, a biomedical question answering task [108]. Because this QA model extracts single words or phrases rather than full passages, we used the entire paragraph that contains an answer extracted by the model. For our prototype system, we manually labelled sentence boundaries of the extracted answers on the PDF to ensure high quality bounding boxes for display. Recent work has improved the accuracy of automatic sentence bounding box extraction from PDFs [94], which could be used to automate this step in the future. We follow prior work on making QA models more robust by including semantically-equivalent variations of questions [40].
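Expanding a short extracted answer to its enclosing paragraph can be sketched as follows. This Python example assumes, hypothetically, that the paper text given to the QA model is the paragraphs joined by a blank line and that the model reports the character offset at which the answer span starts; neither detail is specified in the text above.

```python
from bisect import bisect_right


def paragraph_offsets(paragraphs: list[str], sep: str = "\n\n") -> list[int]:
    """Start offset of each paragraph within sep.join(paragraphs)."""
    offsets, position = [], 0
    for paragraph in paragraphs:
        offsets.append(position)
        position += len(paragraph) + len(sep)
    return offsets


def containing_paragraph(paragraphs: list[str], answer_start: int,
                         sep: str = "\n\n") -> int:
    """Index of the paragraph that contains the character at `answer_start`."""
    total_length = len(sep.join(paragraphs))
    if not 0 <= answer_start < total_length:
        raise ValueError("answer_start is outside the document")
    # The containing paragraph is the last one starting at or before the offset.
    return bisect_right(paragraph_offsets(paragraphs, sep), answer_start) - 1
```

The returned index identifies the paragraph to highlight and to pass on for simplification into an answer gist.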
In the system, we highlight the paragraph containing the answer and display an answer gist summarizing the answer. We create answer gists by simplifying the extracted passages using GPT-3 [18] with the same prompt we use for simplifying section gists. We also include the first 1-2 sentences of the answer gist in the sidebar along with the question.

USABILITY STUDY
Paper Plain is meant to help readers engage with medical research papers important to them. We ran a within-subjects usability study to assess how well Paper Plain's features meet this goal.
The study answers the following questions: RQ1-How did participants use Paper Plain's features?
• Did participants prefer some features over others?
• Did participants use features throughout the reading session?
• Did presence of one feature affect usage of another feature?
• Did participants traverse linearly through a paper or employ a jumping reading strategy?
RQ2-How does Paper Plain affect participants' self-reported reading difficulty, understanding, and ability to identify relevant information?
• ...in comparison with a standard PDF reader?
• How does providing reading guidance (i.e., the Key Question Index and Answer Gists features) affect these self-reported metrics?
• ...in comparison with an interface with only non-guidance features (i.e., Section Gists and Term Definitions)?
RQ3-Do we observe any difference in paper comprehension when participants use Paper Plain?
• ...in comparison with a standard PDF reader?
• What is participant behavior in the presence of incorrect system predictions (e.g., vague information or factual errors in generated gists)?
6.1 Method
6.1.1 Participants. We recruited participants from Upwork using the same recruiting materials as §3.0.1. We again recruited from both the "Editing & Proofreading" job category and "Customer Research" to attract a broad sample of workers with varied degrees of reading and writing experience and to remain consistent with §3.0.1. All participants were paid US$15 for the hour-long study.
A total of 24 Upworkers (9 male, 1 non-binary, and 14 female) participated in the study. Participants' ages ranged from 19 to 67 (mean = 35.04). All participants had completed college, and a third had completed professional or graduate school. 79% of participants (19) had taken 3 or fewer STEM courses since high school and 92% (22) had never been involved in publishing a research paper. Similar to §3, no participants had professional medical experience.
6.1.2 Procedure. The usability study consisted of two parts, each corresponding to a scenario involving a patient with a particular diagnosis-systemic lupus erythematosus (SLE) or a herniated disc-and who was interested in exploring new treatments. The scenarios for each paper were drawn from §3.0.2. For each scenario, we selected a single paper ([98] for SLE and [8] for a herniated disc) for participants to read based on the most common papers readers selected in §3.
Each participant underwent the following study procedure once for each scenario. First, participants read a description of the scenario, the MedlinePlus page about their diagnosis, and the associated research paper. Then, they answered questions about the paper.
Participants read the scenario description and had 2 minutes to read the MedlinePlus page on the diagnosis. They went through a short tutorial on Paper Plain then read the paper for 10 minutes. They were told at 5 minutes and 9 minutes how much time they had remaining. After each paper, participants filled out subjective ratings and multiple choice questions about the paper (covered in §6.1.3). After the two scenarios, participants completed a questionnaire on their demographics, education, and research experience. Following the questionnaire, participants completed a short form on their experience using Paper Plain and what features they found most helpful. A researcher was present for the entire experiment and followed up on these answers with additional probing questions in a final interview.
6.1.3 Measures. We collected measures for assessing feature usage (RQ1), subjective reading experience (RQ2) and comprehension (RQ3):
Feature usage. To measure how participants used Paper Plain's features (RQ1) we collected telemetry data on interactions with Paper Plain's features (e.g., opening a definition tooltip or clicking on a key question). We report feature usage over the 10 minutes of reading each paper. We consider a usage pattern significant if the majority of participants exhibited it, as observed by researchers present in the experiment and corroborated by the rest of the authors when examining usage data.
Subjective reading experience. We collected subjective ratings to understand how Paper Plain affected participants' reading experience. Participants completed the ratings after reading each paper. These included:
(1) Reading difficulty: Participants rated their reading difficulty on a 1-5 Likert-style scale based on the question: "How hard did you have to work to read the paper?"
(2) Understanding: Participants rated their understanding of the paper on a 1-5 Likert-style scale based on the question: "How much do you feel like you understood the paper?"
(3) Relevance: Participants rated their confidence that they got all the relevant information from the paper on a 1-5 Likert-style scale based on the question: "How confident are you that you got all the relevant information from the paper?"
Comprehension. While Paper Plain's primary goal is to support readers in navigating and identifying relevant information in a paper (captured by our subjective reading experience measures), it is also important to ensure that Paper Plain's affordances do not detract from overall paper comprehension (e.g., by over-simplifying or incorrectly summarizing paper content). We developed multiple choice questions to measure the degree to which participants understood the paper (RQ3). Our goal when developing the multiple choice questions was to ensure that they:
• were specific to the individual papers,
• were relevant in a clinical context, and
• could not be answered directly from the Key Question Index and Answer Gists in Paper Plain.
We achieved these goals by writing 15-20 questions for each paper and having two practicing physicians not involved in the study provide feedback on the questions. The clinicians read the papers without Paper Plain, gave feedback on all questions, and selected 5-7 they thought were most meaningful for overall paper understanding and were important in a clinical context. We revised wording on any questions or answers that were unclear or easy to misunderstand according to the clinicians and two additional pilot studies. At the end, we selected 14 multiple choice questions, 7 for each paper. All questions could be answered from text not highlighted by the Key Question Index and Answer Gists.
Paper comprehension was measured as the proportion of questions answered correctly. Participants answered these questions after completing the subjective ratings for a paper.
While it is important to ensure that any augmentation does not negatively impact paper comprehension, it is worth noting that prior work on augmented reading interfaces has not shown significant differences in the number of comprehension questions answered correctly across conditions [7,44]. While observing an improvement in comprehension due to Paper Plain would be an exciting addition to our primary objective of improving reading experience, our goal with this measure was ensuring that Paper Plain's features did not lead to a loss in comprehension.
6.1.4 Interface variants. To understand the impact of Paper Plain's novel guidance-offering features on readers' experience engaging with medical research papers, we evaluated variants of Paper Plain with and without these features. There were three versions of Paper Plain and one baseline:
(1) Paper Plain - The full interface with the Key Question Index and Answer Gists, Section Gists, and Term Definitions.
(2) Questions and Answers - The guidance-focused variant with only the Key Question Index and Answer Gists.
(3) Sections and Terms - The variant without guidance, providing readers with the Section Gists and Term Definitions.
(4) PDF baseline - The baseline: a standard PDF reader without any of Paper Plain's features.
Assignment. We assigned each participant to two of the eight possible variant-paper configurations. Each participant saw each paper once, and all eight configurations had the same number of assigned participants. All assignments were counterbalanced so that each configuration was experienced as the first or second task the same number of times.
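The counterbalancing constraints above can be checked mechanically. The following is a minimal sketch of one design that satisfies them, not necessarily the procedure the authors used. It assumes 24 participants, that each participant sees two different variants, and placeholder paper names; none of these details is stated in this section.

```python
from collections import Counter
from itertools import permutations

# Illustrative labels only; the paper names are placeholders.
VARIANTS = ["Paper Plain", "Questions and Answers", "Sections and Terms", "PDF baseline"]
PAPERS = ["Paper A", "Paper B"]


def build_schedules():
    """Enumerate every two-task schedule with two different variants and both papers."""
    schedules = []
    for first_variant, second_variant in permutations(VARIANTS, 2):
        for first_paper, second_paper in permutations(PAPERS, 2):
            schedules.append(
                ((first_variant, first_paper), (second_variant, second_paper))
            )
    return schedules


def position_counts(schedules):
    """Count how often each variant-paper configuration appears as task 1 and as task 2."""
    counts = Counter()
    for first_task, second_task in schedules:
        counts[(first_task, 1)] += 1
        counts[(second_task, 2)] += 1
    return counts


schedules = build_schedules()  # one schedule per (assumed) participant
counts = position_counts(schedules)
for variant in VARIANTS:
    for paper in PAPERS:
        config = (variant, paper)
        print(config, "first:", counts[(config, 1)], "second:", counts[(config, 2)])
```

Under these assumptions, every configuration is seen by the same number of participants and appears equally often as the first and second task.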
6.1.5 Analysis. We compared readers' subjective ratings for reading difficulty, understanding, and relevance across the system variants (Paper Plain, Questions and Answers, Sections and Terms, PDF baseline) using mixed-effects linear models [65] with paper type and system variant as fixed effects and participant as a random effect. Using a mixed-effects model for each measurement, we first conducted F-tests for any significant difference across the system variants, and then conducted t-tests for differences in the estimated fixed effects between all pairs of system variants. More details are in Appendix D.
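As a sketch, the model described above can be written as follows. The notation is assumed here for illustration and may differ from the specification in Appendix D: $y_{ip}$ is the rating participant $i$ gave for paper $p$, $v(i,p)$ is the system variant used for that task, $\beta$ and $\gamma$ are the fixed effects for variant and paper, and $u_i$ is the participant-level random intercept.

```latex
y_{ip} = \beta_0 + \beta_{v(i,p)} + \gamma_{p} + u_i + \varepsilon_{ip},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\quad \varepsilon_{ip} \sim \mathcal{N}(0, \sigma^2)
```

In this form, the pairwise comparisons between system variants correspond to differences $\beta_a - \beta_b$ between two variants $a$ and $b$.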
We conducted a non-inferiority test [103] to assess Paper Plain's impact on comprehension. Our goal was to confirm that Paper Plain did not detract from paper comprehension. Prior work has suggested that plain language can oversimplify scientific findings, risking reader misunderstandings [91,97]. The multiple choice questions were designed to assess general paper understanding, meaning that all questions were answerable from information in the paper that appeared multiple times (e.g., the paper stating its main findings in the abstract, introduction, and discussion). Because Paper Plain was designed to make it easier for readers to access the information in a paper, not to add significant additional information, we did not expect that Paper Plain would dramatically improve comprehension as measured by our questions. However, we wanted to ensure that Paper Plain did not detract from comprehension.
The non-inferiority test was conducted using the statsmodels package in Python [92] as the lower bound of a TOST (two one-sided t-tests) for independent samples.
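To make the logic of the lower-bound test concrete, here is a minimal standard-library sketch. The study itself used statsmodels; this version spells out the arithmetic instead. The non-inferiority margin, the pooled-variance choice, and the scores below are all assumptions for illustration and are not taken from the study.

```python
import math
from statistics import mean, variance


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for Student's t, via Simpson's rule on the density from 0 to |t|."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return norm * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    area = total * h / 3  # P(0 < T < |t|)
    return 0.5 - area if t >= 0 else 0.5 + area


def non_inferiority_test(treatment, control, margin):
    """Lower-bound test: H0 is mean(treatment) - mean(control) <= -margin."""
    n1, n2 = len(treatment), len(control)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean(treatment) - mean(control) + margin) / std_err
    return t_stat, df, t_upper_tail(t_stat, df)


# Made-up scores out of 7; NOT data from the study. The margin is hypothetical.
with_tool = [4, 3, 5, 4, 2, 5]
baseline = [3, 4, 3, 5, 2, 4]
t_stat, df, p_value = non_inferiority_test(with_tool, baseline, margin=1.0)
print(f"t({df}) = {t_stat:.2f}, one-sided p = {p_value:.3f}")
```

A small p-value rejects the hypothesis that the treatment group scores worse than the control group by at least the margin, which is the sense in which the treatment is "non-inferior".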
For qualitative findings, one author conducted a thematic analysis on the observations of the study sessions and discussed themes with the other authors to refine these themes. Themes were identified via open coding and discussed in 3 weekly meetings with all authors. One author coded all interviews, while another author verified the themes in one of the interviews.

RESULTS
Below we report our findings from the usability study broken down by research question.

How did participants use Paper Plain's features?
Most participants interacted with all the features of Paper Plain available to them. All participants with access only to the Key Question Index and Answer Gists (Questions and Answers) clicked on at least one Key Question and opened an Answer Gist. Usually they clicked on many more: on average, participants with this variant clicked on 15 Key Questions and Answer Gists. 11 out of 12 participants with the Section Gists and Term Definitions (Sections and Terms) clicked on a Section Gist and a Term Definition. On average, participants with this variant clicked on 18 Section Gists and 5 Term Definitions.
When participants had access to all the features, they often opted for the Key Question Index and Answer Gists. 11 out of 12 participants with access to all of Paper Plain clicked on a Key Question and opened an Answer Gist, doing so on average 13 times for Key Questions and 14 for Answer Gists. In contrast, only 8 participants with Paper Plain clicked on a Section Gist or Term Definition. Participants who did engage with these latter features also used them much less, clicking on average only 7 Section Gists and 4 Term Definitions. Figure 7 plots the usage of each feature for Paper Plain and illustrates this preference for the Key Question Index and Answer Gists when all features were present.
Participants used Paper Plain's features throughout reading a paper. Figure 6 plots the number of participants using each feature over the course of reading a paper. There is a slight 'warm-up' period for each feature, usually in the

Fig. 8. Participants' reading behavior differed when reading one paper with the Key Question Index and Answer Gists and another paper without. Each plot's y-axis shows the vertical position of the participant's viewport relative to total paper length (e.g., the bottom and top of each graph are the beginning and end of the paper, respectively). We observe much more jumping behavior when the Key Question Index and Answer Gists are present.

Fig. 9. Readers' subjective reading difficulty, confidence in their understanding of the paper, and ability to get all relevant information from the paper for different variants of Paper Plain. Axis labels: Harder to Easier; Less Understanding to More Understanding; Less Relevance to More Relevance.
7.2 How does Paper Plain affect participants' self-reported reading difficulty, understanding, and ability to identify relevant information?
Figure 9 plots the reading difficulty, understanding, and relevance scores for both papers across each system variant, and we observe significant differences between them. This is also reflected in our mixed-effects model F-test (p < 0.001 for all three measurements after Holm-Bonferroni [46] correction). We report estimated fixed-effect coefficients in Appendix D and instead discuss more interpretable results comparing system variants in this section. We report medians (denoted x̃) for each subjective rating because ratings were scored on Likert-style scales.
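For readers unfamiliar with the Holm-Bonferroni correction mentioned above, the following is a minimal sketch of the standard step-down procedure. The input p-values are hypothetical and are not values from the study.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three measures; not values from the study.
raw = [0.01, 0.04, 0.03]
print(holm_bonferroni(raw))
```

An adjusted p-value below the chosen threshold (e.g., 0.05) indicates that the corresponding test remains significant after controlling for the number of comparisons.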
Table 2 presents the differences in the fixed-effects between all pairs of interface variants. Participants with Paper Plain were significantly more confident that they got all relevant information from the papers (x̃ = 4.00, σ = 0.87, with 5.00 being the most confident) and understood the papers (x̃ = 3.50, σ = 0.69), compared to the PDF reader baseline (x̃ = 2.50, σ = 1.00 and x̃ = 2.00, σ = 1.00). Participants with Paper Plain also rated their reading difficulty significantly lower (x̃ = 2.00, σ = 1.06, with 5.00 being hardest) compared to participants who had the PDF reader baseline (x̃ = 4.00, σ = 1.04).

Building on our qualitative findings in RQ1, we saw that participants' use of Paper Plain's features made them more confident in their ability to find information important to them in the papers. This support manifested differently based on the Paper Plain features available to a participant. There were two major ways we saw Paper Plain improving participants' reading experience: providing in-situ support with the Section Gists and Term Definitions, and providing a high-level overview with the Key Question Index and Answer Gists.

The in-situ nature of the Section Gists and Term Definitions helped participants understand the paper without switching contexts (P2, 6, 7, 11, 16-17, 19). For example, P19 found the Term Definitions useful for understanding the paper and the more specific medications it mentioned. P2 reported that the Section Gists were helpful for understanding the paper text in a language they understood, and P17 found the Section Gists broke "down complicated medical text into layman's terms that are easily understandable and helped to keep up with the flow of the article."
While participants could search for definitions of terms and potentially make sense of a passage with a search engine, both activities require turning away from the paper itself. This context switching can make it difficult to keep a thread of reading, especially when that reading is demanding. Our observations suggest that Paper Plain's in-situ support successfully provided information to participants with minimal context switching.
Participants also used the Key Question Index and Answer Gists to get an overview of a paper quickly and easily, boosting their confidence to then dive into the paper text (P2-3, 9-11, 20). P9 reported that "with so many sample sizes, numbers, and information to go through, it was helpful to get a summary to direct my reading and understanding." P20 mirrored this sentiment, explaining that the simplified answers gave them the gist of the entire paper quickly, so they had more time to get into its details. P3 illustrated these benefits well, explaining that the Key Question Index and Answer Gists were "beneficial because ... I could have a baseline of what to expect and my mind would not have to pull in many random parts of information and could easily block what I did not need when I only needed a couple bits while I was reading." Similar to how the Key Question Index and Answer Gists supported a non-linear reading strategy (described in §7.1), it seemed that the Key Question Index and Answer Gists allowed participants to get a general sense of a paper early and focus their reading on the sections they found most important.
The Key Question Index and Answer Gists provided useful guidance for readers in this reading context. As shown in Table 2, readers who only had the Key Question Index and Answer Gists rated their reading difficulty significantly lower (x̃ = 3.00, σ = 0.97) than participants with the baseline PDF reader (x̃ = 4.00, σ = 1.04). Participants with the Key Question Index and Answer Gists also rated their confidence that they got all relevant information in a paper (x̃ = 4.00, σ = 0.94) and that they understood the paper (x̃ = 4.00, σ = 0.89) significantly higher compared to the PDF baseline (x̃ = 2.50, σ = 1.00 and x̃ = 2.00, σ = 1.00).
The preference for the Key Question Index and Answer Gists illustrates the importance of the novel guidance technique in Paper Plain. 18 out of 20 readers who had the Key Question Index and Answer Gists in at least one condition selected the Key Question Index, not the Answer Gists, as the most helpful feature. P18, who selected the Key Question Index as the most helpful feature, said they would absolutely use the questions, because "...medical papers are difficult to follow and understand without guidance." Participants reported liking the Key Questions for quickly finding and understanding relevant information (P2, 4, 7-10, 13, 18-20). P4 reported not having any idea how to approach the research papers, and the Key Questions helped guide them to questions they should have. P7 used the Key Questions because "It answered questions that I would have had if it was me in the scenario ... it helped highlight directly to the passage instead of having to sift through all of the information." These findings support the insight of this paper that novel guidance-offering features are important for supporting readers in approaching medical research papers.

p < 0.05). Figure 10 plots the comprehension scores for each system variant. The scores suggest that Paper Plain's primary objective of improving reading experience was achieved while not hindering comprehension. We designed the comprehension questions so that their answers could be found in the original paper text, not in the gists. For this reason, it is not surprising that participants scored similarly on the comprehension questions across variants. While comprehension was not the primary objective of Paper Plain, improving readers' understanding of medical papers is also important for ensuring they have productive conversations with their healthcare providers and make informed decisions about their health. We discuss future interventions focused on improving paper comprehension in §8.5.
Participants generally found the generated gists useful, and when confronted with vague system predictions and generations, participants usually, though not always, used the original text to fill in missing information. We observed one participant, P11, who read only the Answer Gists for a paper and rated their confidence in understanding the paper at a 5 (the highest) while rating their reading difficulty at a 1 (the easiest). However, this participant got only 2 out of 7 comprehension questions correct, well below the average of 3.73 for all participants, suggesting that the gists were not sufficient for answering many of the comprehension questions. In contrast to this participant, other participants reported that the gists (both Answer and Section) were helpful as a starting point for understanding, but they looked at the underlying text, too. Some participants also reported that information in the gists was vague or missed information in the original text, necessitating reading the original (P10, 22, 24). P24 made sure to double check all the information in the gists against the original sections because the gists were automatically generated. While they did not find incorrect information in the gists, they did report that the Section Gists sometimes were vague or reported on details less important to them while leaving out details that were more important to them (e.g., the percent of people who recovered from a surgery was reported in a section but not in the Section Gist). P10 also noticed that the area surrounding some of the highlighted answers contained useful information, and so made sure to go back through the answering passages to read the surrounding text in addition to the Answer Gist and passage. While it seems that most participants found the gists useful and read the original text alongside the generated gists, we discuss future designs to encourage reading original paper content in §8.2.

DISCUSSION & FUTURE WORK
This paper illustrates how interactive information interfaces can make research papers approachable to healthcare consumers who need them. In particular, we develop Paper Plain, an interactive system that augments the paper itself with new affordances to help readers navigate, evaluate, and understand its contents.

Summary of the results
How did participants use Paper Plain's features? Participants used and appreciated Paper Plain's features throughout reading a paper. Readers used the Section Gists to easily make sense of dense passages while reading a paper and leveraged the guidance of the Key Question Index and Answer Gists to quickly find text that was informative for them.
All but one participant said they would use Paper Plain to read medical papers.
The Key Question Index and Answer Gists were a clear favorite in the usability study. When participants had access to all features, they used the Key Question Index and Answer Gists more often than the Section Gists and Term Definitions. Participants used the Key Questions to jump to sections informative to them, in contrast to participants with the typical PDF reader or the Section Gists and Term Definitions. These results suggest that readers took advantage of the questions' affordance of fast-tracking to the important information in a paper.
How does Paper Plain affect participants' self-reported reading difficulty, understanding, and ability to identify relevant information? Participants who used Paper Plain rated their reading difficulty significantly lower and rated their confidence that they got all relevant information from a paper significantly higher. Participants found it easier to read with Paper Plain because it gave them an approachable overview of a paper with the Key Question Index and Answer Gists and helped them understand dense text in the context of the paper with the in-situ Term Definitions and Section Gists.
It is worth noting that non-experts can be overly confident in their understanding of scientific material [91], and therefore our subjective ratings of understanding should be judged with caution. That being said, the strong results for reading difficulty suggest that Paper Plain was able to support readers in overcoming some of the barriers we observed for reading medical research papers.
Do we observe any difference in paper comprehension when participants use Paper Plain? Participants who used Paper Plain had similar comprehension scores compared to participants using the typical PDF reader. While improving comprehension would have been an exciting addition to our findings that Paper Plain improved reading experience, the similar scores in comprehension provide compelling evidence that Paper Plain achieves its primary focus of lowering barriers to paper reading without any loss in paper comprehension.
In summary, we take these results to indicate the promise of Paper Plain for assisting healthcare consumers in making sense of medical research papers. The sustained usage of Paper Plain's features and positive response from participants in our usability study suggest that such a tool would be a welcome addition to healthcare consumers' information seeking toolkit.

Design implications
This paper's exploration of features can inform future interactive reading systems. We offer the following guidance for developing such future systems:
Provide reading guidance. Interactive reading systems for non-experts can provide more active support for guiding readers. Experts have strategies to quickly gather relevant information in a paper without engaging in a deep read (e.g., skimming) [93], but most readers in our studies didn't have a particular strategy for reading the papers, defaulting to an exhaustive linear pattern. This led to readers getting stuck in dense passages with minimal relevance to their scenario.
The features in Paper Plain offering guidance (i.e., the Key Question Index and Answer Gists) led to the largest improvements in reading experience by providing an alternative reading strategy. Readers with the Key Questions in our usability study jumped to important sections of the paper within the first few minutes of reading. While we did observe some tension in our iterative design where the Key Questions distracted readers who wanted to approach papers on their own, it seems that our final design for the feature (a toggleable sidebar) offered useful guidance without distracting readers. Indeed, participant feedback supports this: 18 out of 20 participants selected the Key Questions themselves, not the Answer Gists, as the most helpful feature.
Incorporate plain language into the original document. While guidance was helpful for directing readers' effort to sections of interest, readers still needed plain language to lower the effort needed to understand those sections. Every participant in our usability study with access to plain language features, either Answer or Section Gists, used them to make sense of the papers.
At the same time, any plain language should be in service of making the original document easier to read, not replacing it. Plain language from current generative models can contain inconsistencies [71], which risks misinforming readers. Paper Plain encourages readers to focus on the original paper text by having readers pull gist content rather than displaying it immediately with the paper and by placing gist content alongside the paper. Future systems could go further by reporting factuality measures along with generated text [83] or integrating a feedback mechanism for reporting inconsistent information in order to crowd-source these factuality checks.

Ethical and Social Implications
While we believe that lowering barriers to reading medical research papers can benefit healthcare consumers by informing them about their care, there are certain risks as well. One issue is that healthcare consumers can be unaccustomed to norms in the scientific process, such as the fact that a single paper does not represent scientific consensus. Readers might mistake findings or interpretations in a paper as truth, which could risk them making misinformed decisions about their care. At the same time, readers are already taking these risks and turning to medical research papers [30], and Paper Plain can help them understand a paper more easily than if they were on their own.
A key limitation of current generative models is their propensity to hallucinate, generating factually inconsistent or incorrect information [71]. These hallucinations could misinform readers, which would be extremely costly in the context of personal health information seeking. There is growing interest in evaluating factuality in generations (for common factuality measures, see [37,39,83]). We are excited to integrate new advances for measuring and ensuring factuality in generations (e.g., [33,60,77]) into Paper Plain to help realize the promise of such models in real-world settings.

Limitations
Recruiting participants on Upwork might have skewed the barriers we identified and the resulting design, since participants were not reading medical papers that were personally relevant to them. This could have led participants to pay less attention to specific paper details or to be less discouraged by negative findings and unclear results. We designed the studies to be as close as possible to the real work that healthcare consumers engage in when reading medical research papers by writing scenarios based on our findings from interviews with such readers.
The usability study was also a timed, relatively short (10 minute) reading task, which might have skewed participant reading habits. It could be that differences in comprehension would become more pronounced if more time were given.
Additionally, patterns of usage of the interface may start to look different after the first 10 minutes of reading. Some participants reported that if they had more time, they would have read the paper through again or looked for additional information. Others felt that the time limit made them anxious and they had a hard time remembering information. This might have also artificially inflated participants' use of the Key Question Index and Answer Gists, since they offered the fastest way of getting an overview of the paper. Participants also might have used the Section Gists more if given more time, since the Section Gists were helpful at allowing participants to go off on their own to explore sections of the paper. Because reading time could influence comprehension and subjective reading confidence, we decided it was important to keep time consistent across the study. In future work we are excited to observe the use of Paper Plain in more naturalistic settings.

Future directions
Our studies and system, Paper Plain, reveal exciting areas of future research in information interfaces for medical information-seeking and augmented reading interfaces broadly. We discuss a few of these directions below.
Enabling intelligent reading interfaces. As AI technology advances, new interfaces integrating this technology can provide tremendous value to users. Our work illustrates a path towards one such system with Paper Plain. When developing Paper Plain, we mapped its features to existing natural language processing (NLP) techniques like biomedical question answering (QA) [108] and plain language summarization [18]. There are additional NLP techniques that could augment reading experiences, such as machine translation [54], toxic language detection [47,68], or news story mapping [58]. We hope that our discussion in §5 of techniques to make machine output useful for readers (e.g., by providing full paragraphs rather than a single word in QA output) can provide useful insight for future reading interfaces integrating machine intelligence.
Improving paper comprehension. While our results show that Paper Plain reduced reading difficulty and improved confidence by addressing the barriers revealed in §3 without any loss in comprehension, an important future step is to design interventions for explicitly improving paper comprehension. Simplifying scientific information can risk over-inflating readers' sense of understanding and reduce their reliance on experts, even when such judgements are misplaced [91].
One possible way of improving comprehension would be to focus on protecting against common misunderstandings for healthcare consumers reading medical literature [30], such as by identifying predatory journals without peer review or by providing findings summarized from multiple papers.
Supporting healthcare providers and patient advocates. Our work focused on making medical research papers more understandable to healthcare consumers, specifically patients and caregivers, but Paper Plain could also benefit other stakeholders in medical research. For example, healthcare providers and patient advocates often need to read medical research papers to apply the findings in clinical practice [19,43,86]. The information needs and barriers of these groups differ from those of healthcare consumers, necessitating extensions of Paper Plain. Providers have to handle many more research papers covering many patients. For these needs, Paper Plain could extend to cover multiple papers at a time, such as by extracting answers for key questions across the papers. At the same time, providers can draw upon their medical experience for understanding papers, so Paper Plain could focus on extraction and summarization over plain language.
Addressing additional barriers for healthcare consumers. Extensions of Paper Plain could relate information in papers to readers' personal circumstances. During our formative study observing readers (§3), some participants expressed interest in knowing patient testimonials about treatments in the paper or wanting to know how individual patients most similar to the reader responded to treatments. While the current design of Paper Plain did not address this need since it required information not available in the paper (e.g., patient testimonials), it is exciting to imagine future systems that draw on other information sources, such as online support groups, to relate the information in medical papers to readers' personal experience.
Supporting non-experts in other domains. Medical research is not the only context where non-expert readers wrestle with highly technical documents; Paper Plain's design can inspire efforts in addressing similar barriers in these other contexts. Some aspects of these contexts merit new design efforts, while others might benefit from designs similar to Paper Plain's. For example, the important questions to ask while reading a medical paper are different from those for a legal contract or privacy statement. This necessitates a re-crafting of the Key Question Index and Answer Gists for other domains. Furthermore, some documents may need to be read in a particular order (e.g., a software tutorial), and providing an alternative index, as Paper Plain does, could confuse readers. In these cases, any key question indexes into the document would need to be aligned with the document's original structure. Providing in-situ Section Gists and Term Definitions could help address needs in these domains, as these features can help readers understand what they are reading within the flow of a document.

CONCLUSION
In this paper we ask how interactive interfaces can make medical research papers approachable to healthcare consumers who need them. Our key insight is that medical papers can be made more approachable by incorporating plain language summaries alongside original paper content and providing guidance on the most important passages to read. We illustrate these interactions with a novel system, Paper Plain, which draws on active research in natural language processing to show the potential for automated support in this reading context. In a usability study of Paper Plain, we found that participants who use Paper Plain have an easier time reading research papers compared to those who use a typical PDF reader, and that participants most appreciated the reading guidance offered by Paper Plain via a key question index. While further algorithmic advances are required to ensure a safe deployment, we envision Paper Plain as a system that can one day be enabled for any medical research paper.

3.0.3 Findings. Our study illustrated barriers readers face when reading medical research papers. The barriers were: unfamiliar terminology; dense text; knowing what to read; searching for answers; and relating findings to personal circumstances.
"...knowing what I know now I would probably skip the results section." This quote highlights that non-expert readers lack the knowledge of what they should, and shouldn't, read in a paper, leading them to take much longer learning what a paper has to offer. Other participants had similar experiences to T5, though they did not quickly determine what the best passages were for them to read (T2-3, 6-8).

Fig. 2. Term Definitions on an example passage with technical jargon. Terms with definitions are underlined ("armamentarium", "immunomodulatory"). Clicking a term will open a tooltip with a definition and a reference to the definitional resource.

4.1.2 Term Definitions help Sarah resolve technical jargon without distracting from reading. Paper Plain provides definitions for unfamiliar terms in the context of the paper so Sarah can seamlessly integrate new concepts into their reading. Continuing to read the introduction, Sarah reaches another passage full of technical jargon (Figure 2). In the first bullet, Sarah is unsure what "therapeutic armamentarium for SLE" means, preventing them from understanding what has been "poorly impacted." Rather than open a new tab to search, they click on the underlined term and a tooltip appears with a short definition retrieved from Wiktionary explaining that "armamentarium" refers to medical equipment. In the next bullet, Sarah sees a list of promising properties of "therapeutic peptides in SLE," but is unsure what "overall immunomodulatory effect" means. The definition tooltip again helps Sarah understand that therapeutic peptides can help control immune functions. Sarah continues reading, using the tooltips to resolve unfamiliar terms and avoiding the need to constantly switch tabs.

Fig. 3. Section Gists on an example passage with dense text. (1) Clicking on a tab indicator next to a section title displays a plain language summary of the section. (2) Tabs are positioned throughout the paper, providing summaries that can cover a lot of paper content, even across pages.

4.1.3 The Section Gists help Sarah decide whether to invest in reading dense passages. Equipped with Term Definitions, Sarah manages to learn from the introduction that peptides are indeed possible treatments for SLE and wants to learn more. This particular paper reviews 15 different peptides, each with a dedicated section averaging one page in length;

Fig. 5. Paper Plain uses machine learning models to add term definitions, section gists, and answer gists to the PDF.

Fig. 6. Number of readers who used each feature of Paper Plain for each minute of reading a paper. Plot is for participants with access to all features of Paper Plain.

Fig. 7. Number of times readers used each feature across the different variants of Paper Plain. Points represent individual readers. For example, all but one reader who had access to all features used Term Definitions no more than 5 times, as shown by the single blue dot above the rest at the far right of the plot, at the 'Term Definitions' tick. Notice the drop in usage of Section Gists and Term Definitions when all features are available (the blue boxplots).

Fig. 10. Comprehension scores for variants of Paper Plain. Score is the number of comprehension questions answered correctly out of seven for each paper.

7.3 Do we observe any difference in paper comprehension when participants use Paper Plain?
Participants on average answered 3.73 (σ = 1.51) out of 7 comprehension questions correctly. This is well above a random chance baseline of 1.75 (25% of 7, since each question had four equally likely answers), suggesting that the comprehension questions were answerable given adequate paper reading. Participants scored no worse on the comprehension questions with Paper Plain (μ = 3.67, σ = 1.78) compared to the PDF reader (μ = 3.50, σ = 1.31) (non-inferiority t-test, t(28) = 1.82,

Table 1 lists these barriers. Below we illustrate how these barriers manifested for non-experts reading

Table 1. Five barriers readers encountered when they sought an understanding of medical research papers without having prior medical research experience. All barriers were caused, or exacerbated, by their lack of expertise in medical research.

Table 2. Post-hoc (two-sided) tests for pairwise differences in fixed-effects estimates between interfaces. This table reports the difference in fixed-effects estimates between each pair of interface options and Holm-Bonferroni-corrected p-values [46] under our mixed-effects model. The interface options are Paper Plain, Key Question Index and Answer Gists, both Section Gists and Term Definitions, and the PDF baseline. For example, in the column comparing Paper Plain with the PDF baseline and the row for "Reading Difficulty," we can interpret the result as follows: Paper Plain is associated with, on average, a 1.983-point lower rating of reading difficulty than the PDF baseline when controlling for participant and paper. Statistically significant p-values are bold. More details about this analysis are in Appendix D.