Digital Comprehensibility Assessment of Simplified Texts among Persons with Intellectual Disabilities

Text simplification refers to the process of increasing the comprehensibility of texts. Automatic text simplification models are most commonly evaluated by experts or crowdworkers instead of the primary target groups of simplified texts, such as persons with intellectual disabilities. We conducted an evaluation study of text comprehensibility including participants with and without intellectual disabilities reading unsimplified, automatically and manually simplified German texts on a tablet computer. We explored four different approaches to measuring comprehensibility: multiple-choice comprehension questions, perceived difficulty ratings, response time, and reading speed. The results revealed significant variations in these measurements, depending on the reader group and whether the text had undergone automatic or manual simplification. For the target group of persons with intellectual disabilities, comprehension questions emerged as the most reliable measure, while analyzing reading speed provided valuable insights into participants’ reading behavior.


INTRODUCTION
Text simplification refers to the process of improving the comprehensibility of texts by reducing complexity at several linguistic levels, for instance, by using simpler vocabulary and syntactic structures, reorganizing text structure, and explaining difficult words and concepts. Primary target groups of simplified language include persons with intellectual disabilities, persons with dementia, prelingually deaf persons, and non-native readers [36]. In recent years, both the demand for simplified texts and the amount of available data have been growing. Therefore, the development of quantitative human evaluation methods which include and represent the primary target groups becomes increasingly important. This is all the more true with increasing numbers of automatic text simplification (ATS) models being developed [2]. However, current research in ATS mostly resorts to evaluations based on opinions of experts (e.g., simplified language professionals) or crowdworkers who are not part of the primary target groups.
Meanwhile, the use of information and communication technology, and mobile touchscreen devices in particular, is becoming an integral part of the daily lives of persons in these target groups [40,44]. This offers the potential of conducting human evaluations with target groups of simplified language in digital form. Apart from the efficiency gain in data collection and analysis compared to paper-and-pencil methods, digital assessment methods also allow participants to read texts in a more natural environment (possibly at home, on their own device), and make it possible to record detailed user interactions, enabling measurements such as reading speed or scrolling interactions as proxies for reading comprehension. However, there is currently little research on the most suitable methods for measuring text comprehensibility among the different target groups as well as on the effects of text simplification on these measurements. This issue also fundamentally concerns human-computer interaction, since persons with intellectual disabilities differ not only in their reading skills but also in their requirements for accessible user interfaces [10].
The aim of the present study is to explore different ways of utilizing digital tools for measuring comprehensibility. We determine the comprehensibility (sometimes also referred to as readability) of a text by measuring its comprehension on the part of members of a specific group of readers, while taking into account the fact that comprehensibility may differ between these groups. To discuss the suitability of these methods for evaluating ATS, we will also investigate the effect of the automatic simplification process on these measurements. More specifically, the study is guided by the following three research questions: (1) Which methods for measuring comprehensibility can distinguish between simplified and non-simplified texts? (2) What is the effect of manual and automatic text simplification on these measurements? (3) How do these effects differ between persons with intellectual disabilities (as a primary target group of simplified language) and a control group of persons without intellectual disabilities?
To answer these questions, we present results from an empirical study including participants with and without intellectual disabilities, using unsimplified, manually simplified (i.e., simplified by human experts), and automatically simplified German texts. To the best of our knowledge, this is the first study evaluating ATS for German with this target group.

RELATED WORK

2.1 Human evaluation of automatic text simplification
While human evaluation is the preferred way of evaluating the quality of ATS output, there is no consensus on best practices [4,5,55,60]. In recent ATS research where human evaluation was used, the most commonly applied methods were Likert scale ratings, usually for the categories simplicity, fluency/grammaticality, and adequacy/meaning preservation [38,39,49,54]. Less commonly, text comprehensibility or difficulty is evaluated using multiple-choice comprehension questions [3,32,33] or free recall questions [33]. Reading behavior, e.g., as measured by reading speed [3,15,46,50], scrolling interactions [23], or eye movements [46], is rarely considered.
In most cases, the participants of such comprehensibility studies are persons without disabilities or crowdworkers without specific inclusion criteria, who are not part of the primary target group of simplified language. This can be problematic, because what is considered difficult varies between reader groups [24,60], and the requirements for text simplification should not be considered universal [22]. Some exceptions of studies assessing ATS output among the target groups include experiments with deaf and hard-of-hearing adults [3], persons with intellectual disabilities [25,50] or dyslexia [46], and language learners [15]. Among these, Saggion et al. [50] is the most similar to our study, as they evaluated both manually and automatically simplified texts with persons with Down syndrome based on comprehension questions and reading time, in addition to an expert evaluation using Likert scale ratings. Their quantitative results did not show significant differences in comprehensibility between the different text versions, but they reported positive subjective perception of the simplified texts among target readers. Our study differs from this contribution in that it is fully digital, also making use of recorded user interactions, and we conduct the same comprehension assessment with persons with and without intellectual disabilities, which allows us to compare its effectiveness between the two groups.

2.2 Comprehension of simplified language by persons with intellectual disabilities
Fajardo et al. [19] conducted a study with 28 students with intellectual disability reading news articles in easy-to-read Spanish on paper, and correlated response accuracy in literal and inferential comprehension questions with linguistic measures such as word and sentence length. In a pilot study by Saletta and Winberg [51], 20 participants with intellectual or developmental disabilities read English texts that had undergone (among others) controlled manipulations reducing lexical and syntactic complexity. They measured errors while reading aloud and comprehension question response accuracy and found a significant effect on the former but not on the latter. They also found a high variability in reading comprehension among participants. For German, several studies have investigated the effect of specific features of simplified language on comprehension by persons with intellectual disabilities. Schiffl [53] conducted an experiment using eye-tracking with more than 80 participants, investigating the effects of word length and frequency. They found fundamental differences in eye movements while reading between persons with and without intellectual disabilities. Pappert and Bock [43] studied compound segmentation (a feature in several varieties of simplified German) using a lexical decision task with participants with intellectual disability or functional illiteracy. Bock and Lange [9] tested sentence and text comprehension skills of 28 persons with intellectual disabilities and showed that certain phenomena that are assumed to be too difficult for this target group (such as negation and personal pronouns) hardly caused any problems for the participants.
More generally, reading comprehension by target groups of simplified language has been studied by Jones et al. [27], using several (adapted) standardized tests with participants with mild and borderline learning disabilities. This study revealed that the participant group was highly heterogeneous with respect to reading comprehension abilities.

2.3 Digital assessment of reading comprehension
As mentioned in Section 1, digital assessment has the advantage of enabling measurements of reading behavior even without the expensive equipment and expertise necessary for eye-tracking experiments. Some previous work has studied the connection between behavioral measurements such as reading speed or scrolling behavior and comprehension [17,18,56,61], but there is only little research on exploiting these measurements for assessing comprehension or comprehensibility [23]. Our work contributes to this line of research by studying reading speed and response time as proxies for comprehension. While there is a relatively large body of literature both in human-computer interaction and in language assessment dealing with differences in comprehension and behavior when reading on digital devices compared to paper [1,13,29,30,57], almost no research has been conducted on how digital reading assessments need to be adapted for persons with intellectual disabilities. This is a significant research gap, given that these user groups have very different needs in terms of interface accessibility [10]. By comparing different assessment approaches between readers with and without intellectual disabilities, the present paper represents a first step towards addressing this research gap.

METHOD

3.1 Texts and comprehension questions
The texts used in this study originate from a parallel corpus of original and simplified German documents. The documents were created at capito, a provider of commercial text simplification services for German. Each document in the corpus was manually simplified by trained experts into one to three levels of simplification following the levels of the Common European Framework of Reference for Languages (A1, A2, and B2) [14]. All manual simplifications used in this study are at level A2. This means that most of the information from the original text is retained (i.e., there is little to no summarization involved, as would be expected at level A1), but simpler syntactic structures and vocabulary are used, complex terminology is explained either inline or at the end of the text, and the layout is more readable, e.g., using bullet point lists and shorter line lengths. Level A2 is roughly comparable to Easy Language (in German: Leichte Sprache), for which persons with intellectual disabilities are commonly listed as a primary target group [8,11].
We used a subset of this parallel corpus to train a neural ATS system (a fine-tuned mBART transformer model [35]) using the method described in Rios et al. [48]. From the remaining documents, we selected twelve texts according to several criteria: (1) The texts should be between 100 and 600 words in length, (2) they should cover a diverse range of topics but exclude topics known to be familiar to a wide audience, and (3) the texts should not require extensive additional context for comprehension. For each of the twelve documents, we generated an automatic simplification at the A2 level using the trained model and created four multiple-choice comprehension questions. The first question was always "What is the text mainly about?" (four answer options, one correct); the remaining three questions were about specific details present in the text (three answer options, one correct). We created these questions such that they can be answered based on the original and the manually simplified text, without looking at the automatic simplifications, in order to avoid an unfair bias in favor of the system output. Since the ATS model sometimes erroneously omits information present in the original, the latter three questions have an additional fourth answer option "Information does not appear in the text". Care was taken that the questions are unambiguous, independent of each other (i.e., being able to answer a question was not contingent on getting the correct answer to a previous question), and unanswerable using world knowledge alone. Each question was double-checked for these criteria by two co-authors.

3.2 Participants
We recruited two groups of participants from different populations, described in the following. All participants took part on a voluntary basis and were compensated monetarily.

3.2.1 Target group.
After approval by the institutional ethics review board, we recruited 18 participants from an educational program for persons with intellectual disabilities in Austria. Eight were female and ten were male, and they were aged between 18 and 32 (median: 23) at the time of recruitment. All participants had some form of cognitive impairment (most commonly: autism spectrum disorder, Down syndrome, or developmental delay) and a degree of disability of at least 50% according to regulations concerning the assessment of the degree of disability in Austria. Therefore, these participants represent a primary target group for simplified language. All participants were legally allowed to sign the consent forms themselves.
In a questionnaire, which all participants filled in before the first session, seven participants stated that they read texts in simplified language at least once per week, and five at least once per month. A total of 17 stated that they used a touchscreen device on a daily basis; one person used one only weekly. Three participants did not list German as their native language, but all had completed compulsory education in German and were proficient at CEFR level A2 or higher.

3.2.2 Control group.
To compare the effects of text simplification on people outside the primary target groups, we additionally recruited 18 people without cognitive impairment (mostly current or former students) through university mailing lists. Twelve were female, six were male, and they were aged between 20 and 36 (median: 25). All were native German speakers.
Unlike in the target group, most participants in the control group were not used to reading simplified language (only 6 participants indicated reading simplified texts at least once per month). However, the information and consent forms which the participants received before the study were written in A2 simplified language to establish a basic level of familiarity.

3.3 Procedure
All experiments were conducted using the Okra app ([57]; version 0.3.1-alpha) on Apple iPads (9.7-inch). Okra is an app for conducting reading experiments on mobile touchscreen devices, and it was specifically designed for and tested with users with intellectual disabilities. For instance, it reduces the complexity of the user interface and the amount of text on screen in order to decrease cognitive load [57].
Each participant took part in three sessions on separate days. The target group sessions took place at the facilities of the educational program, the control group sessions in a university seminar room. Each session consisted of reading tasks, and two sessions also included cognitive tasks. The app presented all instructions and guided participants through the entire session such that several participants could be tested simultaneously without interruptions. Each control group session included up to 12 participants, whereas for the target group, only up to 5 individuals participated per session. This was intended to provide better support in case of problems and to shorten waiting times, as reading speeds varied widely in the target group. One or two test administrators were present in the room and available for questions.
Before the main study, we conducted a usability test with 3 people from the same educational program to improve the usability and accessibility of the instructions and tasks implemented in the app. After finalizing the material and procedure, we piloted the entire experiment with 3 participants from the target group. Participants in the usability test and the pilot study were not recruited for the main study.

3.3.1 Cognitive tasks.
We included a total of four tasks testing several low-level cognitive skills related to reading. The purpose of these tasks was to provide a basic understanding of some of the differences between the two groups and the heterogeneity within each group. The tests we used were adaptations of tasks commonly used in psychological research (see references below). We adapted the tasks to the target group (by adjusting the difficulty and number of trials based on results from the usability test) and to the technical setup in the present study (by making the interface usable on a touchscreen).
• Digit span: Memorizing and repeating an increasingly long sequence of digits in the same order [62]; two trials, each of which ended after two consecutive mistakes; measurement: longest correctly repeated sequence. This task tests short-term memory, sequencing ability, attention, and automated learning [62].
• Lexical decision: Deciding as quickly as possible whether the displayed strings of characters are words or pseudowords [41]; 37 stimuli; measurements: reaction time on correctly recognized words, ratio of correct responses. This task tests vocabulary knowledge and lexical access.
• Reaction time: Tapping randomly appearing balloons as quickly as possible; 15 stimuli; measurement: mean time between stimulus appearance and tap. Apart from motor aspects, reaction time also depends on cognitive factors such as visual processing speed and attention [6].
• Trail making: Tapping randomly positioned numbers in ascending order as quickly as possible [45]; 3 trials; measurement: mean time between taps. We only included part A of the trail making task, which primarily assesses visual attention and psychomotor speed [7].

Each task was preceded by a practice task, which participants could optionally repeat and whose results were excluded from the analysis.
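As a concrete illustration, the digit span scoring rule described above (the score is the longest correctly repeated sequence, and a trial ends after two consecutive mistakes) can be sketched as follows. The input format is an assumption for illustration, not Okra's actual log structure:

```python
def digit_span_score(trial_results):
    """Score a digit span trial.

    `trial_results` is a list of (sequence_length, correct) tuples
    in presentation order (hypothetical format). The trial ends after
    two consecutive mistakes; the score is the longest sequence
    repeated correctly up to that point.
    """
    best = 0
    consecutive_errors = 0
    for length, correct in trial_results:
        if correct:
            best = max(best, length)
            consecutive_errors = 0
        else:
            consecutive_errors += 1
            if consecutive_errors == 2:
                break  # trial ends; later responses are not counted
    return best

# A participant who repeats sequences of 3 and 4 correctly and then
# fails twice at length 5 scores 4.
score = digit_span_score([(3, True), (4, True), (5, False), (5, False), (6, True)])
```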
3.3.2 Reading tasks.
Participants read four texts per session. The texts were presented in one of three versions (original, manually simplified, automatically simplified). No participant read the same text in more than one version. The design was counterbalanced so that all texts in all versions were read by the same number of participants in both groups. After reading a text, participants were asked to rate its difficulty on a 5-point scale (1 = very difficult, 5 = very easy), with the scale levels marked with textual labels, colors, and emoticons. The text was then displayed again, along with the comprehension questions. Only one of the four questions was shown at a time, and participants could switch back and forth between questions until they submitted their final answers. The screenshots in Figure 1 show this procedure for one text. After finishing a text, participants were asked to take a break if necessary, and then continue with the next text.
Apart from the responses, we also recorded timestamped user interactions such as reading times and scrolling interactions. In the present paper, we will focus on the following measurements:
• Responses to comprehension questions [with our assessment: correct/incorrect]
• Responses to text difficulty ratings [1-5]
• Time taken to answer each question, i.e., the total time during which the question was visible to the participant [seconds]
• Reading speed when initially reading the text [words per minute, WPM]

Since the ATS model sometimes does not transfer all information accurately and we designed the comprehension questions without looking at the ATS output, the correct answers for the automatic simplifications could differ from those for the other versions. For example, the ATS model at times deleted a sentence from the original which included information relevant for answering a question, changing the correct answer for this question with respect to the automatically simplified text to "Information does not appear in the text". Therefore, we manually recoded the answer correctness for the automatically simplified texts. We removed instances where the correct answer in the automatic simplification was "Information does not appear in the text" in order not to give the ATS model an unfair advantage in the analysis. In total, we removed 9 out of 48 questions from the results of the automatically simplified texts.
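As a minimal sketch of how the reading speed measurement can be derived from timestamped interaction logs, WPM is the text's word count divided by the time the text was initially displayed. Field names and the millisecond timestamps here are illustrative assumptions, not Okra's actual log format:

```python
def reading_speed_wpm(word_count, start_ms, end_ms):
    """Words per minute for the initial reading of a text, given the
    timestamps (in milliseconds) at which the text appeared and at
    which the participant moved on to the rating screen."""
    minutes = (end_ms - start_ms) / 60_000
    return word_count / minutes

# A 300-word text read in 90 seconds corresponds to 200 WPM.
speed = reading_speed_wpm(300, start_ms=0, end_ms=90_000)
```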

3.4 Statistical analysis
When analyzing responses to comprehension questions or ratings, we took into account that some participants may be more proficient than others, and some questions may be more difficult to answer than others. To model these differences, we analyzed our data using the Rasch model, also called the one-parameter logistic model in item response theory (IRT). These models are widely used in language assessment and psychometrics, and formally comparable to generalized linear models with (fixed or random) effects for persons and items [20, p. 143-145][16]. Whereas the classic Rasch model only considers persons and items in the analysis, the many-facets Rasch model allowed us to also take additional parameters (so-called facets) into account in the modeling of the data [34]. As we were interested in the effects of the text version (original, manually simplified, automatically simplified) on participants' performance, we specified a many-facets Rasch model with three facets (persons, items, and text version). We used the estimated parameter values (the "latent traits") of the text version facet to compare the effect of manual and automatic text simplification.
We applied a dichotomous Rasch model for the comprehension questions [20, p. 7-9] (equivalent to logistic regression) and a graded response model for the difficulty ratings [52]. For modeling response time and reading speed, we used log-linear regression models as in [59] and [20, p. 228-231], fitting person, item, and text version parameters in the same way as for the Rasch models. All models are defined in Table 1.
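To illustrate the dichotomous case with a text version facet, the response probability is a logistic function of person ability minus item difficulty minus a version facet. The sign conventions below are one common choice and serve as a sketch only; the exact parameterizations we used are defined in Table 1:

```python
import math

def p_correct(theta_person, b_item, d_version):
    """Many-facets Rasch model (sketch): probability of a correct
    answer given person ability (theta), item difficulty (b), and a
    text-version difficulty facet (d)."""
    logit = theta_person - b_item - d_version
    return 1.0 / (1.0 + math.exp(-logit))

# A person of average ability on an average item has a 50% chance of
# answering correctly; an easier text version (negative facet value)
# raises that chance.
p_base = p_correct(0.0, 0.0, 0.0)
p_easy = p_correct(0.0, 0.0, -1.0)
```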
We used Bayesian inference with a Markov chain Monte Carlo (MCMC) algorithm for fitting the models. This has several advantages compared to frequentist statistics: We get posterior distributions for parameter values, which provide more information than point estimates, it allows including prior knowledge, and Bayesian models are usually more accurate for complex IRT models and small sample sizes [20, p. 2][21][63]. We defined wide normal distributions as priors for person, question/document, and text version parameters (cf. Table 2). For each measurement, we fit two separate models for the target and control groups, since we did not want to generalize across the different populations they are sampled from.
We used Stan [12] with the PyStan interface [47] for sampling and ArviZ [31] for analysis. For MCMC, we used 4 chains with 2000 iterations, including 1000 warmup iterations. Model code and convergence diagnostics are published in the supplementary material.

RESULTS
Anonymized data and code for reproducing the analyses are available in the supplementary material. Five participants did not consent to publishing their anonymized raw data; therefore, the data for these participants is not included in the supplementary material. The numbers and plots in the paper are based on the complete data.

4.1 Cognitive tasks
Figure 2 compares the measurements from the cognitive tasks between the target and control groups. The largest difference is in the digit span task for measuring working memory, with median scores of 4.5 for the target group and 7 for the control group. We also measured a longer reaction time in the lexical decision task, longer reaction times in general, and slower trail making in the target group. Moreover, variability in the target group is generally much higher than in the control group, which is likely to affect results in reading behavior and comprehension [26].

4.2 Reading tasks
In total, 1680 responses to comprehension questions (excluding the 108 responses to unanswerable questions in the automatically simplified versions, see Section 3.3.2) and 432 difficulty ratings are included in the analysis.
The estimated effects of the three text versions (original, manually simplified, automatically simplified) on the four measurements are visualized in Figure 3. Effects are centered around zero, and parameters for the two groups were estimated independently (as explained in Section 3.4); therefore, the estimates cannot be compared across groups. We calculate the distribution of the difference between the three text version parameters at each MCMC sampling step and use highest density intervals (HDI) to quantify the credibility of the difference between the text version effects.

Figure 2: Boxplots of the measurements from the cognitive tasks, compared between target and control group. Each data point is the measured values for a single participant aggregated across all trials/stimuli (maximum for digit span, mean for all others), excluding practice trials.
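A simple sample-based version of this interval computation can be sketched as follows (in practice we used ArviZ, whose HDI routine is more robust). The function finds the narrowest interval containing a given share of the posterior samples:

```python
def hdi(samples, prob=0.95):
    """Narrowest interval containing `prob` of the samples: sort the
    draws, slide a window covering prob*n of them, and return the
    window with the smallest width."""
    x = sorted(samples)
    n = len(x)
    k = int(prob * n)
    widths = [(x[i + k] - x[i], i) for i in range(n - k)]
    _, i = min(widths)
    return x[i], x[i + k]

# Applied to the per-step differences between two text version
# parameters, an HDI that excludes zero indicates a credible
# difference between the versions.
```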

4.2.1 Comprehension questions.
Overall, the target group answered 47.5% of the questions correctly, whereas the control group answered 92.8% correctly. In the control group, 25 questions were answered correctly by all participants, and one participant answered all 48 questions correctly. In other words, in the control group, about half of the questions were uninformative because they were too easy and therefore unable to discriminate between more and less proficient readers and between more and less difficult text versions. This ceiling effect means that the parameter estimates of the Rasch model are less precise in the control group, as the wider credible intervals in Figure 3a show. Still, the estimated difficulty of the manual simplifications is measurably lower than that of both the originals (CI 95% = [0.15, 1.61]) and the automatic simplifications (CI 95% = [0.49, 2.01]), meaning that participants had a significantly higher probability of answering questions correctly with the manually simplified version. The automatic simplifications appear to have been slightly more difficult than the originals (CI 80% = [0.01, 0.86]).
In the target group, the effects are less pronounced, with the original being the most difficult and the manual simplification the least difficult.
4.2.2 Perceived difficulty ratings.
In Figure 3b, again, the differences between text versions are much smaller in the target group than in the control group. The target group seems to rate the automatically simplified texts slightly easier than the originals (CI 90% = [0.08, 1.06]), whereas the control group rated the automatically simplified texts on par with the unsimplified ones. The control group had a strong tendency to rate the manually simplified texts as less difficult than the originals (CI 95% = [1.22, 2.58]) and the automatically simplified texts (CI 95% = [1.26, 2.60]).

4.2.3 Comprehension question response times.
In response time models, a larger effect means a longer response time, which is generally associated with higher item difficulty in tests [59]. In the target group, we can observe from Figure 3c that manual simplifications led to slightly faster response times (CI 80% = [0.02, 0.18]), while the automatic simplifications are on par with the originals.
In the control group, the differences are even stronger, and the automatic simplifications appear to have been the most difficult. The effects on response time (Figure 3c) look very similar to the effects on response accuracy (Figure 3a). This is in line with psychological research on test design [59], but in our case, the observations from the two groups of participants do not agree on the relative difficulty of the automatically simplified texts.

(Figure 3 legend: A bracket with ▲ indicates that the 80% CI of the difference between the two parameters does not include zero, i.e., we are 80% confident that there is a difference; similarly, ▲▲ for the 90% CI and ▲▲▲ for the 95% CI.)

4.2.4 Reading speed.
In terms of reading speed, the behavior of the target group was much more variable and less predictable, as is evident from Figure 3d. Some participants had implausible reading speeds of up to thousands of words per minute, meaning that many only skimmed or even skipped reading the text the first time it was displayed. We found that a small number of target group participants had a stronger tendency towards skimming or skipping, but most of them did not do so consistently, and reading speeds were not distributed bimodally, so there was no obvious threshold to discriminate between reading and skipping. The slowest reading speeds (50 WPM and slower) were also observed in the target group. Mean reading speeds were 203 WPM in the target group and 168 WPM in the control group. For comparison, a standardized assessment of reading speed reported a mean of 179 WPM for native German speakers [58].

DISCUSSION
The primary goal of this study was to investigate four different measurement methods, comparing them with regard to two different text simplification methods and two different reader groups, with the ultimate aim of improving methods for human evaluation of ATS. We will discuss these aspects mainly based on the results in Figure 3. The purpose of the cognitive tasks was to characterize the participant groups to support the interpretation of results; therefore, we will not discuss them in further detail here.

Comparison of measurement methods and reader groups
By design, there are several fundamental differences between the four measurement methods: Comprehension questions measure objective comprehension, while difficulty ratings measure subjective perception. Measurements such as response time and reading speed can only serve as proxies for comprehension and require specific assumptions about the behavior of participants. For any of these measurements to be considered suitable for evaluating text simplification, it needs to be able to capture a difference between less comprehensible and more comprehensible texts. Since the manually simplified texts were professionally edited by trained experts and according to guidelines developed and checked with target readers, it is safe to assume that there should be some measurable difference in comprehensibility between original and manually simplified texts. From this perspective, our results suggest that the measurement of comprehension question response accuracy was most successful, and perceived difficulty ratings were least successful with the target group. For the control group, all measurements except reading speed were successful in differentiating between original and manually simplified texts.

There are two possible factors which may explain why ratings were less reliable than comprehension questions for the target group: First, the target group was quite heterogeneous (as evidenced by the cognitive tasks), which led to larger differences in subjective judgments of texts, especially because we did not give more specific instructions to calibrate ratings in order to reduce cognitive load. Second, when readers lose motivation and stop reading, which happened in the target group, rating responses may be more random, whereas responses to comprehension questions will reliably show a random-guessing accuracy. Both of these may be arguments against using perceived difficulty ratings with the target group.
Familiarity is a confounding factor: participants in the target group were mostly very familiar with the specific variety of simplified language used in the study, which may have biased their perception of the texts. However, if this bias were strong, we would expect a larger effect in the perceived difficulty ratings compared to the control group participants, who were mostly unfamiliar with simplified language.
While previous work heavily relied on ratings for evaluating the comprehensibility of simplified texts (see Section 2.1), our results show that this is not always sufficient, especially for readers with intellectual disabilities. The results also revealed significant differences in comprehensibility and perception between persons with and without intellectual disabilities, highlighting the importance of including the primary target groups in studies on text simplification and bridging the gap to insights from psycholinguistic research (see Section 2.2).
Although reading speed was mostly unsuccessful in discriminating between text versions, it revealed important behavioral patterns in the target group (skimming/skipping texts), which supports the interpretation of other results. Previous work has suggested that reading time is to be considered separately from comprehension [61]. Our observations support this view, but our interpretation is limited by our study design: Since the text was shown again after the initial reading, participants were free to decide not to read the entire text the first time around.
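Detecting skimming or skipping from logged reading times can be sketched in a few lines. The word counts, reading times, and the words-per-minute threshold below are illustrative assumptions, not values or thresholds used in the study:

```python
def words_per_minute(word_count: int, reading_seconds: float) -> float:
    """Reading speed derived from text length and logged reading time."""
    return word_count / (reading_seconds / 60)

# Assumed upper bound for attentive reading; anything faster is
# treated as likely skimming or skipping.
SKIM_THRESHOLD_WPM = 400

sessions = [
    {"words": 150, "seconds": 60.0},  # 150 wpm: plausible attentive reading
    {"words": 150, "seconds": 10.0},  # 900 wpm: likely skimmed/skipped
]
for s in sessions:
    wpm = words_per_minute(s["words"], s["seconds"])
    s["skimmed"] = wpm > SKIM_THRESHOLD_WPM
    print(f"{wpm:.0f} wpm, skimmed={s['skimmed']}")
```

Flagging such sessions separately keeps implausibly fast "readings" from distorting comprehension and rating analyses.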
Overall, the fact that reading behavior can be measured through a mobile application is a major advantage of using digital evaluation tools such as the one described in this paper compared to paper-and-pencil assessment. Our work represents a first step towards exploiting this advantage to make comprehensibility assessment more inclusive (see Section 2.3).

Effect of automatic simplification
We have seen that manual simplification resulted in noticeable differences for most measures. By comparing the difficulty estimates for the automatically simplified texts to those for the original and manually simplified texts, we can evaluate the ATS output in terms of comprehensibility.
Based on the target group measurements, automatic simplification only had a modest effect. The largest improvement compared to the original texts was observed in the perceived difficulty ratings, which were generally less reliable with this group, as discussed in Section 5.1. However, in terms of reading speed, ATS had the same effect as manual simplification, which suggests that ATS was somewhat successful in keeping up motivation to continue reading for the target group. A possible explanation for this is that at the surface level, the texts looked more like the simplified texts the participants were familiar with.
In the control group, all measurements agreed that ATS outputs were equally difficult as or more difficult than the original text. Apart from a lack of quality in the automatic simplifications, several factors may have contributed to these results: First, as described in Section 3.1, the comprehension questions were written and optimized for the original and manually simplified texts. Although we removed responses to questions which were not answerable based on the ATS output, the wording of the questions may still have made them more difficult for the automatically simplified texts. Second, the control group may be more perceptive of, or sensitive to, grammatical and semantic errors in the text than the target group. This is evidenced by the control group's higher difficulty ratings and lower reading speed for the automatically simplified texts.
The second factor in particular requires further experimental research, as it could have a major influence on human evaluation of text simplification: If the linguistic fluency of ATS output has only a weak influence on comprehension in the primary target groups of simplified texts, this should be accounted for when evaluating ATS systems. In addition, different types of linguistic errors may affect reading behavior differently depending on the reader's specific type of cognitive impairment [42].
In the present work, we focused on the evaluation methodology rather than on pinpointing specific problems in the ATS output. However, our findings on the comprehensibility of automatically simplified texts are in line with current research showing that ATS systems are still quite limited in the effective simplicity gains they can achieve [49]. Recent experiments with large instruction-tuned language models have already suggested significant improvements in this regard, and these models would likely outperform ours [28, 37]. It is all the more important that these improvements are evaluated with primary target reader groups in the future.

CONCLUSION
We conducted a study exploring different ways of measuring text comprehensibility using a mobile application and investigating the effect of manual and automatic text simplification on comprehension, including participants with and without intellectual disabilities. The results revealed several types of differences which must be taken into account when designing human evaluation studies:
• Differences between measurement methods: Comprehension questions, difficulty ratings, and behavioral measurements can lead to different conclusions and complement each other when combined.
• Differences between manual and automatic simplification: Issues in the ATS output may significantly impair objective comprehension without affecting subjective perception (especially in the target group), whereas manually simplified texts lead to more predictable results across measurement methods.
• Differences between reader groups: Results from persons with intellectual disabilities can differ from (or even contradict) those of persons without disabilities, particularly in terms of reading behavior and subjective perception of difficulty.
We consider measuring interactions of users reading on touchscreen devices to be a promising approach, especially for assessing comprehensibility with diverse target groups, as traditional tests based on comprehension questions can be cognitively demanding. Another advantage is that this approach allows assessing reading behavior in a more natural environment. However, further research is still required on other aspects of human-computer interaction, e.g., regarding the exact relationship between user interactions and text comprehension, the ways in which interactions differ between persons with and without intellectual disabilities, and how they can be used to design more reliable and accessible comprehensibility assessments with diverse user groups.
Overall, we show that applying digital assessment methods for comprehensibility evaluation to persons with intellectual disabilities is viable, and that combining subjective and objective comprehensibility assessment with behavioral measurements provides valuable insights into the impact of text simplification.

Figure 1: Screenshots of the reading task in Okra. (1) Initial reading screen, where only the text is visible. (2) Text difficulty rating screen. (3) Comprehension question screen; tapping the arrow buttons switches between questions.

Figure 3: Posterior distributions of the text version parameters for the four measurements in the reading task. Points are medians; error bars are 80%, 90%, and 95% credible intervals (CIs). A bracket with ▲ indicates that the 80% CI of the difference between the two parameters does not include zero (i.e., we are 80% confident that there is a difference); similarly, ▲▲ for the 90% CI and ▲▲▲ for the 95% CI.
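The significance markers described in the caption can be reproduced from posterior samples with a short computation. The following is a sketch in which simulated normal draws stand in for the actual posterior samples; the means, spreads, and sample size are illustrative assumptions, not values from the study:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior draws for two text-version parameters
# (e.g., original vs. manually simplified), standing in for real
# MCMC samples.
post_orig = rng.normal(0.0, 0.1, size=4000)
post_simp = rng.normal(0.3, 0.1, size=4000)

def credible_interval(samples, level):
    """Equal-tailed credible interval at the given probability level."""
    alpha = (1 - level) / 2
    return np.quantile(samples, [alpha, 1 - alpha])

# Work with the posterior of the *difference* between the parameters.
diff = post_simp - post_orig
for level, marker in [(0.80, "▲"), (0.90, "▲▲"), (0.95, "▲▲▲")]:
    lo, hi = credible_interval(diff, level)
    if lo > 0 or hi < 0:  # CI excludes zero -> credible difference
        print(f"{marker} {level:.0%} CI excludes zero: [{lo:.3f}, {hi:.3f}]")
```

Computing the interval on the difference of draws, rather than comparing two marginal intervals, is what the caption's "CI of the difference" refers to.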