Abstract
Text readability assessment is a well-known problem that has acquired even more importance in today’s information-rich world. In this article, we survey various approaches to measuring and assessing the readability of texts. Our specific goal is to provide a perspective on the state of the art in readability assessment research for Arabic, which differs significantly from the languages on which readability studies have tended to focus. We provide background on readability assessment research and tools for English, for which readability studies are the most advanced. We then survey approaches adopted for Arabic, both classical formula-based approaches and studies that combine Machine Learning (ML) with Natural Language Processing (NLP) techniques. The works we cover target text corpora for different audiences: school-age first language (L1) readers, foreign language (L2) learners, and adult readers in non-academic contexts. We therefore explore differences between reading in L1 and L2 and, after describing the characteristics of Arabic that may affect readability, consider how these differences play out specifically in Arabic. Finally, we highlight challenges for Arabic readability research and propose multiple future directions to improve readability assessment and related applications that would benefit from more attention.
Index Terms
Approaches, Methods, and Resources for Assessing the Readability of Arabic Texts