Abstract
Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of word stems and generally specify their syntactic roles. Recovering CEs is harder than recovering core-word diacritics because of inter-word dependencies, which are often distant. In this article, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core-word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.9% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA) and a CWER of 2.2% and a CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rates are 6.0% and 4.3% for MSA and CA, respectively. This highlights the effectiveness of feature engineering for such deep neural models.
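As a rough illustration of how the three reported error rates relate to one another, the following Python sketch computes CWER, CEER, and a combined word error rate over aligned word lists. It is not the authors' evaluation code: the `(core, case_ending)` word representation, the function name `error_rates`, and the rule that a word counts as wrong when either its core or its case ending is wrong are all assumptions made for this example. Under that rule the combined rate can be lower than the sum of CWER and CEER, because a word with both kinds of error is counted only once, which is consistent with the figures above (for example, 6.0% versus 2.9% + 3.7% for MSA).

```python
from typing import List, Tuple

# Hypothetical representation: each word is a (core_diacritized_form, case_ending) pair.
Word = Tuple[str, str]


def error_rates(gold: List[Word], pred: List[Word]) -> Tuple[float, float, float]:
    """Return (CWER, CEER, WER) as fractions over two aligned word lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned word-for-word")
    if not gold:
        raise ValueError("cannot compute error rates on empty input")

    n = len(gold)
    # Core-word errors: the diacritized core differs from the reference.
    cw_errors = sum(g[0] != p[0] for g, p in zip(gold, pred))
    # Case-ending errors: the case ending differs from the reference.
    ce_errors = sum(g[1] != p[1] for g, p in zip(gold, pred))
    # Assumption: a word is wrong if its core OR its case ending is wrong.
    word_errors = sum(g != p for g, p in zip(gold, pred))
    return cw_errors / n, ce_errors / n, word_errors / n


# Toy data in an ad hoc Latin transliteration (illustrative only).
gold = [("kitAb", "u"), ("walad", "a"), ("bayt", "i"), ("qalam", "u")]
pred = [("kitAb", "u"), ("wuld", "a"), ("bayt", "u"), ("qilm", "a")]

cwer, ceer, wer = error_rates(gold, pred)
print(f"CWER={cwer:.2f} CEER={ceer:.2f} WER={wer:.2f}")
```

In the toy data the last word has both a core error and a case-ending error, so the combined rate stays below the sum of the two component rates.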
Index Terms
Arabic Diacritic Recovery Using a Feature-rich biLSTM Model