research-article

Arabic Diacritic Recovery Using a Feature-rich biLSTM Model

Published: 15 April 2021

Abstract

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to pronounce words correctly. There are two types of Arabic diacritics: core-word diacritics (CW), which specify the lexical selection, and case endings (CE), which typically appear at the end of word stems and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics because of inter-word dependencies, which are often distant. In this article, we use a feature-rich recurrent neural network model that draws on a variety of linguistic and surface-level features to recover both core-word diacritics and case endings. Our model surpasses all previous state-of-the-art systems, with a CW error rate (CWER) of 2.9% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA), and a CWER of 2.2% and a CEER of 2.5% for Classical Arabic (CA). When diacritized word cores are combined with case endings, the resulting word error rates are 6.0% and 4.3% for MSA and CA, respectively. This highlights the effectiveness of feature engineering for such deep neural models.
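As a rough illustration of how the three error rates in the abstract relate to one another, the sketch below computes CWER, CEER, and a combined word error rate on toy data. It assumes that each word is stored as a (core diacritization, case ending) pair and that a word counts as wrong if either part is wrong; the abstract does not spell out this representation or scoring rule, and the function name `error_rates` and the transliterated toy strings are hypothetical. Under that assumption, the combined rate can never exceed CWER + CEER, which is consistent with the reported 6.0% against 2.9% + 3.7% for MSA.

```python
from typing import List, Tuple

# Assumption for illustration only: a word is a (core_diacritization, case_ending)
# pair. This is not the data format used in the article.
Word = Tuple[str, str]


def error_rates(gold: List[Word], pred: List[Word]) -> Tuple[float, float, float]:
    """Return (CWER, CEER, WER) as fractions of the number of words.

    A word is counted as a word error if its core OR its case ending differs
    from the gold annotation (an assumed scoring rule).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of words")
    if not gold:
        raise ValueError("cannot compute error rates on an empty sequence")

    n = len(gold)
    cw_errors = sum(g[0] != p[0] for g, p in zip(gold, pred))  # core-word mismatches
    ce_errors = sum(g[1] != p[1] for g, p in zip(gold, pred))  # case-ending mismatches
    word_errors = sum(g != p for g, p in zip(gold, pred))      # either part mismatched
    return cw_errors / n, ce_errors / n, word_errors / n


# Toy data in a made-up Latin transliteration, not taken from the article.
gold = [("kitAb", "u"), ("jadiyd", "N"), ("fiy", ""), ("bayt", "i")]
pred = [("kitAb", "a"), ("jadiyd", "N"), ("fiy", ""), ("buyt", "i")]

cwer, ceer, wer = error_rates(gold, pred)
print(f"CWER={cwer:.2%}  CEER={ceer:.2%}  WER={wer:.2%}")
```

In this toy case the core error and the case-ending error fall on different words, so the combined rate equals the sum of the two; when both errors fall on the same word, the combined rate is smaller than the sum.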

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 2
      March 2021
      313 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3454116

      Copyright © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 15 April 2021
      • Revised: 1 November 2020
      • Accepted: 1 November 2020
      • Received: 1 February 2020
      Qualifiers

      • research-article
      • Refereed
