research-article

Simple Extensible Deep Learning Model for Automatic Arabic Diacritization

Published: 18 November 2021

Abstract

Automatic diacritization is an Arabic natural language processing task that can be framed as sequence labeling, where the letters are the sequence elements and the diacritics are the labels. A letter can carry from zero to two diacritics. The dataset used is a subset of the preprocessed version of the Tashkeela corpus. We developed a deep learning model composed of a stack of four bidirectional long short-term memory hidden layers of the same size, with an output layer at every level. The levels correspond to the groups into which we classified the diacritics (short vowels, double case-endings, Shadda, and Sukoon). Before training, the data are divided into input vectors containing letter indexes and output vectors containing the indexes of the diacritics within their groups. Both the input and output vectors are concatenated, and an overlapping sliding window is then applied to generate continuous, fixed-size data. These data are used for both training and evaluation. Finally, we run tests using the standard metrics in all of their variations and compare our results with two recent state-of-the-art works. Our model achieved a 3% diacritization error rate and an 8.99% word error rate when including all letters. We also generated the confusion matrix to show the performance per output and analyzed the mismatches in the first 500 lines to classify the model's errors according to their linguistic nature.
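Two steps named in the abstract can be illustrated with a short sketch: the overlapping sliding window that produces fixed-size data, and the two headline metrics (diacritization error rate and word error rate). The abstract gives no window size, step, padding value, or exact metric variant, so the function names `sliding_windows` and `error_rates`, their parameters, and the reading that "concatenated" means joining the per-line vectors into one long sequence are all assumptions made for illustration. This is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def sliding_windows(seq: Sequence[int], size: int, step: int, pad: int = 0) -> List[List[int]]:
    """Cut one long index sequence into fixed-size, overlapping windows.

    `size`, `step`, and `pad` are hypothetical parameters, not values
    from the paper. A step smaller than the size makes windows overlap.
    """
    if size <= 0 or not 0 < step <= size:
        raise ValueError("need size > 0 and 0 < step <= size")
    if len(seq) == 0:
        return []
    windows = []
    start = 0
    while True:
        chunk = list(seq[start:start + size])
        chunk += [pad] * (size - len(chunk))  # right-pad the final window
        windows.append(chunk)
        if start + size >= len(seq):  # this window reached the end
            break
        start += step
    return windows


def error_rates(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float]:
    """Return (DER, WER) in percent, counting every letter.

    Each word is a list of per-letter diacritic labels ("" means no
    diacritic). DER is the share of letters with a wrong label; WER is
    the share of words containing at least one wrong label.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    letters = letter_errors = word_errors = 0
    for g_word, p_word in zip(gold, pred):
        if len(g_word) != len(p_word):
            raise ValueError("words must have the same number of letters")
        wrong = sum(g != p for g, p in zip(g_word, p_word))
        letters += len(g_word)
        letter_errors += wrong
        word_errors += wrong > 0
    if letters == 0:
        raise ValueError("no letters to score")
    return 100 * letter_errors / letters, 100 * word_errors / len(gold)


if __name__ == "__main__":
    # Seven letter indexes, windows of 4 with a step of 2 (overlap of 2).
    print(sliding_windows([1, 2, 3, 4, 5, 6, 7], size=4, step=2))
    # -> [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 0]]

    gold = [["a", "", "u"], ["i", "o"]]
    pred = [["a", "", "a"], ["i", "o"]]
    print(error_rates(gold, pred))  # -> (20.0, 50.0)
```

The same windowing would be applied to the letter-index sequence and to each diacritic-group label sequence so that inputs and outputs stay aligned. The metric variants mentioned in the abstract (for example, excluding some letters) would filter positions before counting, which this sketch does not attempt.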


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2
March 2022
413 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3494070

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 18 November 2021
• Accepted: 1 August 2021
• Revised: 1 July 2021
• Received: 1 April 2021


            Qualifiers

            • research-article
            • Refereed
