research-article

Neural Arabic Text Diacritization: State-of-the-Art Results and a Novel Approach for Arabic NLP Downstream Tasks

Published: 24 December 2021

Abstract

In this work, we present several deep learning models for the automatic diacritization of Arabic text. Our models are built using two main approaches, viz. Feed-Forward Neural Network (FFNN) and Recurrent Neural Network (RNN), with several enhancements such as 100-hot encoding, embeddings, Conditional Random Field (CRF), and Block-Normalized Gradient (BNG). The models are tested on the only freely available benchmark dataset, and the results show that they are better than or on par with other models, including those that, unlike ours, require human-crafted, language-dependent post-processing steps. Moreover, we show how diacritics in Arabic can be used to enhance models for downstream NLP tasks such as Machine Translation (MT) and Sentiment Analysis (SA) by proposing novel Translation over Diacritization (ToD) and Sentiment over Diacritization (SoD) approaches.
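
As an illustration of the task itself, the sketch below shows one common way to frame diacritization as per-character labeling: a diacritized string is split into bare characters and one label per character, which is how training pairs can be derived from diacritized text. It is not taken from the paper. The helper names `split_diacritics` and `restore_diacritics` are hypothetical, and the sketch assumes only the eight basic Arabic diacritic marks in the Unicode range U+064B to U+0652.

```python
# Arabic diacritic marks (tashkeel): fathatan through sukun, U+064B..U+0652.
# Assumption: only these eight marks are treated as diacritics in this sketch.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split text into bare characters and one diacritic label per character.

    A label is the (possibly empty) string of marks that follow a character,
    so a shadda followed by a fatha forms a single two-mark label.
    """
    bare_chars = []
    labels = []
    for ch in text:
        if ch in ARABIC_DIACRITICS:
            # Attach the mark to the preceding base character, if any;
            # a mark with no base character before it is dropped.
            if labels:
                labels[-1] += ch
        else:
            bare_chars.append(ch)
            labels.append("")
    return bare_chars, labels


def restore_diacritics(bare_chars, labels):
    """Inverse of split_diacritics: re-attach each label to its character."""
    return "".join(ch + label for ch, label in zip(bare_chars, labels))
```

Under this framing, a model would receive `bare_chars` as input and predict `labels`; calling `restore_diacritics` on its predictions yields the diacritized text.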

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 1
      January 2022
      442 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3494068

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 December 2021
      • Accepted: 1 June 2021
      • Revised: 1 February 2021
      • Received: 1 August 2020

      Qualifiers

      • research-article
      • Refereed