Abstract
In this work, we present several deep learning models for the automatic diacritization of Arabic text. Our models follow two main approaches, Feed-Forward Neural Networks (FFNN) and Recurrent Neural Networks (RNN), with several enhancements such as 100-hot encoding, embeddings, Conditional Random Fields (CRF), and Block-Normalized Gradient (BNG). The models are tested on the only freely available benchmark dataset, and the results show that they are better than or on par with other models, including those that, unlike ours, require human-crafted, language-dependent post-processing steps. Moreover, we show how Arabic diacritics can be used to enhance models for downstream NLP tasks such as Machine Translation (MT) and Sentiment Analysis (SA) by proposing the novel Translation over Diacritization (ToD) and Sentiment over Diacritization (SoD) approaches.
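The abstract does not describe how the text is represented for the models. As a minimal sketch, assuming the common framing of diacritization as per-character labeling (each base character receives zero or more diacritic marks as its label), the following Python shows how diacritized text could be split into model inputs and target labels. The function names `split_diacritics` and `apply_diacritics` are hypothetical and are not taken from the paper; only the Unicode range of the Arabic diacritic marks is standard.

```python
# Arabic diacritic marks (tanween, short vowels, shadda, sukun): U+064B..U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split diacritized text into base characters and per-character labels.

    Illustrative helper, not the authors' preprocessing code.
    labels[i] holds the diacritics attached to chars[i] ("" if none).
    A diacritic with no preceding base character is dropped.
    """
    chars, labels = [], []
    for ch in text:
        if ch in DIACRITICS:
            if chars:
                labels[-1] += ch
        else:
            chars.append(ch)
            labels.append("")
    return chars, labels


def apply_diacritics(chars, labels):
    """Rebuild diacritized text from base characters and predicted labels."""
    return "".join(c + l for c, l in zip(chars, labels))


# The word "kataba": kaf, teh, and beh, each followed by a fatha.
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
chars, labels = split_diacritics(word)
restored = apply_diacritics(chars, labels)
```

Under this framing, a model sees only `chars` and is trained to predict `labels`; `apply_diacritics` turns its predictions back into readable text.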
Neural Arabic Text Diacritization: State-of-the-Art Results and a Novel Approach for Arabic NLP Downstream Tasks