Abstract
Automatic diacritization is an Arabic natural language processing task that can be framed as sequence labeling: the letters are the sequence elements and the diacritics are the labels. A letter can carry from zero to two diacritics. We used a subset of the preprocessed version of the Tashkeela corpus. We developed a deep learning model composed of a stack of four bidirectional long short-term memory (BiLSTM) hidden layers of the same size, with an output layer at every level. The levels correspond to the four groups into which we classified the diacritics (short vowels, double case-endings, Shadda, and Sukoon). Before training, the data were divided into input vectors containing letter indexes and output vectors containing the diacritic indexes for each group. The input and output vectors are concatenated, and an overlapping sliding window is then applied to generate continuous, fixed-size data, which are used for both training and evaluation. Finally, we ran tests using the standard metrics in all of their variations and compared our results with two recent state-of-the-art works. Our model achieved a 3% diacritization error rate and an 8.99% word error rate when all letters are included. We also generated a confusion matrix to show the performance per output, and analyzed the mismatches in the first 500 lines to classify the model's errors according to their linguistic nature.
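The overlapping sliding window step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sliding_windows`, the window size, the stride, and the padding index `0` are all assumptions chosen for illustration, and each letter carries a single label here rather than one label per diacritic group.

```python
from typing import List, Sequence, Tuple

Window = Tuple[List[int], List[int]]


def sliding_windows(
    letters: Sequence[int],
    labels: Sequence[int],
    size: int,
    stride: int,
    pad: int = 0,  # assumed padding index, not taken from the paper
) -> List[Window]:
    """Cut paired letter/label index sequences into fixed-size windows.

    Consecutive windows overlap by (size - stride) positions. The last
    window is right-padded so that every window has exactly `size` items.
    """
    if len(letters) != len(labels):
        raise ValueError("letters and labels must have the same length")
    if not 0 < stride <= size:
        raise ValueError("stride must satisfy 0 < stride <= size")
    if not letters:
        return []

    windows: List[Window] = []
    start = 0
    while True:
        x = list(letters[start:start + size])
        y = list(labels[start:start + size])
        # Pad the final, possibly shorter, window up to the fixed size.
        x += [pad] * (size - len(x))
        y += [pad] * (size - len(y))
        windows.append((x, y))
        if start + size >= len(letters):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows


# Hypothetical letter indexes and diacritic indexes for a 7-letter sequence.
letter_ids = [1, 2, 3, 4, 5, 6, 7]
diacritic_ids = [0, 1, 0, 2, 0, 1, 0]
result = sliding_windows(letter_ids, diacritic_ids, size=4, stride=2)
```

With a stride smaller than the window size, each letter away from the sequence edges appears in more than one window, so the model sees it with different amounts of left and right context.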
Simple Extensible Deep Learning Model for Automatic Arabic Diacritization