Abstract
Research on tools for automating the proofreading of Arabic text has received much attention in recent years. There is an increasing demand for applications that can detect and correct Arabic spelling and grammatical errors to improve the quality of Arabic text content and application input. Our review of previous studies indicates that few Arabic spell-checking research efforts appropriately address the detection and correction of ill-formed words that do not conform to the Arabic morphology system. Even fewer systems address the detection and correction of erroneous well-formed Arabic words that are contextually or semantically inconsistent with the surrounding text. We investigate the use of deep neural networks for error detection in Arabic text. We have developed a systematic framework for spelling and grammar error detection, as well as correction at the word level, based on a bidirectional long short-term memory mechanism and word embedding, with a polynomial network classifier at the top of the system. To obtain conclusive results, we have developed the most significant gold standard annotated corpus to date, containing 15 million fully inflected Arabic words. The data were collected from diverse text sources and genres, and every erroneous and ill-formed word has been annotated, validated, and manually revised by Arabic specialists. This valuable asset is available to the Arabic natural language processing research community. The experimental results confirm that our proposed system significantly outperforms Microsoft Word 2013 and Open Office Ayaspell 3.4, both of which have been used in the literature for evaluating similar research.
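The pipeline named in the abstract (word embedding, bidirectional LSTM, polynomial classifier on top) can be illustrated with a minimal, untrained sketch in plain Python. This is not the authors' implementation: the class names `LSTMCell` and `ErrorDetector`, the layer sizes, the random weights, and the reading of "polynomial network classifier" as a degree-2 feature expansion followed by a sigmoid output are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LSTMCell:
    """One LSTM direction with random, untrained weights (biases omitted)."""

    def __init__(self, rng, input_size, hidden_size):
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.w = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_size)]
            for gate in "ifog"
        }

    def run(self, inputs):
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in inputs:
            z = x + h  # concatenate current input with previous hidden state
            i = [sigmoid(a) for a in matvec(self.w["i"], z)]
            f = [sigmoid(a) for a in matvec(self.w["f"], z)]
            o = [sigmoid(a) for a in matvec(self.w["o"], z)]
            g = [math.tanh(a) for a in matvec(self.w["g"], z)]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs


def polynomial_features(v):
    """Degree-2 expansion: bias, linear terms, and all pairwise products."""
    feats = [1.0] + list(v)
    for a in range(len(v)):
        for b in range(a, len(v)):
            feats.append(v[a] * v[b])
    return feats


class ErrorDetector:
    """Hypothetical word-level detector: embedding -> BiLSTM -> polynomial classifier."""

    def __init__(self, vocab, embed_size=4, hidden_size=3, seed=0):
        rng = random.Random(seed)
        self.embed = {w: [rng.uniform(-0.5, 0.5) for _ in range(embed_size)]
                      for w in vocab}
        self.unknown = [0.0] * embed_size  # out-of-vocabulary words map to zeros
        self.forward = LSTMCell(rng, embed_size, hidden_size)
        self.backward = LSTMCell(rng, embed_size, hidden_size)
        n = 2 * hidden_size
        n_feats = 1 + n + n * (n + 1) // 2
        self.out = [rng.uniform(-0.1, 0.1) for _ in range(n_feats)]

    def predict(self, tokens):
        """Return one error probability per token (meaningless until trained)."""
        xs = [self.embed.get(t, self.unknown) for t in tokens]
        left_to_right = self.forward.run(xs)
        right_to_left = self.backward.run(xs[::-1])[::-1]
        probs = []
        for hf, hb in zip(left_to_right, right_to_left):
            feats = polynomial_features(hf + hb)
            probs.append(sigmoid(sum(w * x for w, x in zip(self.out, feats))))
        return probs


if __name__ == "__main__":
    detector = ErrorDetector(["ذهب", "الولد", "إلى", "المدرسة"])
    print(detector.predict(["ذهب", "الولد", "إلى", "المدرسه"]))
```

Because each token's score depends on both the left-to-right and right-to-left hidden states, the classifier sees context on both sides of a word, which is what would allow a trained model of this shape to flag well-formed words that are wrong only in context. The weights here are random, so the printed probabilities carry no meaning.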
Index Terms
Deep Learning for Arabic Error Detection and Correction