short-paper

Deep Learning for Arabic Error Detection and Correction

Published: 12 August 2020

Abstract

Research on tools for automating the proofreading of Arabic text has received much attention in recent years. There is increasing demand for applications that can detect and correct Arabic spelling and grammatical errors to improve the quality of Arabic text content and application input. Our review of previous studies indicates that few Arabic spell-checking research efforts appropriately address the detection and correction of ill-formed words that do not conform to the Arabic morphology system. Even fewer systems address the detection and correction of erroneous well-formed Arabic words that are contextually or semantically inconsistent with the surrounding text. We introduce an approach that investigates the use of deep neural networks for error detection in Arabic text. We have developed a systematic framework for spelling and grammar error detection, as well as correction, at the word level. The framework is based on a bidirectional long short-term memory mechanism and word embedding, with a polynomial network classifier at the top of the system. To obtain conclusive results, we have developed the most significant gold standard annotated corpus to date, containing 15 million fully inflected Arabic words. The data were collected from diverse text sources and genres, and every erroneous and ill-formed word has been annotated, validated, and manually revised by Arabic specialists. This valuable asset is available to the Arabic natural language processing research community. The experimental results confirm that our proposed system significantly outperforms Microsoft Word 2013 and Open Office Ayaspell 3.4, which have been used in the literature for evaluating similar research.
