Abstract
We annotate 60,000 words of Classical Arabic (CA), drawn from texts on philosophy, religion, literature, and law, with fine-grained, segment-based morphological descriptions. We use these annotations to build a morphological segmenter and a part-of-speech (POS) tagger for CA. Using character-level classification with features from the word and its lexical context, the segmenter achieves a word accuracy of 96.8%; its main source of difficulty is a high rate of out-of-vocabulary words. A token-based POS tagger achieves an accuracy of 96.22% overall and 97.72% on known tokens, despite the small size of the corpus. An error analysis shows that most tagging errors result from segmentation errors and that quality improves as more data is added. The morphological segmenter and tagger have a wide range of potential applications in processing CA, a low-resource variety of the language.
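To make the idea of character-level classification concrete, the following is a minimal sketch of how segmentation can be framed as labeling each character of a word. It is not the authors' implementation: the B/I label scheme, the feature names, the window size, and the Latin-transliterated example word are all assumptions chosen for illustration, and the classifier itself is left out.

```python
from typing import Dict, List


def boundary_labels(segments: List[str]) -> List[str]:
    """Label each character 'B' if it begins a segment, otherwise 'I'.

    The B/I scheme is an assumption for illustration only.
    """
    labels: List[str] = []
    for seg in segments:
        if not seg:  # skip empty segments defensively
            continue
        labels.extend(["B"] + ["I"] * (len(seg) - 1))
    return labels


def char_features(words: List[str], w: int, c: int, window: int = 2) -> Dict[str, str]:
    """Build features for character c of word w (hypothetical feature set).

    Combines the focus character, its neighbors inside the word, the word
    itself, and the neighboring words as lexical context.
    """
    word = words[w]
    feats = {
        "char": word[c],
        "word": word,
        "prev_word": words[w - 1] if w > 0 else "<s>",
        "next_word": words[w + 1] if w + 1 < len(words) else "</s>",
    }
    for off in range(1, window + 1):
        feats[f"char-{off}"] = word[c - off] if c - off >= 0 else "_"
        feats[f"char+{off}"] = word[c + off] if c + off < len(word) else "_"
    return feats


def apply_labels(word: str, labels: List[str]) -> List[str]:
    """Rebuild segments from per-character labels predicted by a classifier."""
    segments: List[str] = []
    for ch, lab in zip(word, labels):
        if lab == "B" or not segments:
            segments.append(ch)
        else:
            segments[-1] += ch
    return segments


# Illustrative word in Latin transliteration, segmented as w + b + Al + ktAb.
gold = ["w", "b", "Al", "ktAb"]
labels = boundary_labels(gold)
restored = apply_labels("".join(gold), labels)
```

In a setup like this, any classifier that accepts per-character feature dictionaries could be trained on the labels, and a word counts as correct only if all of its characters are labeled correctly, which is one way to read the word accuracy reported above.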
Index Terms
Morphological Segmentation and Part-of-Speech Tagging for the Arabic Heritage