short-paper

Morphological Segmentation and Part-of-Speech Tagging for the Arabic Heritage

Published: 02 April 2018

Abstract

We annotate 60,000 words of Classical Arabic (CA), covering topics in philosophy, religion, literature, and law, with fine-grained, segment-based morphological descriptions. We use these annotations to build a morphological segmenter and part-of-speech (POS) tagger for CA. Using character-level classification with features from the word and its lexical context, the segmenter achieves a word accuracy of 96.8%; the main issue is a high rate of out-of-vocabulary words. A token-based POS tagger achieves an accuracy of 96.22% overall and 97.72% on known tokens, despite the small size of the corpus. An error analysis shows that most tagging errors result from segmentation and that quality improves as more data is added. The morphological segmenter and tagger have a wide range of potential applications in processing CA, a low-resource variety of the language.
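The abstract describes segmentation as character-level classification with features drawn from the word and its lexical context. The sketch below illustrates one common way to frame that idea: each character receives a label saying whether a segment boundary follows it, and a feature dictionary is built from neighboring characters and neighboring words. The function names, the feature set, the window size, and the Buckwalter-style example word are all assumptions made for illustration; the abstract does not specify the paper's actual features or classifier.

```python
from typing import Dict, List, Sequence


def boundary_labels(segments: Sequence[str]) -> List[int]:
    """Label each character 1 if a segment boundary follows it, else 0.

    The final character of the word is always 0 (hypothetical label scheme).
    """
    labels = []
    for seg_index, seg in enumerate(segments):
        for char_index in range(len(seg)):
            ends_segment = char_index == len(seg) - 1
            is_last_segment = seg_index == len(segments) - 1
            labels.append(1 if ends_segment and not is_last_segment else 0)
    return labels


def char_features(words: Sequence[str], word_index: int, char_index: int,
                  window: int = 2) -> Dict[str, str]:
    """Build illustrative features for one character: the character itself,
    its neighbors inside the word, the whole word, and the adjacent words."""
    word = words[word_index]
    feats = {"char": word[char_index], "word": word}
    for offset in range(1, window + 1):
        left, right = char_index - offset, char_index + offset
        feats[f"char-{offset}"] = word[left] if left >= 0 else "<s>"
        feats[f"char+{offset}"] = word[right] if right < len(word) else "</s>"
    feats["prev_word"] = words[word_index - 1] if word_index > 0 else "<s>"
    feats["next_word"] = (words[word_index + 1]
                          if word_index + 1 < len(words) else "</s>")
    return feats


def apply_labels(word: str, labels: Sequence[int], sep: str = "+") -> str:
    """Turn per-character boundary labels back into a segmented word."""
    pieces = []
    for ch, label in zip(word, labels):
        pieces.append(ch)
        if label == 1:
            pieces.append(sep)
    return "".join(pieces)


def word_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of words whose entire segmentation matches the gold standard."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)
```

Under this framing, any per-character classifier can be trained on the feature dictionaries and labels, and a word counts as correct only when every boundary in it is predicted correctly, which is what a word-level accuracy figure measures.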

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 17, Issue 3
        September 2018
        196 pages
        ISSN:2375-4699
        EISSN:2375-4702
        DOI:10.1145/3184403
        Copyright © 2018 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 2 April 2018
        • Revised: 1 December 2017
        • Accepted: 1 December 2017
        • Received: 1 June 2015
        Qualifiers

        • short-paper
        • Research
        • Refereed
