
A Semisupervised Tag-Transition-Based Markovian Model for Uyghur Morphology Analysis

Published: 04 November 2016

Abstract

Morphological analysis, which comprises part-of-speech (POS) tagging, stemming, and morpheme segmentation, is a key component of natural language processing (NLP), particularly for agglutinative languages. In this article, we investigate the morphological analysis of the Uyghur language, which is the native language of the people in the Xinjiang Uyghur Autonomous Region of western China. Morphological analysis of Uyghur is challenging primarily because of (1) ambiguity arising when multiple POS tags can be associated with a word stem, or multiple functional tags with a word suffix, (2) ambiguous morpheme boundaries, and (3) the complex morphophonology of the language. Further, the lack of a manually annotated training set for Uyghur word segmentation makes morphological analysis more difficult. We address these challenges with a semisupervised approach that learns a Markov model, with the help of a manually constructed dictionary of “suffix to tag” mappings, to predict the most likely tag transitions in a Uyghur morpheme sequence. Motivated by the linguistic characteristics of Uyghur, we incorporate a prior belief into the model that favors word segmentations with fewer morpheme units. Empirical evaluation of the proposed model shows an accuracy of about 82%. We further improve the effectiveness of the tag transition model with an active learning paradigm. In particular, we manually investigated a subset of words for which the model prediction ambiguity was within the top 20%. Manually incorporating rules to handle these erroneous cases resulted in an overall accuracy of 93.81%.
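The idea of scoring candidate segmentations by tag transitions, with a prior that favors fewer morpheme units, can be illustrated with a minimal Python sketch. This is not the article's implementation: the abstract does not give the model's exact form, so the per-morpheme penalty, the probability floor, the tag names, the toy romanized suffix entries, and all probability values below are assumptions chosen for illustration only.

```python
import math

# Hypothetical "suffix to tag" dictionary (toy entries, not from the article).
SUFFIX_TAGS = {
    "lar": ["PL"],
    "da": ["LOC"],
    "la": ["VBZ"],
    "r": ["AOR"],
}

# Hypothetical tag-transition probabilities P(next | prev).
# In a semisupervised setting these would be estimated from data;
# here they are hand-picked toy values. Unseen pairs fall back to FLOOR.
TRANSITIONS = {
    ("STEM", "PL"): 0.4,
    ("STEM", "VBZ"): 0.2,
    ("PL", "LOC"): 0.5,
    ("VBZ", "AOR"): 0.5,
    ("AOR", "LOC"): 0.1,
    ("LOC", "END"): 0.6,
}
FLOOR = 1e-4
LENGTH_PENALTY = 0.5  # assumed prior factor per morpheme; < 1 favors fewer units


def split_suffixes(rest):
    """Yield every way to split `rest` into suffixes found in SUFFIX_TAGS."""
    if not rest:
        yield []
        return
    for end in range(1, len(rest) + 1):
        head = rest[:end]
        if head in SUFFIX_TAGS:
            for tail in split_suffixes(rest[end:]):
                yield [head] + tail


def tag_paths(suffixes):
    """Yield every tag sequence the dictionary allows for the suffixes."""
    if not suffixes:
        yield []
        return
    for tag in SUFFIX_TAGS[suffixes[0]]:
        for tail in tag_paths(suffixes[1:]):
            yield [tag] + tail


def score(tags):
    """Log score: transition log-probabilities plus a per-morpheme penalty."""
    path = ["STEM"] + tags + ["END"]
    logp = sum(math.log(TRANSITIONS.get(pair, FLOOR))
               for pair in zip(path, path[1:]))
    # The stem counts as one morpheme unit, hence len(tags) + 1.
    return logp + (len(tags) + 1) * math.log(LENGTH_PENALTY)


def analyze(word, min_stem=2):
    """Return (score, stem, suffixes, tags) for the best-scoring analysis."""
    candidates = []
    for cut in range(min_stem, len(word) + 1):
        stem, rest = word[:cut], word[cut:]
        for suffixes in split_suffixes(rest):
            for tags in tag_paths(suffixes):
                candidates.append((score(tags), stem, suffixes, tags))
    return max(candidates)
```

With these toy values, `analyze("kitablarda")` prefers the two-suffix analysis `kitab + lar + da` (tags `PL`, `LOC`) over the three-suffix split `kitab + la + r + da`, because the latter has both less likely transitions and one more morpheme unit to penalize.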



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 16, Issue 2
TALLIP Notes and Regular Papers
June 2017
136 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3008658

Copyright © 2016 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 4 November 2016
• Accepted: 1 July 2016
• Revised: 1 June 2016
• Received: 1 February 2015


Qualifiers

• Note
• Research
• Refereed
