Abstract
Morphological analysis, which includes part-of-speech (POS) tagging, stemming, and morpheme segmentation, is a key component of natural language processing (NLP), particularly for agglutinative languages. In this article, we investigate the morphological analysis of Uyghur, the native language of the people in the Xinjiang Uyghur Autonomous Region of western China. Morphological analysis of Uyghur is challenging primarily because of (1) ambiguity, since multiple POS tags can be associated with a word stem and multiple functional tags with a word suffix, (2) ambiguous morpheme boundaries, and (3) the complex morphophonology of the language. The lack of a manually annotated training set for Uyghur word segmentation makes the task harder still. We address these challenges with a semisupervised approach: we learn a Markov model, aided by a manually constructed dictionary of “suffix to tag” mappings, to predict the most likely tag transitions in a Uyghur morpheme sequence. Motivated by the linguistic characteristics of Uyghur, we incorporate a prior into the model that favors word segmentations with fewer morpheme units. Empirical evaluation of the proposed model shows an accuracy of about 82%. We further improve the tag transition model with an active learning paradigm. In particular, we manually investigated a subset of words for which the model prediction ambiguity was within the top 20%. Manually incorporating rules to handle these erroneous cases raised the overall accuracy to 93.81%.
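The scoring idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the suffix dictionary, the tag names, the transition probabilities, the Latin-script example word, and the exponential form of the prior (a penalty of `lam` per morpheme in log space) are all assumptions made up for illustration. The sketch enumerates candidate segmentations of a word, scores each by the log-probability of its tag transitions, and subtracts a penalty that grows with the number of morpheme units.

```python
import math
from itertools import product

# Hypothetical "suffix to tag" dictionary (toy entries, not from the article).
SUFFIX_TAGS = {"lar": ["PL"], "da": ["LOC"], "ni": ["ACC"]}

# Hypothetical tag-transition probabilities P(next_tag | tag); invented numbers.
TRANSITIONS = {
    ("STEM", "PL"): 0.5, ("STEM", "LOC"): 0.2, ("STEM", "ACC"): 0.2,
    ("STEM", "END"): 0.01,
    ("PL", "LOC"): 0.6, ("PL", "ACC"): 0.3, ("PL", "END"): 0.1,
    ("LOC", "END"): 0.9, ("ACC", "END"): 0.9,
}
FLOOR = 1e-6  # probability assigned to transitions missing from the table


def suffix_splits(rest):
    """Yield every way to split `rest` into suffixes found in SUFFIX_TAGS."""
    if not rest:
        yield []
        return
    for i in range(1, len(rest) + 1):
        head = rest[:i]
        if head in SUFFIX_TAGS:
            for tail in suffix_splits(rest[i:]):
                yield [head] + tail


def candidates(word):
    """Yield (morphemes, tags) for every stem + suffix-chain analysis of `word`."""
    for i in range(1, len(word) + 1):
        stem, rest = word[:i], word[i:]
        for suffixes in suffix_splits(rest):
            # A suffix may map to several tags; expand every combination.
            for tags in product(*(SUFFIX_TAGS[s] for s in suffixes)):
                yield [stem] + suffixes, ["STEM"] + list(tags)


def score(tags, n_morphemes, lam):
    """Log-probability of the tag path minus a penalty of `lam` per morpheme."""
    path = tags + ["END"]
    log_p = sum(math.log(TRANSITIONS.get(pair, FLOOR))
                for pair in zip(path, path[1:]))
    return log_p - lam * n_morphemes


def best_analysis(word, lam=0.2):
    """Return the highest-scoring (morphemes, tags) pair for `word`."""
    return max(candidates(word), key=lambda c: score(c[1], len(c[0]), lam))


print(best_analysis("kitablarda"))
print(best_analysis("kitablarda", lam=1.0))
```

With these invented numbers, a weak penalty keeps the three-morpheme analysis, while a strong penalty pushes the model toward the two-morpheme analysis, which shows how the strength of such a prior trades off against the transition model.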
Index Terms
A Semisupervised Tag-Transition-Based Markovian Model for Uyghur Morphology Analysis