Abstract
In agglutinative languages, the number of possible word forms is theoretically infinite, which raises the out-of-vocabulary (OOV) problem for part-of-speech (PoS) tagging. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems and PoS tags simultaneously, aiming to overcome the sparsity problem by reducing word forms to their stems. We adopt a fully unsupervised Bayesian model: a Hidden Markov Model for PoS tagging in which the stems are emitted from the hidden states. We introduce several versions of the model to observe the effects of different dependencies throughout the corpus, such as the dependency between stems and PoS tags or between PoS tags and affixes. Additionally, we use neural word embeddings to estimate the semantic similarity between a word form and its stem. Since inflection does not change the meaning of a word, we use this semantic similarity as prior information to discover the actual stem of a word. We compare our models with other unsupervised stemming and PoS tagging models on Turkish, Hungarian, Finnish, Basque, and English. The results show that a joint model for PoS tagging and stemming improves on an independent PoS tagger and stemmer in agglutinative languages.
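The idea of using embedding similarity as prior information for stem discovery can be illustrated with a minimal sketch. The abstract does not say how the similarity enters the model, so everything here is an assumption made for illustration: candidate stems are taken to be prefixes of the word, similarity is cosine similarity, negative values are clipped to zero, and the scores are normalized into a distribution. The function names (`cosine`, `stem_prior`) and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def stem_prior(word, embeddings, min_stem_len=2):
    """Return a normalized prior over candidate stems of `word`.

    Assumption: candidates are prefixes of the word that have an
    embedding; each is weighted by its (clipped) cosine similarity
    to the full word form.
    """
    scores = {}
    if word in embeddings:
        for end in range(min_stem_len, len(word) + 1):
            stem = word[:end]
            if stem in embeddings:
                sim = cosine(embeddings[word], embeddings[stem])
                scores[stem] = max(sim, 0.0)  # clip negative similarity
    total = sum(scores.values())
    if total == 0.0:
        # No usable evidence: fall back to the word as its own stem.
        return {word: 1.0}
    return {stem: score / total for stem, score in scores.items()}


# Toy, hand-made vectors (not real embeddings).
TOY_EMBEDDINGS = {
    "evlerde": [0.90, 0.10, 0.20],  # Turkish: "in the houses"
    "evler": [0.80, 0.20, 0.20],    # "houses"
    "ev": [0.85, 0.10, 0.25],       # "house"
}

prior = stem_prior("evlerde", TOY_EMBEDDINGS)
for stem, p in sorted(prior.items()):
    print(f"{stem}: {p:.3f}")
```

In a sampler, a weight of this kind could be multiplied into the probability of each candidate segmentation, so that splits whose stem stays semantically close to the word form are preferred. That coupling is likewise an assumption of this sketch.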
Unsupervised Joint PoS Tagging and Stemming for Agglutinative Languages