Abstract
Part-of-speech (PoS) tagging is one of the fundamental syntactic tasks in Natural Language Processing: it assigns a syntactic category (such as noun, verb, or adjective) to each word in a given sentence or context. These categories can then be used to analyze sentence-level syntax (e.g., dependency parsing) and, in turn, to extract the meaning of the sentence (e.g., semantic parsing). Various methods have been proposed for learning PoS tags in an unsupervised setting, without any annotated corpora. Log-linear models are among the most widely used methods for the tagging problem. Because inference in a log-linear model depends heavily on how its parameters are initialized, a range of initialization techniques has been used so far. In this work, we present a log-linear model for PoS tagging whose parameters are initialized by another fully unsupervised Bayesian model in a cascaded framework. We thereby transfer knowledge between two different unsupervised models to improve PoS tagging results, with the log-linear model benefiting from the Bayesian model’s expertise. We present results in a fully unsupervised framework for Turkish, a morphologically rich language, and for English, a comparatively morphologically poor language. The results show that our framework outperforms other unsupervised models proposed for PoS tagging.
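As a rough illustration of the cascading idea, the sketch below seeds the emission weights of a toy log-linear model from the tag assignments of a first-stage tagger. It is not the method of this work: the abstract does not specify the features, the Bayesian model, or the training procedure. The function names, the smoothing constant `alpha`, the use of word-identity features only, and the hard-coded `stage1_tags` (standing in for the output of a first-stage Bayesian tagger) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def init_emission_weights(words, tags, alpha=0.1):
    """Seed log-linear emission weights from first-stage tag assignments.

    Assumption: one indicator feature per (tag, word) pair, so setting each
    weight to the log of the smoothed relative frequency makes the softmax
    reproduce the first-stage emission estimates exactly.
    """
    vocab = sorted(set(words))
    counts = defaultdict(Counter)
    for word, tag in zip(words, tags):
        counts[tag][word] += 1

    weights = {}
    for tag, tag_counts in counts.items():
        total = sum(tag_counts.values()) + alpha * len(vocab)
        weights[tag] = {
            word: math.log((tag_counts[word] + alpha) / total) for word in vocab
        }
    return weights


def emission_probs(tag_weights):
    """Softmax over the words for a single tag (numerically stabilized)."""
    max_weight = max(tag_weights.values())
    exps = {word: math.exp(w - max_weight) for word, w in tag_weights.items()}
    normalizer = sum(exps.values())
    return {word: value / normalizer for word, value in exps.items()}


# Hypothetical first-stage output: cluster ids, not gold tags.
words = ["the", "dog", "runs", "the", "cat", "sleeps"]
stage1_tags = [0, 1, 2, 0, 1, 2]

weights = init_emission_weights(words, stage1_tags)
probs = emission_probs(weights[0])
```

In a full cascade, weights seeded this way would serve only as the starting point for the second-stage optimizer, which would then update them on the unlabeled corpus.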
A Cascaded Unsupervised Model for PoS Tagging