Abstract
Deep contextualized word embeddings (Embeddings from Language Models, ELMo), an emerging and effective replacement for static word embeddings, have achieved success on a range of syntactic and semantic NLP tasks. However, little is known about what is responsible for the improvements. In this article, we focus on the effect of ELMo on two typical syntactic problems: universal POS tagging and dependency parsing. We incorporate ELMo as additional word embeddings into a state-of-the-art POS tagger and dependency parser, which leads to consistent performance improvements. Experimental results show that the model using ELMo outperforms the state-of-the-art baseline by an average of 0.91 for POS tagging and 1.11 for dependency parsing. Further analysis reveals that the improvements mainly result from ELMo's better abstraction of out-of-vocabulary (OOV) words, to which the character-level word representation in ELMo contributes substantially. Building on ELMo's advantage on OOV words, we conduct experiments that simulate low-resource settings; the results show that deep contextualized word embeddings are effective for data-insufficient tasks where the OOV problem is severe.
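The incorporation described above follows the standard ELMo recipe: a task-specific softmax-weighted sum of the biLM's layer representations is computed per token and concatenated with the static word embedding before being fed to the tagger or parser. The sketch below illustrates that mixing step in plain Python; the layer values, dimensions, and function name are illustrative, and in practice the layer vectors come from a pretrained biLM and the scalar weights are learned with the downstream model.

```python
import math

def scalar_mix(layers, scalars, gamma=1.0):
    """ELMo-style task-specific mix: softmax-normalize the scalar
    weights, then take the weighted sum of the layer vectors."""
    exps = [math.exp(s) for s in scalars]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layers[0])
    return [gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
            for i in range(dim)]

# One token represented by three hypothetical biLM layers (dimension 4).
layers = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8],
          [0.9, 1.0, 1.1, 1.2]]

# Equal scalars give uniform weights, i.e. a plain layer average.
mixed = scalar_mix(layers, scalars=[0.0, 0.0, 0.0])

# Concatenate with a static embedding (e.g. GloVe) to form the
# token-level input to the POS tagger or dependency parser.
static = [0.25, 0.5, 0.75, 1.0]
token_input = static + mixed
```

With equal scalars the mix reduces to the mean of the three layers, so `mixed` is `[0.5, 0.6, 0.7, 0.8]` and the parser receives an 8-dimensional concatenated vector for this token.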
Deep Contextualized Word Embeddings for Universal Dependency Parsing