
Deep Contextualized Word Embeddings for Universal Dependency Parsing

Published: 13 July 2019

Abstract

Deep contextualized word embeddings (Embeddings from Language Models, ELMo for short), an emerging and effective replacement for static word embeddings, have achieved success on a range of syntactic and semantic NLP problems. However, little is known about what is responsible for the improvements. In this article, we focus on the effect of ELMo on a typical syntax problem: universal POS tagging and dependency parsing. We incorporate ELMo as additional word embeddings into the state-of-the-art POS tagger and dependency parser, which leads to consistent performance improvements. Experimental results show that the model using ELMo outperforms the state-of-the-art baseline by an average of 0.91 points for POS tagging and 1.11 points for dependency parsing. Further analysis reveals that the improvements mainly result from ELMo's better abstraction over out-of-vocabulary (OOV) words, and that the character-level word representation in ELMo contributes substantially to this abstraction. Building on ELMo's advantage on OOV words, we conduct experiments that simulate low-resource settings; the results show that deep contextualized word embeddings are effective for data-insufficient tasks in which the OOV problem is severe.
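The "additional word embeddings" setup described in the abstract can be sketched in a few lines. ELMo is typically consumed by combining its per-layer representations with softmax-normalized scalar weights and a global scale, then concatenating the result with the static word embedding before it enters the tagger or parser. The sketch below uses NumPy stand-ins for the actual ELMo layer outputs; the function names, dimensions, and initial weights are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scalar weights.
    e = np.exp(x - x.max())
    return e / e.sum()

def scalar_mix(layer_reps, weights, gamma):
    # ELMo-style combination: a softmax-weighted sum of the
    # per-layer representations, scaled by a global factor gamma.
    s = softmax(weights)
    return gamma * sum(w * h for w, h in zip(s, layer_reps))

def parser_input(static_emb, layer_reps, weights, gamma):
    # Concatenate the static embedding with the mixed contextualized
    # representation to form the word-level input to the parser.
    return np.concatenate([static_emb, scalar_mix(layer_reps, weights, gamma)])

# Toy example: 3 ELMo layers (char-CNN layer + 2 biLSTM layers) of
# dimension 1024, plus a 100-dimensional static embedding.
rng = np.random.default_rng(0)
layers = [rng.standard_normal(1024) for _ in range(3)]
static = rng.standard_normal(100)
x = parser_input(static, layers, weights=np.zeros(3), gamma=1.0)
print(x.shape)  # (1124,)
```

With all scalar weights initialized to zero, the softmax is uniform and the mix reduces to the mean of the layer representations; in practice the weights and gamma are learned jointly with the downstream model.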

