Abstract
Dependency-based graph convolutional networks (DepGCNs) have proven helpful for text representation in many natural language tasks. Almost all previous models are trained with cross-entropy (CE) loss, which directly maximizes the posterior likelihood. However, CE loss does not explicitly account for the contribution of dependency structures; as a result, the performance gain from structure information can be narrow because the model fails to learn to rely on it. To address this challenge, we propose a novel structurally comparative hinge (SCH) loss function for DepGCNs. SCH loss aims to enlarge the margin gained by structural representations over non-structural ones. From the perspective of information theory, this is equivalent to increasing the conditional mutual information between the model's decision and the structure information, given the text. Our experimental results on both English and Chinese datasets show that substituting SCH loss for CE loss improves performance on various tasks, for both induced structures and structures from an external parser, without adding learnable parameters. Furthermore, the learned margin directly measures how much certain types of examples rely on the dependency structure, which improves interpretability. Detailed analysis shows that this structure margin correlates positively with task performance and with the structure induction of DepGCNs, and that SCH loss helps the model focus more on the shortest dependency path between entities. We achieve new state-of-the-art results on the TACRED, IMDB, and Zh. Literature datasets, even compared with ensemble and BERT baselines.
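The comparison at the heart of SCH loss can be sketched as a hinge on the score gap between a structure-aware branch and a non-structural branch of the same classifier. The sketch below is illustrative only: the function `sch_loss`, its arguments, and the use of gold-label log-probabilities as the two scores are assumptions for exposition, not the paper's exact formulation.

```python
import math

def log_prob(logits, gold):
    # Log-probability of the gold class under a softmax over logits,
    # computed with the standard max-shift for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[gold] - log_z

def sch_loss(struct_logits, plain_logits, gold, margin=1.0):
    """Structurally comparative hinge loss (illustrative sketch).

    Zero loss only when the structure-aware branch outscores the
    non-structural branch on the gold label by at least `margin`;
    otherwise the shortfall is penalized linearly.
    """
    gap = log_prob(struct_logits, gold) - log_prob(plain_logits, gold)
    return max(0.0, margin - gap)
```

In this reading, minimizing the loss pushes the structural representation to beat its non-structural counterpart by a fixed margin, so the model is rewarded only for decisions that genuinely exploit the dependency structure.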
Structurally Comparative Hinge Loss for Dependency-Based Neural Text Representation