RCE-HIL: Recognizing Cross-media Entailment with Heterogeneous Interactive Learning

Published: 17 February 2020

Abstract

Entailment recognition is an important reasoning paradigm that judges whether a hypothesis can be inferred from given premises. However, previous efforts concentrate mainly on text-based reasoning, as in recognizing textual entailment (RTE), where both the hypotheses and the premises are textual. Human reasoning, in contrast, is inherently cross-media: it draws on joint inference across different sensory channels, which contribute complementary reasoning cues from the distinct perspectives of language, vision, and audition. Realizing such cross-media reasoning remains a significant challenge for extending both the breadth and the depth of entailment recognition. This article therefore extends RTE to a novel reasoning paradigm, recognizing cross-media entailment (RCE), and proposes a heterogeneous interactive learning (HIL) approach. HIL recognizes entailment relationships via cross-media joint inference, from image-text premises to textual hypotheses. It is an end-to-end architecture with two parts: (1) Cross-media hybrid embedding performs cross embedding of premises and hypotheses to generate fine-grained representations, aiming to align cross-media inference cues via image-text and text-text interactive attention. (2) Heterogeneous joint inference constructs a heterogeneous interaction tensor space and extracts semantic features for entailment recognition, aiming to simultaneously capture the interaction between cross-media premises and hypotheses and to distinguish their entailment relationships. Experimental results on the widely used Stanford Natural Language Inference (SNLI) dataset, with image premises from the Flickr30K dataset, verify the effectiveness of HIL and the intrinsic inter-media complementarity in reasoning.
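The abstract describes two components but gives no implementation details. The following is a minimal numpy sketch of the general ideas only, under assumptions not stated in the paper: that "interactive attention" is bidirectional cross-attention between premise units (e.g., image-region features or premise tokens) and hypothesis tokens, and that the "interaction tensor" is an outer product of pooled premise and hypothesis vectors. All function names, dimensions, and pooling choices here are illustrative, not the authors' method.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interactive_attention(premise, hypothesis):
    """Bidirectional cross-attention between premise units and hypothesis
    tokens: each side attends to the other, yielding aligned summaries."""
    sim = premise @ hypothesis.T                       # (n_prem, n_hyp) similarity
    attended_premise = softmax(sim, axis=1) @ hypothesis     # hypothesis-aware premise
    attended_hypothesis = softmax(sim.T, axis=1) @ premise   # premise-aware hypothesis
    return attended_premise, attended_hypothesis

def interaction_tensor(p_vec, h_vec):
    """Outer-product interaction space between pooled premise and hypothesis
    vectors, flattened as input features for a downstream classifier."""
    return np.outer(p_vec, h_vec).ravel()

# Toy inputs: 36 image-region features and 12 hypothesis token embeddings.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 64))
tokens = rng.normal(size=(12, 64))

ap, ah = interactive_attention(regions, tokens)
features = interaction_tensor(ap.mean(axis=0), ah.mean(axis=0))
```

In the full model, `features` would feed a classifier over the three entailment labels (entailment, contradiction, neutral), and the random toy inputs would be replaced by learned image-region and word embeddings.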

