Abstract
Entailment recognition is an important reasoning paradigm that judges whether a hypothesis can be inferred from given premises. Previous work, however, has concentrated mainly on text-based reasoning in the form of recognizing textual entailment (RTE), where both the hypotheses and the premises are textual. Human reasoning, in contrast, is inherently cross-media: it draws on joint inference over different senses, such as language, vision, and audition, each of which contributes complementary reasoning cues from its own perspective. Realizing cross-media reasoning is therefore a significant challenge for extending both the breadth and the depth of entailment recognition. This article extends RTE to a novel reasoning paradigm, recognizing cross-media entailment (RCE), and proposes a heterogeneous interactive learning (HIL) approach. Specifically, HIL recognizes entailment relationships via cross-media joint inference, from image-text premises to text hypotheses. It is an end-to-end architecture with two parts: (1) Cross-media hybrid embedding performs cross embedding of premises and hypotheses to generate their fine-grained representations. It aims to align cross-media inference cues via image-text and text-text interactive attention. (2) Heterogeneous joint inference constructs a heterogeneous interaction tensor space and extracts semantic features for entailment recognition. It aims to simultaneously capture the interactions between cross-media premises and hypotheses and distinguish their entailment relationships. Experimental results on the widely used Stanford Natural Language Inference (SNLI) dataset, with image premises from the Flickr30K dataset, verify the effectiveness of HIL and the intrinsic inter-media complementarity in reasoning.
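The two-part design described in the abstract can be illustrated with a minimal NumPy sketch. The abstract gives no equations, so everything below is an assumption made for illustration rather than the HIL architecture itself: dot-product attention stands in for "interactive attention", an element-wise product over all premise-hypothesis pairs stands in for the "interaction tensor", and the feature sizes, random inputs, max-pooling step, and untrained linear classifier `W` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interactive_attention(queries, keys):
    """For each query vector, return an attention-weighted summary of the keys.

    Assumption: plain dot-product attention; the paper may use another form.
    """
    scores = queries @ keys.T           # (n_q, n_k) pairwise similarities
    weights = softmax(scores, axis=1)   # each row sums to 1
    return weights @ keys               # (n_q, d)

def interaction_tensor(premise, hypothesis):
    """Element-wise interaction of every premise/hypothesis pair: (n_p, n_h, d)."""
    return premise[:, None, :] * hypothesis[None, :, :]

rng = np.random.default_rng(0)
d = 8                                        # hypothetical feature size
image_regions = rng.normal(size=(5, d))      # stand-in image-premise features
premise_words = rng.normal(size=(6, d))      # stand-in text-premise features
hypothesis_words = rng.normal(size=(4, d))   # stand-in hypothesis features

# Part 1 (sketch): the hypothesis attends to both premise media
# (image-text and text-text attention), and the results are fused by addition.
hyp_from_image = interactive_attention(hypothesis_words, image_regions)
hyp_from_text = interactive_attention(hypothesis_words, premise_words)
hyp_fused = hypothesis_words + hyp_from_image + hyp_from_text   # (4, d)

# Part 2 (sketch): build an interaction tensor between the cross-media
# premise and the fused hypothesis, pool it, and classify.
cross_media_premise = np.concatenate([image_regions, premise_words], axis=0)
tensor = interaction_tensor(cross_media_premise, hyp_fused)     # (11, 4, d)
features = tensor.max(axis=(0, 1))                              # (d,)

labels = ["entailment", "neutral", "contradiction"]  # the three SNLI classes
W = rng.normal(size=(d, len(labels)))                # untrained, hypothetical
probs = softmax(features @ W)
print(dict(zip(labels, probs.round(3))))
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how the image and text premises can both enter a single premise-hypothesis interaction space before classification.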
Index Terms
RCE-HIL: Recognizing Cross-media Entailment with Heterogeneous Interactive Learning