Abstract
Performing direct matching between different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to build a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneity gap between visual and textual information, it provides only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on cross-modal retrieval have attempted to pull the two distributions closer by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not been demonstrated, and no effective method yet exists. Inspired by these works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding as belonging to either the image or the text modality. In addition, a multi-stage training procedure is carefully designed so that the proposed network not only imposes image-text similarity constraints via ground-truth labels, but also enforces similarity between the image and text embedding distributions through adversarial learning.
Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvement over the baseline model and that our results compare favorably to the state-of-the-art methods.
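To make the two training signals concrete, the sketch below shows forward-pass computations of (a) a bidirectional hinge-based triplet ranking loss over a batch, of the kind commonly used as an image-sentence matching baseline, and (b) a logistic modality-classification loss whose negation serves as the adversarial objective for the embedding networks. This is an illustrative NumPy sketch, not the authors' implementation: the margin value, the use of all in-batch negatives, and the linear classifier parameters `w`, `b` are assumptions for the example.

```python
import numpy as np

def l2norm(x):
    """Row-wise L2 normalisation, so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def triplet_ranking_loss(im, s, margin=0.2):
    """Bidirectional triplet ranking loss over a batch.

    im, s: (n, d) L2-normalised image / sentence embeddings; matched
    pairs share a row index, and every other row acts as a negative.
    """
    scores = im @ s.T                 # (n, n) pairwise cosine similarities
    pos = np.diag(scores)             # similarities of the matched pairs
    # image -> sentence direction: each non-matching sentence is a negative
    cost_s = np.maximum(0.0, margin + scores - pos[:, None])
    # sentence -> image direction: each non-matching image is a negative
    cost_im = np.maximum(0.0, margin + scores - pos[None, :])
    mask = ~np.eye(scores.shape[0], dtype=bool)   # exclude the positive pairs
    return (cost_s[mask].sum() + cost_im[mask].sum()) / scores.shape[0]

def modality_adversarial_loss(emb, is_image, w, b):
    """Logistic modality classifier: predicts image (1) vs. text (0).

    The classifier minimises this cross-entropy; the embedding networks are
    updated against it (e.g. via a gradient reversal layer), which pushes the
    image and text embedding distributions together.  w, b are hypothetical
    classifier parameters.  Returns (classifier loss, embedding-side loss).
    """
    logits = emb @ w + b
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    ce = -(is_image * np.log(p + eps)
           + (1.0 - is_image) * np.log(1.0 - p + eps)).mean()
    return ce, -ce
```

In a multi-stage schedule of the kind the abstract describes, the triplet loss would first train the embedding networks alone, after which the modality classifier and the adversarial term are brought in, alternating classifier updates (minimising `ce`) with embedding updates (minimising the triplet loss plus `-ce`).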
Index Terms
Modality-Invariant Image-Text Embedding for Image-Sentence Matching