
Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Published: 07 February 2019

Abstract

Performing direct matching between different modalities (e.g., image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to build a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneity gap between visual and textual information, it provides only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which provides heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull the two distributions closer by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not been demonstrated, and no effective method yet exists. Inspired by these works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of a triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding into either the image or the text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes image-text similarity constraints via ground-truth labels, but also enforces the image and text embedding distributions to be similar through adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields a stable accuracy improvement over the baseline model and that our results compare favorably to the state-of-the-art methods.
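
To make the described setup concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation; all module and function names (ModalityClassifier, triplet_ranking_loss, adversarial_losses) are illustrative assumptions. It shows the two objectives the abstract combines: a bidirectional triplet ranking loss over matched image-sentence pairs, and an adversarial modality-classification loss that pushes the image and text embedding distributions to be indistinguishable.

```python
# Minimal sketch (assumed names, not the authors' code) of a triplet ranking
# loss plus an adversarial modality-classification loss for image-text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityClassifier(nn.Module):
    """Predicts whether an embedding came from the image or the text branch."""

    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, x):
        return self.net(x)


def triplet_ranking_loss(img, txt, margin=0.2):
    """Bidirectional hinge ranking loss; row i of img matches row i of txt."""
    img = F.normalize(img, dim=1)
    txt = F.normalize(txt, dim=1)
    scores = img @ txt.t()                   # cosine similarity matrix
    pos = scores.diag().view(-1, 1)          # similarities of matched pairs
    cost_i2t = (margin + scores - pos).clamp(min=0)      # image-to-text direction
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)  # text-to-image direction
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_i2t.masked_fill(mask, 0).sum() + cost_t2i.masked_fill(mask, 0).sum()


def adversarial_losses(classifier, img, txt):
    """Classifier tries to tell modalities apart; embeddings try to fool it."""
    emb = torch.cat([img, txt], dim=0)
    labels = torch.cat([torch.zeros(len(img)), torch.ones(len(txt))]).long().to(emb.device)
    logits = classifier(emb)
    d_loss = F.cross_entropy(logits, labels)       # used to update the classifier
    g_loss = F.cross_entropy(logits, 1 - labels)   # flipped labels, used to update the embedding nets
    return d_loss, g_loss
```

In a multi-stage schedule of the kind the abstract describes, the modality classifier would typically be optimized with d_loss, while the image and text embedding networks would be optimized with the ranking loss plus g_loss.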



    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1
      February 2019
      265 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3309717

      Copyright © 2019 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 7 February 2019
      • Accepted: 1 November 2018
      • Revised: 1 August 2018
      • Received: 1 March 2018
      Published in TOMM Volume 15, Issue 1

      Qualifiers

      • research-article
      • Research
      • Refereed
