DOI: 10.1145/3477495.3531783
Short paper
Open Access

Image-Text Retrieval via Contrastive Learning with Auxiliary Generative Features and Support-set Regularization

Published: 07 July 2022

ABSTRACT

In this paper, we bridge the heterogeneity gap between modalities and improve image-text retrieval by exploiting auxiliary image-to-text and text-to-image generative features with contrastive learning. Concretely, contrastive learning is devised to narrow the distance between aligned image-text pairs and push apart unaligned pairs, from both inter- and intra-modality perspectives, with the help of cross-modal retrieval features and auxiliary generative features. In addition, we devise a support-set regularization term that further improves contrastive learning by constraining the distance between each image/text and the corresponding cross-modal support-set information contained in the same semantic category. To evaluate the effectiveness of the proposed method, we conduct experiments on three benchmark datasets (i.e., MIRFLICKR-25K, NUS-WIDE, MS COCO). Experimental results show that our model significantly outperforms strong baselines for cross-modal image-text retrieval. For reproducibility, we release the code and data publicly at: https://github.com/Hambaobao/CRCGS.
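The inter-modality objective the abstract describes, pulling aligned image-text pairs together and pushing unaligned pairs apart, can be sketched as a symmetric InfoNCE-style contrastive loss over a batch. This is a minimal illustration of the general idea under stated assumptions (NumPy features, a hypothetical temperature of 0.1), not the authors' actual implementation:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project feature vectors onto the unit sphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def _log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def cross_modal_contrastive_loss(img_feats, txt_feats, temperature=0.1):
    """Symmetric InfoNCE over a batch of B image-text pairs.

    Row i of each matrix is assumed aligned with row i of the other, so the
    diagonal of the similarity matrix holds the positive (aligned) pairs and
    every off-diagonal entry is a negative (unaligned) pair to push apart.
    """
    img = l2_normalize(np.asarray(img_feats, dtype=float))
    txt = l2_normalize(np.asarray(txt_feats, dtype=float))
    logits = img @ txt.T / temperature  # (B, B) cosine similarities / tau
    # Image-to-text direction: softmax over each row; positives on the diagonal.
    i2t = -np.diag(_log_softmax(logits, axis=1)).mean()
    # Text-to-image direction: softmax over each column.
    t2i = -np.diag(_log_softmax(logits, axis=0)).mean()
    return (i2t + t2i) / 2
```

With perfectly aligned features the loss approaches zero, while shuffling the text side of the batch (breaking the alignment) drives it up, which is the behavior the contrastive term relies on. The paper's full method additionally feeds in generative features and the support-set regularizer, which this sketch omits.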


Supplemental Material

SIGIR22-sp1341.mp4: SIGIR 2022 presentation video of the short paper.


Published in
SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2022, 3569 pages
ISBN: 9781450387323
DOI: 10.1145/3477495
Copyright © 2022 Owner/Author
Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rate
Overall acceptance rate: 792 of 3,983 submissions, 20%