Research article
Semantic Completion and Filtration for Image–Text Retrieval

Published: 27 February 2023
Abstract

Image–text retrieval is a vital task in computer vision and has received growing attention because it connects data across modalities. Its central challenges are learning unified representations and bridging the large gap between the visual and textual domains. Although many works over the past few decades have made significant progress in image–text retrieval, they still struggle with incomplete text descriptions of images, i.e., with fully learning the correlations between relevant region–word pairs under semantic diversity. In this article, we propose a novel semantic completion and filtration (SCAF) method to alleviate this issue. Specifically, a text semantic completion module generates a complete semantic description of an image from multi-view text descriptions, guiding the model to fully explore the correlations of relevant region–word pairs. Meanwhile, an adaptive structural semantic matching module filters out irrelevant region–word pairs according to the relevance score of each pair, encouraging the model to focus on learning the relevance of matching pairs. Extensive experiments show that SCAF outperforms existing methods on the Flickr30K and MSCOCO datasets, demonstrating the superiority of the proposed method.
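As a rough illustration of the filtration idea described above (a minimal sketch, not the authors' implementation), the following code scores each region–word pair by cosine similarity and suppresses pairs whose relevance falls below a threshold, so that subsequent matching focuses on relevant pairs. The function name, feature shapes, and threshold value are all illustrative assumptions.

```python
import numpy as np

def filter_region_word_pairs(regions, words, threshold=0.2):
    """Score every region-word pair and zero out low-relevance pairs.

    regions: (R, D) array of image-region features (hypothetical)
    words:   (W, D) array of word embeddings (hypothetical)
    returns: (R, W) relevance matrix with irrelevant pairs set to 0
    """
    # L2-normalize both sides so the dot product is cosine similarity
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = r @ w.T                       # (R, W) cosine similarities in [-1, 1]
    sim = np.maximum(sim, 0.0)          # keep only non-negative relevance
    # Filter: pairs below the threshold contribute nothing downstream
    return np.where(sim >= threshold, sim, 0.0)
```

In a full model the surviving scores would typically weight a cross-attention aggregation; here the thresholding simply makes explicit which region–word pairs are treated as matches.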


• Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 4 (July 2023), 263 pages. ISSN: 1551-6857, EISSN: 1551-6865. DOI: 10.1145/3582888. Editor: Abdulmotaleb El Saddik.

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 23 April 2022
• Revised: 16 September 2022
• Accepted: 20 November 2022
• Online AM: 23 November 2022
• Published: 27 February 2023

Published in TOMM Volume 19, Issue 4

