
Tell, Imagine, and Search: End-to-end Learning for Composing Text and Image to Image Retrieval


Abstract

Composing Text and Image to Image Retrieval (CTI-IR) is an emerging task in computer vision, in which images are retrieved using a query image together with text describing desired modifications to that image. Conventional cross-modal retrieval approaches typically take data of one modality as the query to retrieve relevant data of another modality. Unlike these methods, in this article we propose an end-to-end trainable network for simultaneous image generation and CTI-IR. The proposed model is based on the Generative Adversarial Network (GAN) and enjoys several merits. First, it can learn a feature that is both generative and discriminative for the query (a query image with a text description) by jointly training a generative model and a retrieval model. Second, it can automatically manipulate the visual features of the reference image according to the text description, through adversarial learning between the synthesized image and the target image. Third, global-local collaborative discriminators and attention-based generators are exploited, allowing our approach to attend to both the global and local differences between the query image and the target image. As a result, the semantic consistency and fine-grained details of the generated images are better preserved. The generated image can also be used to interpret and improve the retrieval model. Quantitative and qualitative evaluations on three benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
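To make the abstract's joint objective concrete, below is a minimal PyTorch sketch of the idea: a composed text-plus-image query embedding trained with (i) a triplet retrieval loss against the target-image embedding and (ii) a GAN loss between an image synthesized from the query and the real target image. The gated residual composer, the toy MLP generator and single global discriminator, and all dimensions are illustrative assumptions, not the paper's architecture (which uses attention-based generators and global-local collaborative discriminators).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Composer(nn.Module):
    """Fuse a reference-image feature and a text feature into one query embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.res = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_feat, txt_feat):
        x = torch.cat([img_feat, txt_feat], dim=-1)
        # Gated residual fusion: preserve what the text leaves untouched,
        # modify what it describes.
        return self.gate(x) * img_feat + self.res(x)

def triplet_retrieval_loss(query, target, margin=0.2):
    """Each query should be closer to its own target than to every other
    target in the batch (matched pairs sit on the diagonal)."""
    q = F.normalize(query, dim=-1)
    t = F.normalize(target, dim=-1)
    sim = q @ t.t()                        # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)          # positive-pair similarity per row
    hinge = F.relu(margin + sim - pos)     # hinge over all in-batch negatives
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return hinge.masked_fill(mask, 0.0).mean()

dim = 256
composer = Composer(dim)
generator = nn.Sequential(nn.Linear(dim, 3 * 32 * 32), nn.Tanh())  # toy decoder
discriminator = nn.Linear(3 * 32 * 32, 1)                          # toy global D

# One illustrative training step with random stand-in features and images.
B = 8
img_feat, txt_feat = torch.randn(B, dim), torch.randn(B, dim)
target_feat = torch.randn(B, dim)                 # embedding of the target image
target_img = torch.rand(B, 3 * 32 * 32) * 2 - 1   # flattened targets in [-1, 1]

query = composer(img_feat, txt_feat)              # "tell": compose the query
fake_img = generator(query)                       # "imagine": synthesize an image

# Discriminator loss: real target images vs. synthesized images.
d_real, d_fake = discriminator(target_img), discriminator(fake_img.detach())
d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
       + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

# Retrieval + generator loss: rank the true target first ("search") while
# fooling the discriminator, so both objectives shape the same query feature.
g_adv = F.binary_cross_entropy_with_logits(discriminator(fake_img),
                                           torch.ones_like(d_real))
total = triplet_retrieval_loss(query, target_feat) + g_adv
print(f"d_loss={d_loss.item():.3f}  joint_loss={total.item():.3f}")
```

The point the sketch illustrates is that the adversarial loss and the retrieval loss backpropagate into the same composed query embedding, which is what makes the learned feature simultaneously generative and discriminative.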

REFERENCES

  1. [1] Ak Kenan E., Kassim Ashraf A., Lim Joo Hwee, and Tham Jo Yew. 2018. Learning attribute representations with localization for flexible fashion search. In CVPR. 77087717.Google ScholarGoogle Scholar
  2. [2] Ak Kenan E., Lim Joo Hwee, Tham Jo Yew, and Kassim Ashraf A.. 2019. Attribute manipulation generative adversarial networks for fashion images. In ICCV. 1054110550.Google ScholarGoogle Scholar
  3. [3] Chen H., Ding G., Liu Xudong, Lin Zijia, Liu J., and Han J.. 2020. IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In CVPR. 1265212660.Google ScholarGoogle Scholar
  4. [4] Chen Jiaxin, Qin Jie, Liu Li, Zhu Fan, Shen Fumin, Xie Jin, and Shao Ling. 2019. Deep sketch-shape hashing with segmented 3D stochastic viewing. In CVPR. 791800.Google ScholarGoogle Scholar
  5. [5] Chen Shizhe, Zhao Yida, Jin Qin, and Wu Qi. 2020. Fine-grained video-text retrieval with hierarchical graph reasoning. In CVPR.Google ScholarGoogle Scholar
  6. [6] Chung Junyoung, Gulcehre Caglar, Cho KyungHyun, and Bengio Yoshua. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555 (2014).Google ScholarGoogle Scholar
  7. [7] Dey Sounak, Riba Pau, Dutta Anjan, Llados Josep, and Song Yi-Zhe. 2019. Doodle to search: Practical zero-shot sketch-based image retrieval. In CVPR. 21792188.Google ScholarGoogle Scholar
  8. [8] Dong Jianfeng, Li Xirong, Xu Chaoxi, Ji Shouling, He Yuan, Yang Gang, and Wang Xun. 2019. Dual encoding for zero-example video retrieval. In CVPR. 93469355.Google ScholarGoogle Scholar
  9. [9] Dorfer Matthias, Schlüter Jan, Vall Andreu, Korzeniowski Filip, and Widmer Gerhard. 2018. End-to-end cross-modality retrieval with CCA projections and pairwise ranking loss. Int. J. Multimedia Inf. Retr. 7, 2 (2018), 117128.Google ScholarGoogle ScholarCross RefCross Ref
  10. [10] Dutta Anjan and Akata Zeynep. 2019. Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In CVPR. 50895098.Google ScholarGoogle Scholar
  11. [11] Fan Hehe, Zhu Linchao, Yang Yi, and Wu Fei. 2020. Recurrent attention network with reinforced generator for visual dialog. ACM Trans. Multimedia Comput. Commun. Applic. 16 (2020), 116.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. [12] Ferecatu Marin and Geman Donald. 2007. Interactive search for image categories by mental matching. In ICCV. 18.Google ScholarGoogle Scholar
  13. [13] Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. 2014. Generative adversarial nets. In NIPS. 26722680.Google ScholarGoogle Scholar
  14. [14] Gu Jiuxiang, Cai Jianfei, Joty Shafiq R., Niu Li, and Wang Gang. 2018. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR. 71817189.Google ScholarGoogle Scholar
  15. [15] Guo Xiaoxiao, Wu Hui, Cheng Yu, Rennie Steven, Tesauro Gerald, and Feris Rogerio. 2018. Dialog-based interactive image retrieval. In NIPS. 678688.Google ScholarGoogle Scholar
  16. [16] Han Xintong, Wu Zuxuan, Huang Phoenix X., Zhang Xiao, Zhu Menglong, Li Yuan, Zhao Yang, and Davis Larry S.. 2017. Automatic spatially aware fashion concept discovery. In ICCV. 14631471.Google ScholarGoogle Scholar
  17. [17] He Kaiming, Zhang X., Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In CVPR.770778.Google ScholarGoogle Scholar
  18. [18] Heim Eric. 2019. Constrained generative adversarial networks for interactive image generation. In CVPR. 1075310761.Google ScholarGoogle Scholar
  19. [19] Howard Andrew G., Zhu Menglong, Chen Bo, Kalenichenko D., Wang Weijun, Weyand Tobias, Andreetto M., and Adam Hartwig. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. ArXiv abs/1704.04861 (2017).Google ScholarGoogle Scholar
  20. [20] Isola Phillip, Lim Joseph J., and Adelson Edward H.. 2015. Discovering states and transformations in image collections. In CVPR. 13831391.Google ScholarGoogle Scholar
  21. [21] Jiang Xinyang, Wu Fei, Li Xi, Zhao Zhou, Lu Weiming, Tang Siliang, and Zhuang Yueting. 2015. Deep compositional cross-modal learning to rank via local-global alignment. In ACM MM. 6978.Google ScholarGoogle Scholar
  22. [22] Johnson Justin, Hariharan Bharath, Maaten Laurens van der, Fei-Fei Li, Zitnick C. Lawrence, and Girshick Ross. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR. 29012910.Google ScholarGoogle Scholar
  23. [23] Karpathy Andrej, Joulin Armand, and Fei-Fei Li F.. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS. 18891897.Google ScholarGoogle Scholar
  24. [24] Karras Tero, Laine Samuli, and Aila Timo. 2019. A style-based generator architecture for generative adversarial networks. In CVPR. 44014410.Google ScholarGoogle Scholar
  25. [25] Kim Jongseok, Yu Young-Sun, Kim Hoeseong, and Kim Gunhee. 2021. Dual compositional learning in interactive image retrieval. In AAAI. 17711779.Google ScholarGoogle Scholar
  26. [26] Kim Jin-Hwa, Lee Sang-Woo, Kwak Donghyun, Heo Min-Oh, Kim Jeonghee, Ha Jung-Woo, and Zhang Byoung-Tak. 2016. Multimodal residual learning for visual QA. In NIPS. 361369.Google ScholarGoogle Scholar
  27. [27] Kingma Diederik P. and Ba Jimmy. 2015. Adam: A method for stochastic optimization. In ICLR. 115.Google ScholarGoogle Scholar
  28. [28] Klein Benjamin and Wolf Lior. 2019. End-to-end supervised product quantization for image search and retrieval. In CVPR. 50415050.Google ScholarGoogle Scholar
  29. [29] Kovashka Adriana, Parikh Devi, and Grauman Kristen. 2012. WhittleSearch: Image search with relative attribute feedback. In CVPR. 29732980.Google ScholarGoogle Scholar
  30. [30] Lao Qicheng, Havaei Mohammad, Pesaranghader Ahmad, Dutil Francis, Jorio Lisa Di, and Fevens Thomas. 2019. Dual adversarial inference for text-to-image synthesis. In ICCV. 75677576.Google ScholarGoogle Scholar
  31. [31] Li Chuan-Xiang, Yan Ting-Kun, Luo Xin, Nie Liqiang, and Xu Xin-Shun. 2019. Supervised robust discrete multimodal hashing for cross-media retrieval. Trans. Multimedia 21, 11 (2019), 28632877.Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Li Kunpeng, Zhang Yulun, Li Kai, Li Yuanyuan, and Fu Yun. 2019. Visual semantic reasoning for image-text matching. In ICCV. 46544662.Google ScholarGoogle Scholar
  33. [33] Liu Jiawei, Zha Zheng-Jun, Hong Richang, Wang Meng, and Zhang Yongdong. 2019. Deep adversarial graph attention convolution network for text-based person search. In ACM MM. 665673.Google ScholarGoogle Scholar
  34. [34] Liu Xin, Hu Zhikai, Ling Haibin, and Cheung Yiu-ming. 2021. MTFH: A matrix tri-factorization hashing framework for efficient cross-modal retrieval. IEEE Trans. Pattern Anal. Machine Intell. 43, 3 (2021), 964–981.Google ScholarGoogle Scholar
  35. [35] Mao Qi, Lee Hsin-Ying, Tseng Hung-Yu, Ma Siwei, and Yang Ming-Hsuan. 2019. Mode seeking generative adversarial networks for diverse image synthesis. In CVPR. 14291437.Google ScholarGoogle Scholar
  36. [36] Min Weiqing, Jiang Shuqiang, Sang J., Wang Huayang, Liu Xinda, and Herranz Luis. 2017. Being a supercook: Joint food attributes and multimodal content modeling for recipe retrieval and exploration. IEEE Trans. Multimedia 19 (2017), 11001113.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. [37] Min Weiqing, Mei Shuhuan, Li Zhuo, and Jiang Shuqiang. 2020. A two-stage triplet network training framework for image retrieval. IEEE Trans. Multimedia 22 (2020), 31283138.Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. [38] Mirza Mehdi and Osindero Simon. 2014. Conditional generative adversarial nets. CoRR abs/1411.1784 (2014).Google ScholarGoogle Scholar
  39. [39] Murrugarra-Llerena Nils and Kovashka Adriana. 2019. Cross-modality personalization for retrieval. In CVPR. 64296438.Google ScholarGoogle Scholar
  40. [40] Nagarajan Tushar and Grauman Kristen. 2018. Attributes as operators: Factorizing unseen attribute-object compositions. In ECCV. 169185.Google ScholarGoogle Scholar
  41. [41] Nguyen Anh, Clune Jeff, Bengio Yoshua, Dosovitskiy Alexey, and Yosinski Jason. 2017. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR. 44674477.Google ScholarGoogle Scholar
  42. [42] Noh Hyeonwoo, Seo Paul Hongsuck, and Han Bohyung. 2016. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR. 3038.Google ScholarGoogle Scholar
  43. [43] Pang Kaiyue, Song Yi-Zhe, Xiang Tony, and Hospedales Timothy M. 2017. Cross-domain generative learning for fine-grained sketch-based image retrieval. In BMVC. 112.Google ScholarGoogle Scholar
  44. [44] Peng Yuxin and Qi Jinwei. 2019. CM-GANs: Cross-modal generative adversarial networks for common representation learning. ACM Trans. Multimedia Comput. Commun. Appl. 15, 1 (2019), 22:1–22:24.Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. [45] Perez Ethan, Strub Florian, Vries Harm De, Dumoulin Vincent, and Courville Aaron. 2018. FiLM: Visual reasoning with a general conditioning layer. In AAAI. 39423951.Google ScholarGoogle Scholar
  46. [46] Qiao Tingting, Zhang Jing, Xu Duanqing, and Tao Dacheng. 2019. MirrorGAN: Learning text-to-image generation by redescription. In CVPR. 15051514.Google ScholarGoogle Scholar
  47. [47] Reed Scott, Akata Zeynep, Yan Xinchen, Logeswaran Lajanugen, Schiele Bernt, and Lee Honglak. 2016. Generative adversarial text-to-image synthesis. In ICML. 10601069.Google ScholarGoogle Scholar
  48. [48] Russakovsky Olga, Deng Jia, Su Hao, Krause Jonathan, Satheesh Sanjeev, Ma Sean, Huang Zhiheng, Karpathy Andrej, Khosla et al. 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 3 (2015), 211252.Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. [49] Santoro Adam, Raposo David, Barrett David G., Malinowski Mateusz, Pascanu Razvan, Battaglia Peter, and Lillicrap Timothy. 2017. A simple neural network module for relational reasoning. In NIPS. 49674976.Google ScholarGoogle Scholar
  50. [50] Sarafianos Nikolaos, Xu Xiang, and Kakadiaris Ioannis A.. 2019. Adversarial representation learning for text-to-image matching. In ICCV. 58135823.Google ScholarGoogle Scholar
  51. [51] Schuster Mike and Paliwal Kuldip K.. 1997. Bidirectional recurrent neural networks. IEEE Trans. Sig. Process. 45, 11 (1997), 26732681.Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. [52] Song Ge, Wang Dong, and Tan Xiaoyang. 2019. Deep memory network for cross-modal retrieval. Trans. Multimedia 21, 5 (2019), 12611275.Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. [53] Song Yale and Soleymani Mohammad. 2019. Polysemous visual-semantic embedding for cross-modal retrieval. In CVPR. 19791988.Google ScholarGoogle Scholar
  54. [54] Szegedy Christian, Vanhoucke Vincent, Ioffe Sergey, Shlens Jon, and Wojna Zbigniew. 2016. Rethinking the inception architecture for computer vision. In CVPR. 28182826.Google ScholarGoogle Scholar
  55. [55] Tan Hongchen, Liu Xiuping, Li Xin, Zhang Yi, and Yin Baocai. 2019. Semantics-enhanced adversarial nets for text-to-image synthesis. In ICCV. 1050110510.Google ScholarGoogle Scholar
  56. [56] Maaten Laurens Van der and Hinton Geoffrey. 2008. Visualizing data using t-sne. J. Mach. Learn. Res. 9, Nov. (2008), 25792605.Google ScholarGoogle Scholar
  57. [57] Vinyals Oriol, Toshev Alexander, Bengio Samy, and Erhan Dumitru. 2015. Show and tell: A neural image caption generator. In CVPR. 31563164.Google ScholarGoogle Scholar
  58. [58] Vo Nam, Jiang Lu, Sun Chen, Murphy Kevin, Li Li-Jia, Fei-Fei Li, and Hays James. 2019. Composing text and image for image retrieval-an empirical odyssey. In CVPR. 64396448.Google ScholarGoogle Scholar
  59. [59] Wang Bokun, Yang Yang, Xu Xing, Hanjalic Alan, and Shen Heng Tao. 2017. Adversarial cross-modal retrieval. In ACM MM. 154162.Google ScholarGoogle Scholar
  60. [60] Wang Hao, Sahoo Doyen, Liu Chenghao, Lim Ee-peng, and Hoi Steven C. H.. 2019. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In CVPR. 1157211581.Google ScholarGoogle Scholar
  61. [61] Wang Liwei, Li Yin, Huang Jing, and Lazebnik Svetlana. 2019. Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Machine Intell. 41, 2 (2019), 394407.Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. [62] Wu Yuehua, Jiang Lu, and Yang Y.. 2020. Revisiting EmbodiedQA: A simple baseline and beyond. IEEE Trans. Image Process. 29 (2020), 39843992.Google ScholarGoogle ScholarCross RefCross Ref
  63. [63] Wu Yiling, Wang Shuhui, and Huang Qingming. 2017. Online asymmetric similarity learning for cross-modal retrieval. In CVPR. 42694278.Google ScholarGoogle Scholar
  64. [64] Wu Yiling, Wang Shuhui, and Huang Qingming. 2020. Online fast adaptive low-rank similarity learning for cross-modal retrieval. Trans. Multimedia 22, 5 (2020), 13101322.Google ScholarGoogle ScholarCross RefCross Ref
  65. [65] Xu Tao, Zhang Pengchuan, Huang Qiuyuan, Zhang Han, Gan Zhe, Huang Xiaolei, and He Xiaodong. 2018. AttnGAN: Fine-grained text-to-image generation with attentional generative adversarial networks. In CVPR. 13161324.Google ScholarGoogle Scholar
  66. [66] Yan Fei and Mikolajczyk Krystian. 2015. Deep correlation for matching images and text. In CVPR. 34413450.Google ScholarGoogle Scholar
  67. [67] Yin Guojun, Liu Bin, Sheng Lu, Yu Nenghai, Wang Xiaogang, and Shao Jing. 2019. Semantics disentangling for text-to-image generation. In CVPR. 23272336.Google ScholarGoogle Scholar
  68. [68] Zhang Da, Dai Xiyang, Wang Xin, Wang Yuan-Fang, and Davis Larry S.. 2019. MAN: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In CVPR. 12471257.Google ScholarGoogle Scholar
  69. [69] Zhang Feifei, Xu Mingliang, Mao Qirong, and Xu Changsheng. 2020. Joint attribute manipulation and modality alignment learning for composing text and image to image retrieval. In ACM MM. 33673376.Google ScholarGoogle Scholar
  70. [70] Zhang Han, Xu Tao, Li Hongsheng, Zhang Shaoting, Wang Xiaogang, Huang Xiaolei, and Metaxas Dimitris N.. 2017. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV. 59075915.Google ScholarGoogle Scholar
  71. [71] Zhang Han, Xu Tao, Li Hongsheng, Zhang Shaoting, Wang Xiaogang, Huang Xiaolei, and Metaxas Dimitris N.. 2018. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Machine Intell. 41, 8 (2018), 19471962.Google ScholarGoogle ScholarCross RefCross Ref
  72. [72] Zhang Jian and Peng Yuxin. 2020. Multi-pathway generative adversarial hashing for unsupervised cross-modal retrieval. Trans Multimedia 22, 1 (2020), 174187.Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. [73] Zhang Jingyi, Shen Fumin, Liu Li, Zhu Fan, Yu Mengyang, Shao Ling, Shen Heng Tao, and Gool Luc Van. 2018. Generative domain-migration hashing for sketch-to-image retrieval. In ECCV. 297314.Google ScholarGoogle Scholar
  74. [74] Zhang Lingling, Luo Minnan, Liu Jun, Chang Xiaojun, Yang Yi, and Hauptmann Alexander G.. 2020. Deep Top-k ranking for image-sentence matching. Trans. Multimedia 22, 3 (2020), 775785.Google ScholarGoogle ScholarCross RefCross Ref
  75. [75] Zhao Bo, Feng Jiashi, Wu Xiao, and Yan Shuicheng. 2017. Memory-augmented attribute manipulation networks for interactive fashion search. In CVPR. 15201528.Google ScholarGoogle Scholar
  76. [76] Zhen Liangli, Hu Peng, Wang Xu, and Peng Dezhong. 2019. Deep supervised cross-modal retrieval. In CVPR. 11394–10403.Google ScholarGoogle Scholar
  77. [77] Zhu Bin, Ngo Chong-Wah, Chen Jingjing, and Hao Yanbin. 2019. R2GAN: Cross-modal recipe retrieval with generative adversarial network. In CVPR. 1147711486.Google ScholarGoogle Scholar
  78. [78] Zhu Linchao and Yang Y.. 2020. ActBERT: Learning global-local video-text representations. CVPR. 87438752.Google ScholarGoogle Scholar


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2 (May 2022), 494 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3505207

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 4 March 2022
      • Accepted: 1 July 2021
      • Revised: 1 June 2021
      • Received: 1 July 2020
Published in TOMM Volume 18, Issue 2


      Qualifiers

      • research-article
      • Refereed
