Abstract
Composing Text and Image to Image Retrieval (CTI-IR) is an emerging task in computer vision that retrieves images relevant to a query image paired with text describing desired modifications to that image. Most conventional cross-modal retrieval approaches take data of one modality as the query to retrieve relevant data of another modality. In contrast, in this article we propose an end-to-end trainable network for simultaneous image generation and CTI-IR. The proposed model is based on a Generative Adversarial Network (GAN) and has several merits. First, it learns a generative and discriminative feature for the query (a query image with a text description) by jointly training a generative model and a retrieval model. Second, it automatically manipulates the visual features of the reference image according to the text description through adversarial learning between the synthesized image and the target image. Third, global-local collaborative discriminators and attention-based generators are exploited, allowing our approach to focus on both the global and local differences between the query image and the target image. As a result, our model better enhances the semantic consistency and fine-grained details of the generated images. The generated image can also be used to interpret and empower our retrieval model. Quantitative and qualitative evaluations on three benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
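To make the retrieval side of the task concrete, the following is a minimal sketch of how a composed query (image feature plus text feature) could be matched against candidate images. It is not the network proposed in the article: the gated-residual composition, the function names `compose_query` and `rank_candidates`, the weight matrices, and the random features are all assumptions made for illustration, and the GAN-based generator and discriminators are omitted entirely.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def compose_query(image_feat, text_feat, w_gate, w_res):
    """Hypothetical composition: gate the image feature and add a text-driven residual.

    This stands in for whatever learned composition module a CTI-IR model uses.
    """
    joint = np.concatenate([image_feat, text_feat], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(joint @ w_gate)))  # sigmoid gate in (0, 1)
    residual = np.tanh(joint @ w_res)               # text-conditioned change
    return l2_normalize(gate * image_feat + residual)


def rank_candidates(query_feat, candidate_feats):
    """Return candidate indices sorted by decreasing cosine similarity, plus the scores."""
    sims = l2_normalize(candidate_feats) @ query_feat
    return np.argsort(-sims), sims


# Toy data: random vectors stand in for encoder outputs (assumption).
rng = np.random.default_rng(0)
dim = 8
image_feat = rng.normal(size=dim)
text_feat = rng.normal(size=dim)
w_gate = rng.normal(size=(2 * dim, dim)) * 0.1
w_res = rng.normal(size=(2 * dim, dim)) * 0.1

query = compose_query(image_feat, text_feat, w_gate, w_res)

candidates = rng.normal(size=(5, dim))
candidates[3] = query  # plant a "target image" feature identical to the query

order, sims = rank_candidates(query, candidates)
print("best match index:", order[0])
```

In a trained system the weights would be learned, for example with a ranking loss that pulls the composed query toward the target image feature; here they are random, so only the planted candidate is guaranteed to rank first.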
Index Terms
Tell, Imagine, and Search: End-to-end Learning for Composing Text and Image to Image Retrieval