Abstract
Cross-modal retrieval between text and video has received sustained research interest in the multimedia community. Existing studies follow the trend of learning a joint embedding space in which the distance between text and video representations can be measured. In common practice, the video representation is constructed by feeding clips into a 3D convolutional neural network that extracts a coarse-grained global visual feature. Several studies have further attempted to align local objects in the video with the text. However, these representations share a common drawback: they neglect the rich fine-grained relation features that capture spatial-temporal object interactions and benefit the mapping of textual entities in real-world retrieval systems. To tackle this problem, we propose the adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that exploits both fine-grained local relation features and coarse-grained global features to bridge the text and video modalities. Building on this visual representation, we further integrate an adversarial learning strategy into AME-Net to narrow the domain gap between the text and video representations. In summary, we contribute AME-Net together with an adversarial learning strategy for learning a better joint embedding space; experimental results on the MSR-VTT and YouCook2 datasets demonstrate that our framework consistently outperforms state-of-the-art methods.
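The abstract does not specify the retrieval objective, but measuring distances in a joint text-video embedding space is commonly done with a bidirectional max-margin (triplet) ranking loss over a batch of matched pairs. The sketch below is illustrative only, not the AME-Net objective: the function name, the cosine-similarity choice, and the margin value are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere so that the dot
    product becomes cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def triplet_ranking_loss(text_emb, video_emb, margin=0.2):
    """Bidirectional max-margin ranking loss for a batch of matched
    (text, video) pairs already projected into the joint space.
    Row i of each array is assumed to be a matched pair."""
    t = l2_normalize(text_emb)
    v = l2_normalize(video_emb)
    sim = t @ v.T                 # cosine similarity matrix
    pos = np.diag(sim)            # matched pairs lie on the diagonal
    # Text-to-video direction: every other video in the batch is a negative.
    cost_t2v = np.maximum(0.0, margin + sim - pos[:, None])
    # Video-to-text direction: every other text in the batch is a negative.
    cost_v2t = np.maximum(0.0, margin + sim - pos[None, :])
    np.fill_diagonal(cost_t2v, 0.0)   # do not penalize the positives
    np.fill_diagonal(cost_v2t, 0.0)
    return (cost_t2v.sum() + cost_v2t.sum()) / len(pos)
```

With perfectly aligned orthonormal embeddings (e.g. identical identity matrices for both modalities) the loss is exactly zero, and it grows as matched pairs drift apart, which is the property a joint embedding space is trained to enforce.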
Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval