Abstract
In this article, we propose learning a shared semantic space with correlation alignment (S3CA) for multimodal data representations, which aligns the nonlinear correlations of multimodal data distributions in deep neural networks designed for heterogeneous data. In the context of cross-modal (event) retrieval, we design a neural network with convolutional and fully connected layers to extract features from images, including images from Flickr-like social media. In parallel, we use a fully connected neural network to extract semantic features from text documents, including news articles from news media. During joint training, the nonlinear correlations of the layer activations in the two networks are aligned with correlation alignment. The multimodal data are then projected into a shared semantic space for cross-modal (event) retrieval, where distances between heterogeneous data samples can be measured directly. In addition, we contribute a Wiki-Flickr Event dataset in which samples of different modalities do not describe each other in pairs, as in existing paired datasets; instead, all samples describe semantic events. Extensive experiments on both paired and unpaired datasets demonstrate the effectiveness of S3CA, which outperforms state-of-the-art methods.
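The correlation alignment the abstract refers to follows the CORAL formulation (Sun et al., 2016): the covariance matrices of the activations from the two branches are matched under a squared Frobenius-norm penalty. Below is a minimal NumPy sketch of that loss, assuming batch-by-feature activation matrices; the function name `coral_loss` is ours for illustration, not taken from the paper.

```python
import numpy as np

def coral_loss(source, target):
    """Correlation alignment (CORAL) loss between two activation batches.

    source, target: (n_samples, d) activation matrices, e.g. from the
    image branch and the text branch. Returns ||C_s - C_t||_F^2 / (4 d^2),
    where C_s and C_t are the feature covariance matrices.
    """
    d = source.shape[1]

    def covariance(x):
        # Unbiased sample covariance of the feature dimensions.
        n = x.shape[0]
        xm = x - x.mean(axis=0, keepdims=True)
        return xm.T @ xm / (n - 1)

    diff = covariance(source) - covariance(target)
    return np.sum(diff ** 2) / (4 * d ** 2)

# Identically distributed activations give zero loss.
rng = np.random.default_rng(0)
a = rng.standard_normal((256, 64))
print(coral_loss(a, a))  # 0.0 for identical inputs
```

In S3CA-style joint training, a penalty of this form would be added to the supervised losses of both branches and backpropagated through them; the sketch shows only the loss computation itself.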
Learning Shared Semantic Space with Correlation Alignment for Cross-Modal Event Retrieval