Abstract
It is known that the inconsistent distributions and representations of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate heterogeneous data and measure their similarities. Recently, generative adversarial networks (GANs) have shown a strong ability to model data distributions and learn discriminative representations, and adversarial learning has been shown to be effective for learning discriminative common representations that bridge the heterogeneity gap. Inspired by this, we aim to effectively correlate large-scale heterogeneous data of different modalities by using GANs to model the cross-modal joint distribution. In this article, we propose Cross-modal Generative Adversarial Networks (CM-GANs) with the following contributions. First, a cross-modal GAN architecture is proposed to model the joint distribution over data of different modalities. Inter-modality and intra-modality correlation can be explored simultaneously in the generative and discriminative models, which compete with each other to promote cross-modal correlation learning. Second, cross-modal convolutional autoencoders with a weight-sharing constraint are proposed to form the generative model. They not only exploit cross-modal correlation for learning common representations but also preserve reconstruction information for capturing semantic consistency within each modality. Third, a cross-modal adversarial training mechanism is proposed, which uses two kinds of discriminative models to conduct intra-modality and inter-modality discrimination simultaneously. The two discriminators mutually boost each other, making the generated common representations more discriminative through adversarial training. In summary, the proposed CM-GANs approach uses GANs to perform cross-modal common representation learning, by which heterogeneous data can be effectively correlated. Extensive experiments verify the performance of CM-GANs on cross-modal retrieval against 13 state-of-the-art methods on 4 cross-modal datasets.
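To make the two ideas in the abstract concrete — modality-specific encoders tied by a weight-sharing top layer, and an inter-modality discriminator that the generators try to fool — here is a minimal NumPy sketch. All dimensions, the random feature inputs, and the fixed toy discriminator are hypothetical illustrations, not the paper's actual convolutional autoencoder architecture or trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Modality-specific encoder weights (hypothetical dimensions:
# 4096-d image features, 300-d text features, 256-d common space).
W_img = rng.standard_normal((4096, 1024)) * 0.01   # image branch
W_txt = rng.standard_normal((300, 1024)) * 0.01    # text branch
W_shared = rng.standard_normal((1024, 256)) * 0.01 # weight-sharing layer

def encode(x, W_branch):
    # A branch-specific layer followed by the layer shared across both
    # modalities; sharing the top weights ties the two common
    # representations into one space.
    h = np.tanh(x @ W_branch)
    return np.tanh(h @ W_shared)

img_feat = rng.standard_normal((8, 4096))  # stand-in image features
txt_feat = rng.standard_normal((8, 300))   # stand-in text features
z_img = encode(img_feat, W_img)
z_txt = encode(txt_feat, W_txt)

def inter_modality_discriminator(z):
    # Toy fixed scorer: probability that a common representation came
    # from the image pathway; in CM-GANs this is a trained network.
    w = np.ones(256) / 256.0
    return 1.0 / (1.0 + np.exp(-(z @ w)))

p_img = inter_modality_discriminator(z_img)
p_txt = inter_modality_discriminator(z_txt)
# Adversarial objective on the generator side: make image- and
# text-derived common representations indistinguishable, i.e. reduce
# the discriminator's ability to tell the two modalities apart.
adv_loss = -np.mean(np.log(p_img + 1e-8) + np.log(1.0 - p_txt + 1e-8))
print(z_img.shape, z_txt.shape)
```

In the full model this loss would be minimized over the encoder weights while the discriminator is trained to maximize it, alongside the intra-modality discrimination and reconstruction terms described above.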
CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning