CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning

Published: 07 February 2019

Abstract

It is known that the inconsistent distributions and representations of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate heterogeneous data and measure their similarities. Recently, generative adversarial networks (GANs) have been proposed and have shown a strong ability to model data distributions and learn discriminative representations. Adversarial learning can therefore be exploited to learn discriminative common representations that bridge the heterogeneity gap. Inspired by this, we aim to effectively correlate large-scale heterogeneous data of different modalities with the power of GANs to model the cross-modal joint distribution. In this article, we propose Cross-modal Generative Adversarial Networks (CM-GANs) with the following contributions. First, a cross-modal GAN architecture is proposed to model the joint distribution over the data of different modalities, so that inter-modality and intra-modality correlation can be explored simultaneously in the generative and discriminative models, which compete with each other to promote cross-modal correlation learning. Second, cross-modal convolutional autoencoders with a weight-sharing constraint are proposed to form the generative model. They not only exploit the cross-modal correlation for learning the common representations but also preserve reconstruction information to capture the semantic consistency within each modality. Third, a cross-modal adversarial training mechanism is proposed, which uses two kinds of discriminative models to conduct intra-modality and inter-modality discrimination simultaneously. The two mutually boost each other, making the generated common representations more discriminative through the adversarial training process. In summary, the proposed CM-GANs approach uses GANs to perform cross-modal common representation learning, by which heterogeneous data can be effectively correlated. Extensive experiments verify the performance of CM-GANs on cross-modal retrieval against 13 state-of-the-art methods on 4 cross-modal datasets.
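To make the described architecture concrete, below is a minimal PyTorch-style sketch of the design the abstract outlines: two modality-specific autoencoders whose top encoder layer is shared form the generative model, one intra-modality discriminator per modality separates original features from reconstructions, and an inter-modality discriminator separates image-origin from text-origin common representations. Everything here is an illustrative assumption rather than the paper's implementation: fully connected layers stand in for the convolutional autoencoders, and all dimensions, loss terms, and names (ModalityAutoencoder, D_inter, etc.) are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAutoencoder(nn.Module):
    """Encoder-decoder for one modality; the top encoder layer is shared."""
    def __init__(self, in_dim, hid_dim, common_dim, shared_top):
        super().__init__()
        self.enc_private = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.enc_shared = shared_top  # weight-sharing constraint across modalities
        self.dec = nn.Sequential(nn.Linear(common_dim, hid_dim), nn.ReLU(),
                                 nn.Linear(hid_dim, in_dim))

    def forward(self, x):
        z = self.enc_shared(self.enc_private(x))  # common representation
        return z, self.dec(z)                     # reconstruction keeps intra-modality semantics

def make_discriminator(in_dim):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.LeakyReLU(0.2),
                         nn.Linear(128, 1))

# Hypothetical feature dimensions: CNN image features and word-vector text features.
img_dim, txt_dim, hid, common = 4096, 300, 1024, 200
shared_top = nn.Linear(hid, common)            # tied top encoder layer
G_img = ModalityAutoencoder(img_dim, hid, common, shared_top)
G_txt = ModalityAutoencoder(txt_dim, hid, common, shared_top)
D_intra_img = make_discriminator(img_dim)      # original vs. reconstructed image feature
D_intra_txt = make_discriminator(txt_dim)      # original vs. reconstructed text feature
D_inter = make_discriminator(common)           # image-origin vs. text-origin representation

bce = nn.BCEWithLogitsLoss()
img = torch.randn(8, img_dim)                  # stand-in paired image/text features
txt = torch.randn(8, txt_dim)
ones, zeros = torch.ones(8, 1), torch.zeros(8, 1)

z_i, rec_i = G_img(img)
z_t, rec_t = G_txt(txt)

# Discriminative models: intra-modality (original vs. reconstruction) and
# inter-modality (which modality a common representation came from).
d_loss = (bce(D_intra_img(img), ones) + bce(D_intra_img(rec_i.detach()), zeros)
          + bce(D_intra_txt(txt), ones) + bce(D_intra_txt(rec_t.detach()), zeros)
          + bce(D_inter(z_i.detach()), ones) + bce(D_inter(z_t.detach()), zeros))

# Generative model: reconstruct each modality, fool all three discriminators,
# and pull paired common representations together (inter-modality correlation).
g_loss = (F.mse_loss(rec_i, img) + F.mse_loss(rec_t, txt)
          + bce(D_intra_img(rec_i), ones) + bce(D_intra_txt(rec_t), ones)
          + bce(D_inter(z_i), zeros) + bce(D_inter(z_t), ones)  # flipped labels
          + F.mse_loss(z_i, z_t))
```

In an actual training loop, d_loss and g_loss would be minimized alternately with separate optimizers, so the generative and discriminative models compete as in standard GAN training.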

