
Learning Shared Semantic Space with Correlation Alignment for Cross-Modal Event Retrieval

Published: 17 February 2020

Abstract

In this article, we propose to learn a shared semantic space with correlation alignment (S3CA) for multimodal data representations, which aligns the nonlinear correlations of multimodal data distributions in deep neural networks designed for heterogeneous data. In the context of cross-modal (event) retrieval, we design a neural network with convolutional layers and fully connected layers to extract features for images, including images on Flickr-like social media. Simultaneously, we exploit a fully connected neural network to extract semantic features for text documents, including news articles from news media. In particular, the nonlinear correlations of layer activations in the two neural networks are aligned with correlation alignment during joint training of the networks. Furthermore, we project the multimodal data into a shared semantic space for cross-modal (event) retrieval, where distances between heterogeneous data samples can be measured directly. In addition, we contribute a Wiki-Flickr Event dataset, in which the multimodal data samples do not describe each other in pairs, as in existing paired datasets, but all describe semantic events. Extensive experiments conducted on both paired and unpaired datasets demonstrate the effectiveness of S3CA, which outperforms state-of-the-art methods.
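As a concrete illustration of the correlation-alignment idea described above, the sketch below aligns the second-order statistics (covariances) of layer activations from an image branch and a text branch while both branches are trained to predict the same event labels. It is a minimal sketch in PyTorch, not the authors' implementation: the layer sizes, loss weight, toy random inputs, and helper names (coral_loss, Branch) are illustrative assumptions, and both branches are plain fully connected networks over precomputed features, whereas the paper's image branch uses convolutional plus fully connected layers.

```python
# Minimal sketch (assumptions, not the authors' code): two fully connected
# branches project image and text features into a shared semantic space,
# trained jointly with a classification loss per modality plus a correlation
# alignment (CORAL-style) term on the shared-space activations.
import torch
import torch.nn as nn

def coral_loss(h_a, h_b):
    """Frobenius distance between covariance matrices of two activation batches."""
    d = h_a.size(1)
    def cov(h):
        h = h - h.mean(dim=0, keepdim=True)
        return h.t() @ h / (h.size(0) - 1)
    diff = cov(h_a) - cov(h_b)
    return (diff * diff).sum() / (4.0 * d * d)

class Branch(nn.Module):
    """Fully connected branch mapping modality-specific features to the shared space."""
    def __init__(self, in_dim, hid_dim, shared_dim, num_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                     nn.Linear(hid_dim, shared_dim))
        self.classifier = nn.Linear(shared_dim, num_classes)

    def forward(self, x):
        z = self.encoder(x)               # embedding in the shared semantic space
        return z, self.classifier(z)

# Illustrative dimensions: 4096-d image features (e.g., from a CNN) and
# 3000-d text features (e.g., document vectors), 10 event classes.
img_branch = Branch(4096, 512, 128, 10)
txt_branch = Branch(3000, 512, 128, 10)
optimizer = torch.optim.Adam(
    list(img_branch.parameters()) + list(txt_branch.parameters()), lr=1e-4)
ce = nn.CrossEntropyLoss()

# Toy batch: random features sharing the same event labels.
x_img, x_txt = torch.randn(32, 4096), torch.randn(32, 3000)
y = torch.randint(0, 10, (32,))

z_img, logits_img = img_branch(x_img)
z_txt, logits_txt = txt_branch(x_txt)
loss = ce(logits_img, y) + ce(logits_txt, y) + 1.0 * coral_loss(z_img, z_txt)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At retrieval time, cross-modal distances (e.g., cosine) between z_img and
# z_txt can be measured directly in the shared space.
```

The design point the sketch tries to capture is that the alignment term acts on distributions of activations, not on paired samples, which is why it also applies to unpaired data such as the Wiki-Flickr Event setting described above.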


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 1 (February 2020), 363 pages.
ISSN: 1551-6857, EISSN: 1551-6865
Issue DOI: 10.1145/3384216
Copyright © 2020 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 May 2019
• Revised: 1 November 2019
• Accepted: 1 December 2019
• Published: 17 February 2020
