research-article

Variational Autoencoder with CCA for Audio–Visual Cross-modal Retrieval

Published: 24 February 2023
Abstract

Cross-modal retrieval uses one modality as a query to retrieve data from another modality, and has become a popular topic in information retrieval, machine learning, and databases. Its major challenge is finding a method to effectively measure the similarity between data from different modalities. Although several works have computed the correlation between modalities by learning a common subspace representation, the encoders’ ability to extract features from multi-modal information remains unsatisfactory. In this article, we present a novel variational autoencoder architecture for audio–visual cross-modal retrieval that learns paired audio–visual correlation embeddings and category correlation embeddings as constraints to reinforce the mutuality of audio–visual information. On the one hand, an audio encoder and a visual encoder separately encode audio data and visual data into two different latent spaces, from which two mutual latent spaces are constructed by canonical correlation analysis. On the other hand, probabilistic modeling is used to handle possible noise and missing information in the data. In this way, cross-modal discrepancies arising from both intra-modal and inter-modal information are simultaneously reduced in the joint embedding subspace. We conduct extensive experiments on two benchmark datasets. The results confirm that the proposed architecture is effective in learning audio–visual correlation and appreciably outperforms existing cross-modal retrieval methods.
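The core idea in the abstract, encoding each modality into its own latent space and then building mutual spaces via canonical correlation analysis, can be illustrated with classical linear CCA. The sketch below is a minimal numpy implementation on toy features; it is not the paper's deep variational model, and all function and variable names (`cca`, `audio_feats`, `visual_feats`) are illustrative.

```python
import numpy as np

def cca(X, Y, k=2, eps=1e-8):
    """Classical linear CCA: find projections of X and Y into a shared
    k-dimensional space where the projected views are maximally correlated."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariance blocks, with a small ridge for numerical stability
    Sxx = X.T @ X / (n - 1) + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S is symmetric PSD)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    # Whiten each view, then SVD gives the canonical directions and correlations
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(T)
    A = inv_sqrt(Sxx) @ U[:, :k]      # projection for the first modality
    B = inv_sqrt(Syy) @ Vt.T[:, :k]   # projection for the second modality
    return A, B, s[:k]                # s[:k] are the canonical correlations

# Toy demo: two "modalities" sharing a 2-dimensional latent signal Z
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
audio_feats = np.hstack([Z, rng.normal(size=(500, 3))])
visual_feats = np.hstack([Z @ rng.normal(size=(2, 2)), rng.normal(size=(500, 4))])
A, B, corrs = cca(audio_feats, visual_feats, k=2)
print(corrs)  # canonical correlations; expected near 1 for the shared dimensions
```

In the paper's setting, the inputs to this correlation step are not raw features but the outputs of trained audio and visual encoders, and the correlation objective acts as a constraint on the learned latent spaces rather than a closed-form post-hoc projection.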



Published in
ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3s
June 2023, 270 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3582887
Editor: Abdulmotaleb El Saddik


Publisher
Association for Computing Machinery, New York, NY, United States

Publication History
• Published: 24 February 2023
• Online AM: 8 December 2022
• Accepted: 20 October 2022
• Revised: 27 July 2022
• Received: 3 December 2021
