
Learning Explicit and Implicit Dual Common Subspaces for Audio-visual Cross-modal Retrieval

Published: 17 February 2023

Abstract

Audio-visual tracks in videos contain rich semantic information with potential value for many applications and research directions. However, because audio and visual data follow inconsistent distributions and have heterogeneous representations, the resulting modality gap makes direct comparison between them impossible. To bridge this gap, a frequently adopted approach is to project audio-visual data simultaneously into a common subspace that captures both the commonalities and the characteristics of the modalities, a strategy that has been studied extensively in prior work on modality-common and modality-specific feature learning. Existing methods, however, struggle with the tradeoff between these two goals: for example, when the modality-common feature is learned from the latent commonalities of audio-visual data or from correlated features treated as aligned projections, the modality-specific feature can be lost. To resolve this tradeoff, we propose a novel end-to-end architecture that synchronously projects audio-visual data into explicit and implicit dual common subspaces. The explicit subspace learns modality-common features and reduces the modality gap of explicitly paired audio-visual data; representation-specific details are discarded there so that the common underlying structure of the audio-visual data is retained. The implicit subspace learns modality-specific features: each modality privately pulls apart the feature distances between different categories to maintain category-based distinctions, by minimizing the distance between audio-visual features and their corresponding labels. Comprehensive experimental results on two audio-visual datasets, VEGAS and AVE, demonstrate that our proposed model, which uses two different common subspaces for audio-visual cross-modal learning, is effective and significantly outperforms state-of-the-art cross-modal models that learn features from a single common subspace, improving average MAP by 4.30% on VEGAS and 2.30% on AVE.
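To make the dual-subspace idea concrete, the following is a minimal PyTorch sketch of how such an architecture could be organized, based only on the description in the abstract: one explicit (shared) projection head per modality whose paired outputs are pulled together, and one implicit (label-supervised) head per modality that preserves category distinctions. All names, dimensions, and loss weights here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the explicit/implicit dual-common-subspace idea.
# Dimensions, layer choices, and loss weights are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualSubspaceNet(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=1024,
                 common_dim=64, num_classes=10):
        super().__init__()
        # Explicit subspace: one projection per modality; explicitly paired
        # audio-visual inputs should land close together here.
        self.audio_explicit = nn.Linear(audio_dim, common_dim)
        self.visual_explicit = nn.Linear(visual_dim, common_dim)
        # Implicit subspace: per-modality heads supervised by category labels,
        # maintaining category-based distinctions within each modality.
        self.audio_implicit = nn.Linear(audio_dim, num_classes)
        self.visual_implicit = nn.Linear(visual_dim, num_classes)

    def forward(self, audio, visual):
        return (self.audio_explicit(audio), self.visual_explicit(visual),
                self.audio_implicit(audio), self.visual_implicit(visual))

def dual_subspace_loss(a_exp, v_exp, a_imp, v_imp, labels,
                       alpha=1.0, beta=1.0):
    # Explicit term: shrink the modality gap between paired projections.
    explicit = F.mse_loss(a_exp, v_exp)
    # Implicit term: pull each modality's projection toward its label,
    # which pulls apart features of different categories per modality.
    implicit = F.cross_entropy(a_imp, labels) + F.cross_entropy(v_imp, labels)
    return alpha * explicit + beta * implicit
```

Under this reading, retrieval-time similarity between an audio query and visual candidates would be measured in the explicit subspace (for instance by cosine or Euclidean distance), while the implicit heads serve as training-time supervision that keeps category structure from collapsing.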




• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2s (April 2023), 545 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3572861
  Editor: Abdulmotaleb El Saddik


Publisher

Association for Computing Machinery, New York, NY, United States

            Publication History

            • Published: 17 February 2023
            • Online AM: 22 September 2022
            • Accepted: 4 September 2022
            • Revised: 29 June 2022
            • Received: 15 January 2022
Published in TOMM Volume 19, Issue 2s
