Abstract
The audio and visual tracks of a video carry rich semantic information with potential value for many applications and research directions. However, because audio and visual data follow inconsistent distributions and are represented heterogeneously, the resulting modality gap makes them impossible to compare directly. To bridge this gap, a frequently adopted approach is to project audio-visual data simultaneously into a common subspace that captures both the commonalities and the characteristics of the modalities for measurement; previous research has studied this extensively as the problems of modality-common and modality-specific feature learning. However, existing methods struggle with the tradeoff between the two: for example, when the modality-common feature is learned from the latent commonalities of audio-visual data, or from correlated features as aligned projections, the modality-specific feature can be lost. To resolve this tradeoff, we propose a novel end-to-end architecture that synchronously projects audio-visual data into explicit and implicit dual common subspaces. The explicit subspace learns modality-common features and reduces the modality gap of explicitly paired audio-visual data; representation-specific details are discarded so that the common underlying structure of the audio-visual data is retained. The implicit subspace learns modality-specific features: by minimizing the distance between audio-visual features and their corresponding labels, each modality independently pulls apart the features of different categories and thus maintains category-based distinctions. Comprehensive experiments on two audio-visual datasets, VEGAS and AVE, demonstrate that learning from two different common subspaces for audio-visual cross-modal learning is effective: our model significantly outperforms state-of-the-art cross-modal models that learn features from a single common subspace, improving the average MAP by 4.30% on VEGAS and 2.30% on AVE.
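The abstract describes two complementary training objectives: an explicit-subspace loss that pulls paired audio and visual projections together, and an implicit-subspace loss that ties each modality's features to their category labels. The snippet below is a minimal PyTorch-style sketch of that idea under stated assumptions; the module names, linear projections, one-hot label targets, and weighting coefficients are illustrative choices, not the authors' actual implementation.

```python
# Hedged sketch of the dual-subspace idea described in the abstract.
# All names (DualSubspaceModel, alpha, beta, ...) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualSubspaceModel(nn.Module):
    def __init__(self, audio_dim, visual_dim, common_dim, num_classes):
        super().__init__()
        # Explicit subspace: shared-dimensional projections for paired data.
        self.audio_explicit = nn.Linear(audio_dim, common_dim)
        self.visual_explicit = nn.Linear(visual_dim, common_dim)
        # Implicit subspace: modality-specific projections supervised by labels.
        self.audio_implicit = nn.Linear(audio_dim, num_classes)
        self.visual_implicit = nn.Linear(visual_dim, num_classes)

    def forward(self, audio, visual):
        ea = self.audio_explicit(audio)    # modality-common audio feature
        ev = self.visual_explicit(visual)  # modality-common visual feature
        ia = self.audio_implicit(audio)    # modality-specific audio feature
        iv = self.visual_implicit(visual)  # modality-specific visual feature
        return ea, ev, ia, iv


def dual_subspace_loss(ea, ev, ia, iv, labels, alpha=1.0, beta=1.0):
    """Combine the two objectives sketched in the abstract."""
    # Explicit subspace: pull explicitly paired audio/visual projections
    # together to reduce the modality gap.
    explicit_loss = F.mse_loss(ea, ev)
    # Implicit subspace: push each modality's features toward their
    # one-hot label vectors, preserving category-based distinctions.
    one_hot = F.one_hot(labels, num_classes=ia.size(1)).float()
    implicit_loss = F.mse_loss(ia, one_hot) + F.mse_loss(iv, one_hot)
    return alpha * explicit_loss + beta * implicit_loss
```

How the two subspaces are combined at retrieval time is not specified in the abstract, so the sketch only covers the training losses; the distance functions and the balance between `alpha` and `beta` are assumptions.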