Abstract
Cross-modal retrieval uses one modality as a query to retrieve data of another modality; it has become a popular topic in information retrieval, machine learning, and databases. The major challenge of cross-modal retrieval is to effectively measure the similarity between data of different modalities. Although several works have modeled the correlation between modalities by learning a common subspace representation, the encoder's ability to extract features from multi-modal information remains unsatisfactory. In this article, we present a novel variational autoencoder architecture for audio–visual cross-modal retrieval that learns paired audio–visual correlation embeddings and category correlation embeddings as constraints to reinforce the mutuality of audio–visual information. On the one hand, an audio encoder and a visual encoder separately encode audio data and visual data into two different latent spaces, from which two mutual latent spaces are constructed by canonical correlation analysis. On the other hand, probabilistic modeling is used to handle possible noise and missing information in the data. In this way, the cross-modal discrepancies arising from intra-modal and inter-modal information are eliminated simultaneously in the joint embedding subspace. We conduct extensive experiments on two benchmark datasets. The experimental results confirm that the proposed architecture is effective in learning audio–visual correlations and is appreciably better than existing cross-modal retrieval methods.
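The canonical correlation analysis (CCA) constraint mentioned above can be illustrated with a minimal NumPy sketch of classical linear CCA. This is an illustration only, not the paper's deep formulation (which applies the constraint to encoder outputs); the function names and the regularization constant `eps` are our own assumptions:

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def canonical_correlations(X, Y, eps=1e-6):
    """Canonical correlations between paired embeddings X (n x dx) and Y (n x dy)."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    # Regularized covariance and cross-covariance matrices.
    Sxx = Xc.T @ Xc / (n - 1) + eps * np.eye(Xc.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + eps * np.eye(Yc.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
Y = X @ rng.standard_normal((4, 4))   # two perfectly linearly related "views"
corrs = canonical_correlations(X, Y)  # all canonical correlations near 1.0
```

In a deep variant, the same quantity is computed on mini-batches of encoder outputs and its negation is used as a loss term, so that the two encoders are pushed toward maximally correlated latent spaces.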
Variational Autoencoder with CCA for Audio–Visual Cross-modal Retrieval