Abstract
Cross-modal retrieval aims to retrieve data in one modality using a query from another modality, and has been a long-standing research topic in multimedia, information retrieval, computer vision, and databases. Most existing work focuses on cross-modal retrieval between text and images, text and video, or lyrics and audio. Little research addresses cross-modal retrieval between audio and video, owing to limited audio-video paired datasets and semantic information. The main challenge of the audio-visual cross-modal retrieval task is learning joint embeddings in a shared subspace for computing similarity across modalities, where new representations are generated to maximize the correlation between the audio and visual modalities. In this work, we propose TNN-C-CCA, a novel deep triplet neural network with cluster canonical correlation analysis, an end-to-end supervised learning architecture with an audio branch and a video branch. We not only consider the matching pairs in the common space but also exploit the mismatching pairs when maximizing the correlation. In particular, two significant contributions are made. First, a deep triplet neural network trained with a triplet loss learns optimal projections that yield better representations, maximizing correlation in the shared subspace. Second, both positive and negative examples are used during training to improve the quality of the embeddings learned between audio and video. Our experiments use fivefold cross validation and report average performance for audio-video cross-modal retrieval. The experimental results on two different audio-visual datasets show that the proposed two-branch learning architecture outperforms six existing canonical correlation analysis-based methods and four state-of-the-art cross-modal retrieval methods.
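The margin-based triplet loss mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy audio/video embeddings, variable names, and margin value are all hypothetical, and the paper's actual model learns these embeddings with deep networks combined with cluster CCA.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: pull the matching (positive) embedding
    toward the anchor while pushing the mismatching (negative) one away
    until it is at least `margin` farther from the anchor."""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-to-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-to-negative distance
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings in a shared audio-visual subspace (illustrative values)
audio = np.array([1.0, 0.0])             # anchor: an audio embedding
video_match = np.array([0.9, 0.1])       # video clip of the same semantic class
video_mismatch = np.array([-1.0, 0.0])   # video clip of a different class

# Well-separated triplet: the negative is already beyond the margin,
# so the loss is zero and no gradient signal is produced.
loss = triplet_loss(audio, video_match, video_mismatch)
```

Summing this loss over sampled (anchor, positive, negative) triplets gives the training objective that shapes the shared subspace before the correlation analysis is applied.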
Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval