Research Article

Deep Triplet Neural Networks with Cluster-CCA for Audio-Visual Cross-Modal Retrieval

Published: 14 July 2020

Abstract

Cross-modal retrieval aims to retrieve data in one modality using a query from another, and has been an active research topic in multimedia, information retrieval, computer vision, and databases. Most existing work focuses on cross-modal retrieval between text and images, text and video, or lyrics and audio; comparatively little addresses retrieval between audio and video, owing to the scarcity of paired audio-video datasets and of semantic annotations. The main challenge of the audio-visual cross-modal retrieval task is learning joint embeddings in a shared subspace in which similarity can be computed across modalities, where the new representations should maximize the correlation between the audio and visual spaces. In this work, we propose TNN-C-CCA, a novel deep triplet neural network with cluster canonical correlation analysis: an end-to-end supervised learning architecture with an audio branch and a video branch. We consider not only the matching pairs in the common space but also the mismatching pairs when maximizing the correlation. Two contributions stand out. First, a deep triplet neural network trained with triplet loss learns projections that maximize correlation in the shared subspace, yielding better representations. Second, both positive and negative examples are used during training to improve the quality of the learned audio-video embeddings. We evaluate with fivefold cross-validation and report average performance for audio-video cross-modal retrieval. Experimental results on two audio-visual datasets show that the proposed two-branch architecture outperforms six existing canonical correlation analysis-based methods and four state-of-the-art cross-modal retrieval methods.
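The triplet-loss idea described in the abstract can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the `triplet_loss` function and the toy embeddings are hypothetical, and the actual TNN-C-CCA model couples an objective of this kind with cluster-CCA projections of the audio and video branches.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Squared Euclidean distances between the anchor (e.g., an audio
    # embedding) and its matching / mismatching video embeddings.
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    # Hinge: penalize triplets where the mismatching pair is not at
    # least `margin` farther from the anchor than the matching pair.
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

# Toy 2-D embeddings (hypothetical values for illustration only).
audio = np.array([[1.0, 0.0]])
video_match = np.array([[0.9, 0.1]])      # same semantic class
video_mismatch = np.array([[-1.0, 0.0]])  # different class

print(triplet_loss(audio, video_match, video_mismatch))   # 0.0: margin already satisfied
print(triplet_loss(audio, video_mismatch, video_match))   # large loss when pairs are swapped
```

Using both positive and negative examples in this way, rather than only maximizing correlation over matched pairs, is the second contribution the abstract highlights.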




• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3
  August 2020
  364 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3409646

        Copyright © 2020 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 14 July 2020
        • Online AM: 7 May 2020
        • Accepted: 1 March 2020
        • Revised: 1 February 2020
        • Received: 1 July 2019
Published in TOMM Volume 16, Issue 3


        Qualifiers

        • research-article
        • Research
        • Refereed
