Survey

A Review on Methods and Applications in Multimodal Deep Learning

Published: 17 February 2023

Abstract

Deep learning has enabled a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and relate information from multiple modalities. Despite extensive progress in unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning enables better understanding and analysis when multiple senses are engaged in processing information. This article focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. It provides a detailed analysis of baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications over the past five years (2017 to 2021). A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications in greater depth. Finally, the main issues of each domain are highlighted separately, along with possible future research directions.
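To make the core idea of linking modalities concrete, the sketch below shows the simplest fusion scheme: encode each modality separately, then concatenate the embeddings before a shared classifier. This is an illustrative toy example in PyTorch, not code from the article; the feature dimensions, hidden size, and class count are arbitrary assumptions.

```python
# Minimal late-fusion sketch in PyTorch (illustrative only, not the paper's method).
# All dimensions are assumptions: 2048-d image features (e.g., from a CNN),
# 768-d text features (e.g., from a language model), 10 output classes.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=256, n_classes=10):
        super().__init__()
        # Each modality gets its own encoder before fusion.
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU())
        # The fused (concatenated) representation feeds one shared classifier head.
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_enc(img_feat), self.txt_enc(txt_feat)], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))  # batch of 4 examples
print(logits.shape)  # torch.Size([4, 10])
```

Concatenation is only the baseline fusion strategy; attention-based and tensor-based schemes of the kind this survey taxonomizes replace the simple concatenation step with learned cross-modal interactions.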




• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2s
  April 2023, 545 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3572861
  • Editor: Abdulmotaleb El Saddik


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 17 February 2023
        • Online AM: 27 October 2022
        • Accepted: 31 May 2022
        • Revised: 9 April 2022
        • Received: 8 December 2021
Published in TOMM, Volume 19, Issue 2s
