Abstract
Deep learning has enabled a wide range of applications and has grown increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to build models that can process and relate information across multiple modalities. Despite extensive progress in unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning enables better understanding and analysis when multiple senses are engaged in processing information. This article covers multiple modality types, i.e., image, video, text, audio, body gestures, facial expressions, physiological signals, flow, RGB, pose, depth, mesh, and point cloud. It provides a detailed analysis of baseline approaches and an in-depth study of recent advances in multimodal deep learning applications over the past five years (2017 to 2021). A fine-grained taxonomy of multimodal deep learning methods is proposed, elaborating on different applications in depth. Finally, the main issues in each domain are highlighted, along with possible future research directions.