Abstract
Multimodality representation learning, a technique for learning to embed information from different modalities and their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision-Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., to understand, recognize, retrieve, or generate. Researchers have proposed diverse methods to address these tasks, and variants of transformer-based architectures in particular have performed extraordinarily well across multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. It summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) state-of-the-art pretrained multimodal approaches and unifying architectures, and (iv) multimodal task categories and possible future improvements for better multimodal learning. Moreover, we prepare a dataset section for new researchers that covers most of the benchmarks used for pretraining and fine-tuning. Finally, major challenges, gaps, and potential research topics are explored. A constantly updated list of papers related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.