
Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

Published: 23 October 2023

Abstract

Multimodality Representation Learning, the technique of learning to embed information from different modalities and their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision-Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., to understand, recognize, retrieve, or generate. Researchers have proposed diverse methods to address these tasks, and variants of transformer-based architectures have performed extraordinarily well across multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. It summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) state-of-the-art pretrained multimodal approaches through to unifying architectures, and (iv) multimodal task categories and possible future improvements for better multimodal learning. Moreover, we prepare a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, major challenges, gaps, and potential research topics are explored. A constantly updated paper list related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.

REFERENCES

  1. [1] Summaira Jabeen, Li Xi, Shoib Amin Muhammad, Li Songyuan, and Abdul Jabbar. 2021. Recent advances and trends in multimodal deep learning: A review. arXiv preprint arXiv:2105.11087 (2021).Google ScholarGoogle Scholar
  2. [2] Baltrušaitis Tadas, Ahuja Chaitanya, and Morency Louis-Philippe. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (2018), 423443.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. [3] Shi Bowen, Hsu Wei-Ning, Lakhotia Kushal, and Mohamed Abdelrahman. 2022. Learning audio-visual speech representation by masked multimodal cluster prediction. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  4. [4] Dimitrov Dimitar, Ali Bishr Bin, Shaar Shaden, Alam Firoj, Silvestri Fabrizio, Firooz Hamed, Nakov Preslav, and Martino Giovanni Da San. 2021. Detecting propaganda techniques in memes. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 66036617.Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C. Lawrence, and Parikh Devi. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 24252433.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. [6] Guo Wenzhong, Wang Jianwen, and Wang Shiping. 2019. Deep multimodal representation learning: A survey. IEEE Access 7 (2019), 6337363394.Google ScholarGoogle ScholarCross RefCross Ref
  7. [7] Bayoudh Khaled, Knani Raja, Hamdaoui Fayçal, and Mtibaa Abdellatif. 2021. A survey on deep multimodal learning for computer vision: Advances, trends, applications, and datasets. The Visual Computer (2021), 132.Google ScholarGoogle Scholar
  8. [8] Li Liunian Harold, Yatskar Mark, Yin Da, Hsieh Cho-Jui, and Chang Kai-Wei. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019).Google ScholarGoogle Scholar
  9. [9] Lu Jiasen, Batra Dhruv, Parikh Devi, and Lee Stefan. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32 (2019).Google ScholarGoogle Scholar
  10. [10] Lu Jiasen, Goswami Vedanuj, Rohrbach Marcus, Parikh Devi, and Lee Stefan. 2020. 12-in-1: Multi-task vision and language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1043710446.Google ScholarGoogle ScholarCross RefCross Ref
  11. [11] McGurk Harry and MacDonald John. 1976. Hearing lips and seeing voices. Nature 264, 5588 (1976), 746748.Google ScholarGoogle ScholarCross RefCross Ref
  12. [12] Atrey Pradeep K., Hossain M. Anwar, Saddik Abdulmotaleb El, and Kankanhalli Mohan S.. 2010. Multimodal fusion for multimedia analysis: A survey. Multimedia Systems 16 (2010), 345379.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. [13] Evangelopoulos Georgios, Zlatintsi Athanasia, Potamianos Alexandros, Maragos Petros, Rapantzikos Konstantinos, Skoumas Georgios, and Avrithis Yannis. 2013. Multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention. IEEE Transactions on Multimedia 15, 7 (2013), 15531568.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. [14] Lienhart Rainer W.. 1998. Comparison of automatic shot boundary detection algorithms. In Storage and Retrieval for Image and Video Databases VII, Vol. 3656. SPIE, 290301.Google ScholarGoogle Scholar
  15. [15] Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, Guillaume Lathoud, Mike Lincoln, Masson Agnes Lisowska, Iain McCowan, Wilfried Post, Dennis Reidsma, and Pierre D. Wellner. 2006. The AMI meeting corpus: A pre-announcement. In Machine Learning for Multimodal Interaction: Second International Workshop (MLMI 2005, Edinburgh, UK, July 11-13, 2005, Revised Selected Papers 2), Springer, 28–39.Google ScholarGoogle Scholar
  16. [16] McKeown Gary, Valstar Michel F., Cowie Roderick, and Pantic Maja. 2010. The SEMAINE corpus of emotionally coloured character interactions. In 2010 IEEE International Conference on Multimedia and Expo. IEEE, 10791084.Google ScholarGoogle ScholarCross RefCross Ref
  17. [17] Schuller Björn, Valstar Michel, Eyben Florian, McKeown Gary, Cowie Roddy, and Pantic Maja. 2011. AVEC 2011–the first international audio/visual emotion challenge. In Affective Computing and Intelligent Interaction: Fourth International Conference, ACII 2011, Memphis, TN, USA, October 9–12, 2011, Proceedings, Part II. Springer, 415424.Google ScholarGoogle ScholarCross RefCross Ref
  18. [18] Valstar Michel, Schuller Björn, Smith Kirsty, Almaev Timur, Eyben Florian, Krajewski Jarek, Cowie Roddy, and Pantic Maja. 2014. AVEC 2014: 3D dimensional affect and depression recognition challenge. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. 310.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. [19] Chen Guanzheng, Liu Fangyu, Meng Zaiqiao, and Liang Shangsong. 2022. Revisiting parameter-efficient tuning: Are we really there yet?. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 26122626.Google ScholarGoogle ScholarCross RefCross Ref
  20. [20] Rasiwasia Nikhil, Pereira Jose Costa, Coviello Emanuele, Doyle Gabriel, Lanckriet Gert R. G., Levy Roger, and Vasconcelos Nuno. 2010. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia. 251260.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. [21] Habibian Amirhossein, Mensink Thomas, and Snoek Cees G. M.. 2016. Video2vec embeddings recognize events when examples are scarce. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 10 (2016), 20892103.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. [22] LeCun Yann, Bengio Yoshua, and Hinton Geoffrey. 2015. Deep learning. Nature 521, 7553 (2015), 436444.Google ScholarGoogle ScholarCross RefCross Ref
  23. [23] Simonyan Karen and Zisserman Andrew. 2015. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (2015).Google ScholarGoogle Scholar
  24. [24] He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770778.Google ScholarGoogle ScholarCross RefCross Ref
  25. [25] Antol Stanislaw, Agrawal Aishwarya, Lu Jiasen, Mitchell Margaret, Batra Dhruv, Zitnick C. Lawrence, and Parikh Devi. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 24252433.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. [26] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 17351780.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. [27] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).Google ScholarGoogle Scholar
  28. [28] Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. 2022. Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision 14, 3-4 (2022), 163–352.Google ScholarGoogle Scholar
  29. [29] Jabri Allan, Joulin Armand, and Maaten Laurens van der. 2016. Revisiting visual question answering baselines. In European Conference on Computer Vision. Springer, 727739.Google ScholarGoogle ScholarCross RefCross Ref
  30. [30] Nguyen Duy-Kien and Okatani Takayuki. 2018. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 60876096.Google ScholarGoogle ScholarCross RefCross Ref
  31. [31] Hu Ronghang, Rohrbach Anna, Darrell Trevor, and Saenko Kate. 2019. Language-conditioned graph networks for relational reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1029410303.Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Gao Peng, Jiang Zhengkai, You Haoxuan, Lu Pan, Hoi Steven C. H., Wang Xiaogang, and Li Hongsheng. 2019. Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 66396648.Google ScholarGoogle ScholarCross RefCross Ref
  33. [33] Bao Hangbo, Wang Wenhui, Dong Li, Liu Qiang, Mohammed Owais Khan, Aggarwal Kriti, Som Subhojit, Piao Songhao, and Wei Furu. 2022. VLMo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems 35 (2022), 3289732912.Google ScholarGoogle Scholar
  34. [34] Kim Wonjae, Son Bokyung, and Kim Ildoo. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning. PMLR, 55835594.Google ScholarGoogle Scholar
  35. [35] Kenton Jacob Devlin Ming-Wei Chang and Toutanova Lee Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT. 41714186.Google ScholarGoogle Scholar
  36. [36] Krizhevsky Alex, Sutskever Ilya, and Hinton Geoffrey E.. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 8490.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. [37] Baevski Alexei, Zhou Yuhao, Mohamed Abdelrahman, and Auli Michael. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33 (2020), 1244912460.Google ScholarGoogle Scholar
  38. [38] Rahate Anil, Walambe Rahee, Ramanna Sheela, and Kotecha Ketan. 2022. Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions. Information Fusion 81 (2022), 203239.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. [39] Gao Jing, Li Peng, Chen Zhikui, and Zhang Jianing. 2020. A survey on deep learning for multimodal data fusion. Neural Computation 32, 5 (2020), 829864.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. [40] Chen Fei-Long, Zhang Du-Zhen, Han Ming-Lun, Chen Xiu-Yi, Shi Jing, Xu Shuang, and Xu Bo. 2023. VLP: A survey on vision-language pre-training. Machine Intelligence Research 20, 1 (2023), 3856. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  41. [41] Du Yifan, Liu Zikang, Li Junyi, and Zhao Wayne Xin. 2022. A survey of vision-language pre-trained models. In Proceedings of the Thirty-first International Joint Conference on Artificial Intelligence (IJCAI-22) Survey Track.Google ScholarGoogle ScholarCross RefCross Ref
  42. [42] Long Siqu, Cao Feiqi, Han Soyeon Caren, and Yang Haiqin. 2022. Vision-and-language pretrained models: A survey. In Proceedings of the Thirty-first International Joint Conference on Artificial Intelligence, IJCAI 2022, Raedt Luc De (Ed.). ijcai.org, 55305537.Google ScholarGoogle ScholarCross RefCross Ref
  43. [43] Ramachandram Dhanesh and Taylor Graham W.. 2017. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine 34, 6 (2017), 96108.Google ScholarGoogle ScholarCross RefCross Ref
  44. [44] Mogadala Aditya, Kalimuthu Marimuthu, and Klakow Dietrich. 2021. Trends in integration of vision and language research: A survey of tasks, datasets, and methods. J. Artif. Int. Res. 71 (Sep.2021), 11831317. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. [45] Wu Qi, Teney Damien, Wang Peng, Shen Chunhua, Dick Anthony, and Hengel Anton van den. 2017. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding 163 (2017), 2140.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. [46] Kalyan Katikapalli Subramanyam, Rajasekharan Ajit, and Sangeetha Sivanesan. 2022. AMMU: A survey of transformer-based biomedical pretrained language models. Journal of Biomedical Informatics 126 (2022), 103982.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. [47] Xiao Kejing, Qian Zhaopeng, and Qin Biao. 2022. A survey of data representation for multi-modality event detection and evolution. Applied Sciences 12, 4 (2022), 2204.Google ScholarGoogle ScholarCross RefCross Ref
  48. [48] Stappen Lukas, Baird Alice, Schumann Lea, and Bjorn Schuller. 2021. The multimodal sentiment analysis in car reviews (muse-car) dataset: Collection, insights and improvements. IEEE Transactions on Affective Computing (2021).Google ScholarGoogle Scholar
  49. [49] Chandrasekaran Ganesh, Nguyen Tu N., and D Jude Hemanth. 2021. Multimodal sentimental analysis for social media applications: A comprehensive review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 11, 5 (2021), e1415.Google ScholarGoogle ScholarCross RefCross Ref
  50. [50] Qiao Yanyuan, Deng Chaorui, and Wu Qi. 2021. Referring expression comprehension: A survey of methods and datasets. IEEE Transactions on Multimedia 23 (2021), 44264440. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  51. [51] Wu Jie and Hu Haifeng. 2017. Cascade recurrent neural network for image caption generation. Electronics Letters 53, 25 (2017), 16421643.Google ScholarGoogle ScholarCross RefCross Ref
  52. [52] Ji Jiayi, Luo Yunpeng, Sun Xiaoshuai, Chen Fuhai, Luo Gen, Wu Yongjian, Gao Yue, and Ji Rongrong. 2021. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 16551663.Google ScholarGoogle ScholarCross RefCross Ref
  53. [53] Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollár Piotr, and Zitnick C. Lawrence. 2014. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740755.Google ScholarGoogle ScholarCross RefCross Ref
  54. [54] Ngiam Jiquan, Khosla Aditya, Kim Mingyu, Nam Juhan, Lee Honglak, and Ng Andrew Y.. 2011. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML’11). 689696.Google ScholarGoogle Scholar
  55. [55] Vincent Pascal, Larochelle Hugo, Bengio Yoshua, and Manzagol Pierre-Antoine. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning. 10961103.Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. [56] Zhu Xiangru, Li Zhixu, Wang Xiaodan, Jiang Xueyao, Sun Penglei, Wang Xuwu, Xiao Yanghua, and Yuan Nicholas Jing. 2022. Multi-modal knowledge graph construction and application: A survey. IEEE Transactions on Knowledge and Data Engineering (2022).Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. [57] Li Yujia, Zemel Richard, Brockschmidt Marc, and Tarlow Daniel. 2016. Gated graph sequence neural networks. In Proceedings of International Conference on Learning Representations.Google ScholarGoogle Scholar
  58. [58] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. Stat 1050, 20 (2017), 10–48550.Google ScholarGoogle Scholar
  59. [59] Yin Yongjing, Meng Fandong, Su Jinsong, Zhou Chulun, Yang Zhengyuan, Zhou Jie, and Luo Jiebo. 2020. A novel graph-based multi-modal fusion encoder for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 30253035.Google ScholarGoogle ScholarCross RefCross Ref
  60. [60] Gao Difei, Li Ke, Wang Ruiping, Shan Shiguang, and Chen Xilin. 2020. Multi-modal graph neural network for joint reasoning on vision and scene text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1274612756.Google ScholarGoogle ScholarCross RefCross Ref
  61. [61] Wang Yanan, Yasunaga Michihiro, Ren Hongyu, Wada Shinya, and Leskovec Jure. 2022. VQA-GNN: Reasoning with multimodal semantic graph for visual question answering. arXiv preprint arXiv:2205.11501 (2022).Google ScholarGoogle Scholar
  62. [62] Jiang Weitao, Li Xiying, Hu Haifeng, Lu Qiang, and Liu Bohong. 2021. Multi-gate attention network for image captioning. IEEE Access 9 (2021), 6970069709. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  63. [63] Chen Kan, Bui Trung, Fang Chen, Wang Zhaowen, and Nevatia Ram. 2017. AMC: Attention guided multi-modal correlation learning for image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 26442652.Google ScholarGoogle ScholarCross RefCross Ref
  64. [64] Wang Xin, Chen Wenhu, Wu Jiawei, Wang Yuan-Fang, and Wang William Yang. 2018. Video captioning via hierarchical reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 42134222.Google ScholarGoogle ScholarCross RefCross Ref
  65. [65] Fang Jinyuan, Liang Shangsong, Meng Zaiqiao, and Zhang Qiang. 2021. Gaussian process with graph convolutional kernel for relational learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 353363.Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. [66] Chen Guanzheng, Fang Jinyuan, Meng Zaiqiao, Zhang Qiang, and Liang Shangsong. 2022. Multi-relational graph representation learning with Bayesian Gaussian process network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 55305538.Google ScholarGoogle ScholarCross RefCross Ref
  67. [67] Zhang Pengchuan, Li Xiujun, Hu Xiaowei, Yang Jianwei, Zhang Lei, Wang Lijuan, Choi Yejin, and Gao Jianfeng. 2021. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 55795588.Google ScholarGoogle ScholarCross RefCross Ref
  68. [68] Lin Junyang, Men Rui, Yang An, Zhou Chang, Zhang Yichang, Wang Peng, Zhou Jingren, Tang Jie, and Yang Hongxia. 2021. M6: Multi-modality-to-multi-modality multitask mega-transformer for unified pretraining. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 32513261.Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. [69] Clark Kevin, Luong Minh-Thang, Le Quoc V., and Manning Christopher D.. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the International Conference on Learning Representations. OpenReview.net.Google ScholarGoogle Scholar
  70. [70] Liu Yinhan, Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis Mike, Zettlemoyer Luke, and Stoyanov Veselin. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).Google ScholarGoogle Scholar
  71. [71] Lee Jinhyuk, Yoon Wonjin, Kim Sungdong, Kim Donghyeon, Kim Sunkyu, So Chan Ho, and Kang Jaewoo. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 12341240.Google ScholarGoogle ScholarCross RefCross Ref
  72. [72] Caselli Tommaso, Basile Valerio, Mitrović Jelena, and Granitzer Michael. 2021. HateBERT: Retraining BERT for abusive language detection in English. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH’21). 1725.Google ScholarGoogle ScholarCross RefCross Ref
  73. [73] Chi Zewen, Dong Li, Wei Furu, Yang Nan, Singhal Saksham, Wang Wenhui, Song Xia, Mao Xian-Ling, Huang He-Yan, and Zhou Ming. 2021. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 35763588.Google ScholarGoogle ScholarCross RefCross Ref
  74. [74] Wada Shoya, Takeda Toshihiro, Manabe Shiro, Konishi Shozo, Kamohara Jun, and Matsumura Yasushi. 2020. Pre-training technique to localize medical BERT and enhance biomedical BERT. arXiv preprint arXiv:2005.07202 (2020).Google ScholarGoogle Scholar
  75. [75] Gururangan Suchin, Marasović Ana, Swayamdipta Swabha, Lo Kyle, Beltagy Iz, Downey Doug, and Smith Noah A.. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 83428360.Google ScholarGoogle ScholarCross RefCross Ref
  76. [76] Yujia Qin, Yankai Lin, Jing Yi, Jiajie Zhang, Xu Han, Zhengyan Zhang, Yusheng Su, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2022. Knowledge Inheritance for Pre-trained Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 3921–3937.Google ScholarGoogle Scholar
  77. [77] Radford Alec, Narasimhan Karthik, Salimans Tim, and Sutskever Ilya. 2018. Improving language understanding by generative pre-training. The University of British Columbia (2018).Google ScholarGoogle Scholar
  78. [78] Panda Subhadarshi, Agrawal Anjali, Ha Jeewon, and Bloch Benjamin. 2021. Shuffled-token detection for refining pre-trained RoBERTa. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. 8893.Google ScholarGoogle ScholarCross RefCross Ref
  79. [79] Zhan Xunlin, Wu Yangxin, Dong Xiao, Wei Yunchao, Lu Minlong, Zhang Yichi, Xu Hang, and Liang Xiaodan. 2021. Product1M: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1178211791.Google ScholarGoogle ScholarCross RefCross Ref
  80. [80] Lin Junyang, Yang An, Zhang Yichang, Liu Jie, Zhou Jingren, and Yang Hongxia. 2020. InterBERT: Vision-and-language interaction for multi-modal pretraining. arXiv preprint arXiv:2003.13198 (2020).Google ScholarGoogle Scholar
  81. [81] Liu Yongfei, Wu Chenfei, Tseng Shao-Yen, Lal Vasudev, He Xuming, and Duan Nan. 2022. KD-VLP: Improving end-to-end vision-and-language pretraining with object knowledge distillation. In Findings of the Association for Computational Linguistics: NAACL 2022. 15891600.Google ScholarGoogle Scholar
  82. [82] Chen Yen-Chun, Li Linjie, Yu Licheng, Kholy Ahmed El, Ahmed Faisal, Gan Zhe, Cheng Yu, and Liu Jingjing. 2020. UNITER: Universal image-text representation learning. In European Conference on Computer Vision. Springer, 104120.Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. [83] Li Linjie, Chen Yen-Chun, Cheng Yu, Gan Zhe, Yu Licheng, and Liu Jingjing. 2020. HERO: Hierarchical encoder for video+ language omni-representation pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 20462065.Google ScholarGoogle ScholarCross RefCross Ref
  84. [84] Zhou Mingyang, Zhou Luowei, Wang Shuohang, Cheng Yu, Li Linjie, Yu Zhou, and Liu Jingjing. 2021. UC2: Universal cross-lingual cross-modal vision-and-language pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 41554165.Google ScholarGoogle ScholarCross RefCross Ref
  85. [85] Lan Zhenzhong, Chen Mingda, Goodman Sebastian, Gimpel Kevin, Sharma Piyush, and Soricut Radu. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  86. [86] Raffel Colin, Shazeer Noam, Roberts Adam, Lee Katherine, Narang Sharan, Matena Michael, Zhou Yanqi, Li Wei, and Liu Peter J.. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 54855551.Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. [87] Carion Nicolas, Massa Francisco, Synnaeve Gabriel, Usunier Nicolas, Kirillov Alexander, and Zagoruyko Sergey. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision. Springer, 213229.Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. [88] Zhu Xizhou, Su Weijie, Lu Lewei, Li Bin, Wang Xiaogang, and Dai Jifeng. 2021. Deformable DETR: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations.Google ScholarGoogle Scholar
  89. [89] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, PMLR, 8748–8763.Google ScholarGoogle Scholar
  90. [90] Sharma Piyush, Ding Nan, Goodman Sebastian, and Soricut Radu. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 25562565.Google ScholarGoogle ScholarCross RefCross Ref
  91. [91] Zhou Luowei, Palangi Hamid, Zhang Lei, Hu Houdong, Corso Jason, and Gao Jianfeng. 2020. Unified vision-language pre-training for image captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 1304113049.Google ScholarGoogle ScholarCross RefCross Ref
  92. [92] Plummer Bryan A., Wang Liwei, Cervantes Chris M., Caicedo Juan C., Hockenmaier Julia, and Lazebnik Svetlana. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision. 26412649.Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. [93] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, Springer, 121–137.Google ScholarGoogle Scholar
  94. [94] Desai Karan and Johnson Justin. 2021. Virtex: Learning visual representations from textual annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1116211173.Google ScholarGoogle ScholarCross RefCross Ref
  95. [95] Yu Fei, Tang Jiji, Yin Weichong, Sun Yu, Tian Hao, Wu Hua, and Wang Haifeng. 2021. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 32083216.Google ScholarGoogle ScholarCross RefCross Ref
  96. [96] Tan Hao and Bansal Mohit. 2020. Vokenization: Improving language understanding with contextualized, visual-grounded supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 20662080.Google ScholarGoogle ScholarCross RefCross Ref
  97. [97] Li Junnan, Li Dongxu, Xiong Caiming, and Hoi Steven. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research), Chaudhuri Kamalika, Jegelka Stefanie, Song Le, Szepesvari Csaba, Niu Gang, and Sabato Sivan (Eds.), Vol. 162. PMLR, 1288812900. https://proceedings.mlr.press/v162/li22n.htmlGoogle ScholarGoogle Scholar
  98. [98] Li Junnan, Li Dongxu, Savarese Silvio, and Hoi Steven. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023).Google ScholarGoogle Scholar
  99. [99] Hsu Wei-Ning, Tsai Yao-Hung Hubert, Bolte Benjamin, Salakhutdinov Ruslan, and Mohamed Abdelrahman. 2021. HuBERT: How much can a bad teacher benefit ASR pre-training?. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 65336537.Google ScholarGoogle ScholarCross RefCross Ref
  100. [100] Mikolov Tomas, Sutskever Ilya, Chen Kai, Corrado Greg S., and Dean Jeff. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26 (2013).Google ScholarGoogle Scholar
  101. [101] Gardner Matt, Grus Joel, Neumann Mark, Tafjord Oyvind, Dasigi Pradeep, Liu Nelson F., Peters Matthew E., Schmitz Michael, and Zettlemoyer Luke. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS). 16.Google ScholarGoogle ScholarCross RefCross Ref
  102. [102] Bender Emily M. and Koller Alexander. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 51855198.Google ScholarGoogle ScholarCross RefCross Ref
  103. [103] Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicoloas Pinto, and Joseph P. Turian. 2020. Experience Grounds Language. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 8718–8735.Google ScholarGoogle Scholar
  104. [104] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123, (2017), 32–73.Google ScholarGoogle Scholar
  105. [105] Thomee Bart, Shamma David A., Friedland Gerald, Elizalde Benjamin, Ni Karl, Poland Douglas, Borth Damian, and Li Li-Jia. 2016. YFCC100M: The new data in multimedia research. Commun. ACM 59, 2 (2016), 6473.Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. [106] Agrawal Harsh, Desai Karan, Wang Yufei, Chen Xinlei, Jain Rishabh, Johnson Mark, Batra Dhruv, Parikh Devi, Lee Stefan, and Anderson Peter. 2019. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 89488957.Google ScholarGoogle ScholarCross RefCross Ref
  107. [107] Li Qing, Gong Boqing, Cui Yin, Kondratyuk Dan, Du Xianzhi, Yang Ming-Hsuan, and Brown Matthew. 2021. Towards a unified foundation model: Jointly pre-training transformers on unpaired images and text. arXiv preprint arXiv:2112.07074 (2021).Google ScholarGoogle Scholar
  108. [108] Hu Ronghang and Singh Amanpreet. 2021. Unit: Multimodal multitask learning with a unified transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14391449.Google ScholarGoogle ScholarCross RefCross Ref
  109. [109] Akbari Hassan, Yuan Liangzhe, Qian Rui, Chuang Wei-Hong, Chang Shih-Fu, Cui Yin, and Gong Boqing. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems 34 (2021).Google ScholarGoogle Scholar
  110. [110] Wang Peng, Yang An, Men Rui, Lin Junyang, Bai Shuai, Li Zhikang, Ma Jianxin, Zhou Chang, Zhou Jingren, and Yang Hongxia. 2022. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning. PMLR, 2331823340.Google ScholarGoogle Scholar
  111. [111] Lewis Mike, Liu Yinhan, Goyal Naman, Ghazvininejad Marjan, Mohamed Abdelrahman, Levy Omer, Stoyanov Veselin, and Zettlemoyer Luke. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 78717880.Google ScholarGoogle ScholarCross RefCross Ref
  112. [112] Dai Wenliang, Li Junnan, Li Dongxu, Tiong Anthony Meng Huat, Zhao Junqi, Wang Weisheng, Li Boyang, Fung Pascale, and Hoi Steven. 2023. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500 (2023).Google ScholarGoogle Scholar
  113. [113] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed Huai-hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).Google ScholarGoogle Scholar
  114. [114] Chiang Wei-Lin, Li Zhuohan, Lin Zi, Sheng Ying, Wu Zhanghao, Zhang Hao, Zheng Lianmin, Zhuang Siyuan, Zhuang Yonghao, Gonzalez Joseph E., Stoica Ion, and Xing Eric P.. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. (March2023). https://lmsys.org/blog/2023-03-30-vicuna/Google ScholarGoogle Scholar
  115. [115] Im Jinbae, Kim Moonki, Lee Hoyeop, Cho Hyunsouk, and Chung Sehee. 2021. Self-supervised multimodal opinion summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).Google ScholarGoogle ScholarCross RefCross Ref
  116. [116] Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2579–2591.Google ScholarGoogle Scholar
  117. [117] Li Yulin, Qian Yuxi, Yu Yuechen, Qin Xiameng, Zhang Chengquan, Liu Yan, Yao Kun, Han Junyu, Liu Jingtuo, and Ding Errui. 2021. StrucTexT: Structured text understanding with multi-modal transformers. In Proceedings of the 29th ACM International Conference on Multimedia. 19121920.Google ScholarGoogle ScholarDigital LibraryDigital Library
  118. [118] Huang Zheng, Chen Kai, He Jianhua, Bai Xiang, Karatzas Dimosthenis, Lu Shijian, and Jawahar C. V.. 2019. ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 15161520.Google ScholarGoogle ScholarCross RefCross Ref
  119. [119] Jaume Guillaume, Ekenel Hazim Kemal, and Thiran Jean-Philippe. 2019. FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Vol. 2. IEEE, 16.Google ScholarGoogle ScholarCross RefCross Ref
  120. [120] Gu Zhangxuan, Meng Changhua, Wang Ke, Lan Jun, Wang Weiqiang, Gu Ming, and Zhang Liqing. 2022. XYLayoutLM: Towards layout-aware multimodal networks for visually-rich document understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 45834592.Google ScholarGoogle ScholarCross RefCross Ref
  121. [121] Chu Xiangxiang, Zhang Bo, Tian Zhi, Wei Xiaolin, and Xia Huaxia. 2021. Do we really need explicit position encodings for vision transformers. arXiv preprint arXiv:2102.10882 3, 8 (2021).Google ScholarGoogle Scholar
  122. [122] Liu Nayu, Sun Xian, Yu Hongfeng, Zhang Wenkai, and Xu Guangluan. 2020. Multistage fusion with forget gate for multimodal summarization in open-domain videos. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 18341845.Google ScholarGoogle ScholarCross RefCross Ref
  123. [123] Palaskar Shruti, Libovickỳ Jindřich, Gella Spandana, and Metze Florian. 2019. Multimodal abstractive summarization for how2 videos. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 65876596.Google ScholarGoogle ScholarCross RefCross Ref
  124. [124] Yu Tiezheng, Dai Wenliang, Liu Zihan, and Fung Pascale Ngan. 2021. Vision guided generative pre-trained language models for multimodal abstractive summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.Google ScholarGoogle ScholarCross RefCross Ref
  125. [125] Sanabria Ramon, Caglayan Ozan, Palaskar Shruti, Elliott Desmond, Barrault Loïc, Specia Lucia, and Metze Florian. 2018. How2: A large-scale dataset for multimodal language understanding. In Conference on Neural Information Processing Systems.Google ScholarGoogle Scholar
  126. [126] Ling Shaoshi and Liu Yuzong. 2020. DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659 (2020).Google ScholarGoogle Scholar
  127. [127] Afouras Triantafyllos, Chung Joon Son, and Zisserman Andrew. 2018. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496 (2018).Google ScholarGoogle Scholar
  128. [128] Makino Takaki, Liao Hank, Assael Yannis, Shillingford Brendan, Garcia Basilio, Braga Otavio, and Siohan Olivier. 2019. Recurrent neural network transducer for audio-visual speech recognition. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 905912.Google ScholarGoogle ScholarCross RefCross Ref
  129. [129] Prajwal K. R., Mukhopadhyay Rudrabha, Namboodiri Vinay P., and Jawahar C. V.. 2020. Learning individual speaking styles for accurate lip to speech synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1379613805.Google ScholarGoogle ScholarCross RefCross Ref
  130. [130] Rehr Robert and Gerkmann Timo. 2017. On the importance of super-Gaussian speech priors for machine-learning based speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26, 2 (2017), 357366.Google ScholarGoogle ScholarDigital LibraryDigital Library
  131. [131] Cootes Timothy F., Edwards Gareth J., and Taylor Christopher J.. 2001. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 6 (2001), 681685.Google ScholarGoogle ScholarDigital LibraryDigital Library
  132. [132] Zhu Lingyu and Rahtu Esa. 2021. Leveraging category information for single-frame visual sound source separation. In 2021 9th European Workshop on Visual Information Processing (EUVIP). IEEE, 16.Google ScholarGoogle Scholar
  133. [133] Zhao Hang, Gan Chuang, Rouditchenko Andrew, Vondrick Carl, McDermott Josh, and Torralba Antonio. 2018. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV). 570586.Google ScholarGoogle ScholarDigital LibraryDigital Library
  134. [134] Kim Erin Hea-Jin, Jeong Yoo Kyung, Kim Yuyoung, Kang Keun Young, and Song Min. 2016. Topic-based content and sentiment analysis of Ebola virus on Twitter and in the news. Journal of Information Science 42, 6 (2016), 763781.Google ScholarGoogle ScholarDigital LibraryDigital Library
  135. [135] Camacho-Collados Jose and Pilehvar Mohammad Taher. 2018. On the role of text preprocessing in neural network architectures: An evaluation study on text categorization and sentiment analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. 4046.Google ScholarGoogle ScholarCross RefCross Ref
  136. [136] Wood Benjamin, Williams Owain, Nagarajan Vijaya, and Sacks Gary. 2021. Market strategies used by processed food manufacturers to increase and consolidate their power: A systematic review and document analysis. Globalization and Health 17, 1 (2021), 123.Google ScholarGoogle ScholarCross RefCross Ref
  137. [137] Chen Minping and Li Xia. 2020. SWAFN: Sentimental words aware fusion network for multimodal sentiment analysis. In Proceedings of the 28th International Conference on Computational Linguistics. 10671077.Google ScholarGoogle ScholarCross RefCross Ref
  138. [138] Hu Linmei, Zhang Bin, Hou Lei, and Li Juanzi. 2017. Adaptive online event detection in news streams. Knowledge-Based Systems 138 (2017), 105112.Google ScholarGoogle ScholarCross RefCross Ref
  139. [139] Algiriyage Nilani, Prasanna Raj, Stock Kristin, Doyle Emma E. H., and Johnston David. 2022. Multi-source multimodal data and deep learning for disaster response: A systematic review. SN Computer Science 3, 1 (2022), 129.Google ScholarGoogle ScholarDigital LibraryDigital Library
  140. [140] Alam Firoj, Ofli Ferda, and Imran Muhammad. 2018. CrisisMMD: Multimodal Twitter datasets from natural disasters. In Twelfth International AAAI Conference on Web and Social Media.Google ScholarGoogle ScholarCross RefCross Ref
  141. [141] Chen Qi, Wang Wei, Huang Kaizhu, De Suparna, and Coenen Frans. 2021. Multi-modal generative adversarial networks for traffic event detection in smart cities. Expert Systems with Applications 177 (2021), 114939.Google ScholarGoogle ScholarDigital LibraryDigital Library
  142. [142] Barrón-Cedeno Alberto, Jaradat Israa, Martino Giovanni Da San, and Nakov Preslav. 2019. Proppy: Organizing the news based on their propagandistic content. Information Processing & Management 56, 5 (2019), 18491864.Google ScholarGoogle ScholarDigital LibraryDigital Library
  143. [143] Martino Giovanni Da San, Yu Seunghak, Barrón-Cedeno Alberto, Petrov Rostislav, and Nakov Preslav. 2019. Fine-grained analysis of propaganda in news article. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 56365646.Google ScholarGoogle ScholarCross RefCross Ref
  144. [144] Jin Zhiwei, Cao Juan, Guo Han, Zhang Yongdong, and Luo Jiebo. 2017. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia. 795816.Google ScholarGoogle ScholarDigital LibraryDigital Library
  145. [145] Zhou Xinyi, Wu Jindi, and Zafarani Reza. 2020.Similarity-aware multi-modal fake news detection. In Advances in Knowledge Discovery and Data Mining: 24th Pacific-Asia Conference, PAKDD 2020, Singapore, May 11–14, 2020, Proceedings, Part II. Springer, 354367.Google ScholarGoogle ScholarDigital LibraryDigital Library
  146. [146] Wang Yaqing, Ma Fenglong, Jin Zhiwei, Yuan Ye, Xun Guangxu, Jha Kishlay, Su Lu, and Gao Jing. 2018. EANN: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 849857.Google ScholarGoogle ScholarDigital LibraryDigital Library
  147. [147] Khattar Dhruv, Goud Jaipal Singh, Gupta Manish, and Varma Vasudeva. 2019. MVAE: Multimodal variational autoencoder for fake news detection. In The World Wide Web Conference. 29152921.Google ScholarGoogle ScholarDigital LibraryDigital Library
  148. [148] Wang Jingzi, Mao Hongyan, and Li Hongwei. 2022. FMFN: Fine-grained multimodal fusion networks for fake news detection. Applied Sciences 12, 3 (2022), 1093.Google ScholarGoogle ScholarCross RefCross Ref
  149. [149] Zellers Rowan, Bisk Yonatan, Farhadi Ali, and Choi Yejin. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 67206731.Google ScholarGoogle ScholarCross RefCross Ref
  150. [150] Song Dandan, Ma Siyi, Sun Zhanchen, Yang Sicheng, and Liao Lejian. 2021. KVL-BERT: Knowledge enhanced visual-and-linguistic BERT for visual commonsense reasoning. Knowledge-Based Systems 230 (2021), 107408.Google ScholarGoogle ScholarDigital LibraryDigital Library
  151. [151] Tan Hao and Bansal Mohit. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 51005111.Google ScholarGoogle ScholarCross RefCross Ref
  152. [152] Huang Zhicheng, Zeng Zhaoyang, Liu Bei, Fu Dongmei, and Fu Jianlong. 2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020).Google ScholarGoogle Scholar
  153. [153] Zhu Fengda, Zhu Yi, Chang Xiaojun, and Liang Xiaodan. 2020. Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1001210022.Google ScholarGoogle ScholarCross RefCross Ref
  154. [154] Zhou Luowei, Palangi Hamid, Zhang Lei, Hu Houdong, Corso Jason, and Gao Jianfeng. 2020. Unified vision-language pre-training for image captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 1304113049.Google ScholarGoogle ScholarCross RefCross Ref
  155. [155] Vinyals Oriol, Toshev Alexander, Bengio Samy, and Erhan Dumitru. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 31563164.Google ScholarGoogle ScholarCross RefCross Ref
  156. [156] Chen Long, Zhang Hanwang, Xiao Jun, Nie Liqiang, Shao Jian, Liu Wei, and Chua Tat-Seng. 2017. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 56595667.Google ScholarGoogle ScholarCross RefCross Ref
  157. [157] Rennie Steven J., Marcheret Etienne, Mroueh Youssef, Ross Jerret, and Goel Vaibhava. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 70087024.Google ScholarGoogle ScholarCross RefCross Ref
  158. [158] Chappuis Christel, Lobry Sylvain, Kellenberger Benjamin Alexander, Saux Bertrand Le, and Tuia Devis. 2021. How to find a good image-text embedding for remote sensing visual question answering?. In European Conference on Machine Learning (ECML) Workshops.Google ScholarGoogle Scholar
  159. [159] Rahman Tanzila, Chou Shih-Han, Sigal Leonid, and Carenini Giuseppe. 2021. An improved attention for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16531662.Google ScholarGoogle ScholarCross RefCross Ref
  160. [160] Subramanian Sanjay, Singh Sameer, and Gardner Matt. 2019. Analyzing compositionality in visual question answering. Advances in Neural Information Processing Systems 7 (2019).Google ScholarGoogle Scholar
  161. [161] Marino Kenneth, Rastegari Mohammad, Farhadi Ali, and Mottaghi Roozbeh. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition. 31953204.Google ScholarGoogle ScholarCross RefCross Ref
  162. [162] Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Yufan Chen, Peter Wu, Michelle A Lee, Yuke Zhu, Ruslan Salakhutdinov, and Louis-Philippe Morency. MultiBench: Multiscale Benchmarks for Multimodal Representation Learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).Google ScholarGoogle Scholar
  163. [163] Shi Xingjian, Mueller Jonas, Erickson Nick, Li Mu, and Smola Alex. Benchmarking multimodal AutoML for tabular data with text fields. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).Google ScholarGoogle Scholar
  164. [164] Park Dong Huk, Hendricks Lisa Anne, Akata Zeynep, Schiele Bernt, Darrell Trevor, and Rohrbach Marcus. 2016. Attentive explanations: Justifying decisions and pointing to the evidence. arXiv preprint arXiv:1612.04757 (2016).Google ScholarGoogle Scholar
  165. [165] Agrawal Aishwarya, Batra Dhruv, Parikh Devi, and Kembhavi Aniruddha. 2018. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 49714980.Google ScholarGoogle ScholarCross RefCross Ref
  166. [166] Reed Scott, Akata Zeynep, Yan Xinchen, Logeswaran Lajanugen, Schiele Bernt, and Lee Honglak. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning. PMLR, 10601069.Google ScholarGoogle Scholar
  167. [167] Wah Catherine, Branson Steve, Welinder Peter, Perona Pietro, and Belongie Serge. 2011. The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology (2011).Google ScholarGoogle Scholar
  168. [168] Xu Tao, Zhang Pengchuan, Huang Qiuyuan, Zhang Han, Gan Zhe, Huang Xiaolei, and He Xiaodong. 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 13161324.Google ScholarGoogle ScholarCross RefCross Ref
  169. [169] Qu Leyuan, Weber Cornelius, and Wermter Stefan. 2019. LipSound: Neural mel-spectrogram reconstruction for lip reading. In INTERSPEECH. 27682772.Google ScholarGoogle Scholar
  170. [170] Afouras Triantafyllos, Chung Joon Son, Senior Andrew, Vinyals Oriol, and Zisserman Andrew. 2018. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 12 (2018), 87178727.Google ScholarGoogle ScholarCross RefCross Ref
  171. [171] Harte Naomi and Gillen Eoin. 2015. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia 17, 5 (2015), 603615.Google ScholarGoogle ScholarDigital LibraryDigital Library
  172. [172] Ping Wei, Peng Kainan, Gibiansky Andrew, Arik Sercan O., Kannan Ajay, Narang Sharan, Raiman Jonathan, and Miller John. 2018. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  173. [173] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2018. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’18). IEEE, 4779–4783.Google ScholarGoogle Scholar
  174. [174] Ephrat Ariel and Peleg Shmuel. 2017. Vid2Speech: Speech reconstruction from silent video. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 50955099.Google ScholarGoogle ScholarDigital LibraryDigital Library
  175. [175] Akbari Hassan, Arora Himani, Cao Liangliang, and Mesgarani Nima. 2018. Lip2AudSpec: Speech reconstruction from silent lip movements video. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 25162520.Google ScholarGoogle ScholarDigital LibraryDigital Library
  176. [176] Vougioukas Konstantinos, Ma Pingchuan, Petridis Stavros, and Pantic Maja. 2019. Video-driven speech reconstruction using generative adversarial networks. Proc. Interspeech 2019 (2019), 41254129.Google ScholarGoogle ScholarCross RefCross Ref
  177. [177] Su Weijie, Zhu Xizhou, Cao Yue, Li Bin, Lu Lewei, Wei Furu, and Dai Jifeng. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations. https://openreview.net/forum?id=SygXPaEYvHGoogle ScholarGoogle Scholar
  178. [178] Weng Wei-Hung and Szolovits Peter. 2019. Representation learning for electronic health records. arXiv preprint arXiv:1909.09248 (2019).Google ScholarGoogle Scholar
  179. [179] Zhang Xianli, Qian Buyue, Li Yang, Liu Yang, Chen Xi, Guan Chong, and Li Chen. 2021. Learning robust patient representations from multi-modal electronic health records: A supervised deep learning approach. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM). SIAM, 585593.Google ScholarGoogle ScholarCross RefCross Ref
  180. [180] Chen Tao, Hong Richang, Guo Yanrong, Hao Shijie, and Hu Bin. 2022. \(\mathrm{MS}^2\)-GNN: Exploring GNN-based multimodal fusion network for depression detection. IEEE Transactions on Cybernetics (2022).Google ScholarGoogle ScholarCross RefCross Ref
  181. [181] Gao Jianliang, Lyu Tengfei, Xiong Fan, Wang Jianxin, Ke Weimao, and Li Zhao. 2021. Predicting the survival of cancer patients with multimodal graph neural network. IEEE/ACM Transactions on Computational Biology and Bioinformatics 19, 2 (2021), 699709.Google ScholarGoogle Scholar
  182. [182] Caglayan O., Madhyastha P., Specia L., and Barrault L.. 2019. Probing the need for visual context in multimodal machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 41594170.Google ScholarGoogle ScholarCross RefCross Ref
  183. [183] Bahdanau Dzmitry, Cho Kyunghyun, and Bengio Yoshua. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, Bengio Yoshua and LeCun Yann (Eds.).Google ScholarGoogle Scholar
  184. [184] Su Jinsong, Chen Jinchang, Jiang Hui, Zhou Chulun, Lin Huan, Ge Yubin, Wu Qingqiang, and Lai Yongxuan. 2021. Multi-modal neural machine translation with deep semantic interactions. Information Sciences 554 (2021), 4760.Google ScholarGoogle ScholarCross RefCross Ref
  185. [185] Hodosh Micah, Young Peter, and Hockenmaier Julia. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research 47 (2013), 853899.Google ScholarGoogle ScholarDigital LibraryDigital Library
  186. [186] Zadeh Amir and Pu Paul. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers).Google ScholarGoogle ScholarCross RefCross Ref
  187. [187] Johnson Alistair E. W., Pollard Tom J., Shen Lu, Lehman Li-wei H., Feng Mengling, Ghassemi Mohammad, Moody Benjamin, Szolovits Peter, Celi Leo Anthony, and Mark Roger G.. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 1 (2016), 19.Google ScholarGoogle ScholarCross RefCross Ref
  188. [188] Han Xintong. 2017. Fashion 200K Benchmark. https://github.com/xthan/fashion-200k. [Online; accessed 2017].
  189. [189] Silberman Nathan and Fergus Rob. 2011. Indoor scene segmentation using a structured light sensor. In Proceedings of the International Conference on Computer Vision - Workshop on 3D Representation and Recognition.
  190. [190] Silberman Nathan, Hoiem Derek, Kohli Pushmeet, and Fergus Rob. 2012. Indoor segmentation and support inference from RGBD images. In Proceedings of the 12th European Conference on Computer Vision-Volume Part V. 746–760.
  191. [191] Biten Ali Furkan, Gomez Lluis, Rusinol Marçal, and Karatzas Dimosthenis. 2019. Good news, everyone! Context driven entity-aware captioning for news images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12466–12475.
  192. [192] Xu Jun, Mei Tao, Yao Ting, and Rui Yong. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.
  193. [193] Xu Dejing, Zhao Zhou, Xiao Jun, Wu Fei, Zhang Hanwang, He Xiangnan, and Zhuang Yueting. 2017. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia. 1645–1653.
  194. [194] Jang Yunseok, Song Yale, Yu Youngjae, Kim Youngjin, and Kim Gunhee. 2017. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2758–2766.
  195. [195] Yu Licheng, Chen Xinlei, Gkioxari Georgia, Bansal Mohit, Berg Tamara L., and Batra Dhruv. 2019. Multi-target embodied question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6309–6318.
  196. [196] Cangea Catalina, Belilovsky Eugene, Liò Pietro, and Courville Aaron C. 2019. VideoNavQA: Bridging the gap between visual and embodied question answering. In Visually Grounded Interaction and Language (ViGIL), NeurIPS 2019 Workshop.
  197. [197] Kafle Kushal and Kanan Christopher. 2017. An analysis of visual question answering algorithms. In Proceedings of the IEEE International Conference on Computer Vision. 1965–1973.
  198. [198] Caesar Holger, Bankiti Varun, Lang Alex H., Vora Sourabh, Liong Venice Erin, Xu Qiang, Krishnan Anush, Pan Yu, Baldan Giancarlo, and Beijbom Oscar. 2020. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11621–11631.
  199. [199] Nilsback Maria-Elena and Zisserman Andrew. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 722–729.
  200. [200] Zadeh Amir, Zellers Rowan, Pincus Eli, and Morency Louis-Philippe. 2016. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259 (2016).
  201. [201] Zhang Yuanhang, Yang Shuang, Xiao Jingyun, Shan Shiguang, and Chen Xilin. 2020. Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020). IEEE, 356–363.
  202. [202] Fallon Maurice, Johannsson Hordur, Kaess Michael, and Leonard John J. 2013. The MIT Stata Center Dataset. The International Journal of Robotics Research 32, 14 (2013), 1695–1699.
  203. [203] Baevski Alexei, Hsu Wei-Ning, Xu Qiantong, Babu Arun, Gu Jiatao, and Auli Michael. 2022. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning. PMLR, 1298–1312.
  204. [204] Singh Amanpreet, Hu Ronghang, Goswami Vedanuj, Couairon Guillaume, Galuba Wojciech, Rohrbach Marcus, and Kiela Douwe. 2022. FLAVA: A foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15638–15650.
  205. [205] Ektefaie Yasha, Dasoulas George, Noori Ayush, Farhat Maha, and Zitnik Marinka. 2023. Multimodal learning with graphs. Nature Machine Intelligence (2023), 1–11.
  206. [206] Ni Minheng, Huang Haoyang, Su Lin, Cui Edward, Bharti Taroon, Wang Lijuan, Zhang Dongdong, and Duan Nan. 2021. M3P: Learning universal representations via multitask multilingual multimodal pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3977–3986.
  207. [207] Jain Aashi, Guo Mandy, Srinivasan Krishna, Chen Ting, Kudugunta Sneha, Jia Chao, Yang Yinfei, and Baldridge Jason. 2021. MURAL: Multimodal, multitask retrieval across languages. arXiv preprint arXiv:2109.05125 (2021).
  208. [208] Zeng Yan, Zhou Wangchunshu, Luo Ao, and Zhang Xinsong. 2022. Cross-view language modeling: Towards unified cross-lingual cross-modal pre-training. arXiv preprint arXiv:2206.00621 (2022).
  209. [209] Li Junnan, Selvaraju Ramprasaath, Gotmare Akhilesh, Joty Shafiq, Xiong Caiming, and Hoi Steven Chu Hong. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems 34 (2021), 9694–9705.
  210. [210] Chen Xi, Wang Xiao, Changpinyo Soravit, Piergiovanni AJ, Padlewski Piotr, Salz Daniel, Goodman Sebastian, Grycner Adam, Mustafa Basil, Beyer Lucas, Kolesnikov Alexander, Puigcerver Joan, Ding Nan, Rong Keran, Akbari Hassan, Mishra Gaurav, Xue Linting, Thapliyal Ashish V., Bradbury James, Kuo Weicheng, Seyedhosseini Mojtaba, Jia Chao, Ayan Burcu Karagol, Riquelme Carlos, Steiner Andreas, Angelova Anelia, Zhai Xiaohua, Houlsby Neil, and Soricut Radu. 2022. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 (2022).
  211. [211] Tsimpoukelli Maria, Menick Jacob L., Cabi Serkan, Eslami S. M., Vinyals Oriol, and Hill Felix. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34 (2021), 200–212.
  212. [212] Zhuge Mingchen, Gao Dehong, Fan Deng-Ping, Jin Linbo, Chen Ben, Zhou Haoming, Qiu Minghui, and Shao Ling. 2021. Kaleido-BERT: Vision-language pre-training on fashion domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12647–12657.
  213. [213] Chen Fei-Long, Zhang Du-Zhen, Han Ming-Lun, Chen Xiu-Yi, Shi Jing, Xu Shuang, and Xu Bo. 2023. VLP: A survey on vision-language pre-training. Machine Intelligence Research 20, 1 (2023), 38–56.
  214. [214] Fang Zhiyuan, Wang Jianfeng, Hu Xiaowei, Wang Lijuan, Yang Yezhou, and Liu Zicheng. 2021. Compressing visual-linguistic model via knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1428–1438.
  215. [215] Jia Chao, Yang Yinfei, Xia Ye, Chen Yi-Ting, Parekh Zarana, Pham Hieu, Le Quoc, Sung Yun-Hsuan, Li Zhen, and Duerig Tom. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning. PMLR, 4904–4916.
  216. [216] Fei Nanyi, Lu Zhiwu, Gao Yizhao, Yang Guoxing, Huo Yuqi, Wen Jingyuan, Lu Haoyu, Song Ruihua, Gao Xin, Xiang Tao, Sun Haoran, and Wen Jiling. 2022. Towards artificial general intelligence via a multimodal foundation model. Nature Communications 13, 1 (2022), 3094.
  217. [217] Michelsanti Daniel, Tan Zheng-Hua, Zhang Shi-Xiong, Xu Yong, Yu Meng, Yu Dong, and Jensen Jesper. 2021. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 1368–1396.


Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 3 (March 2024), 665 pages. ISSN: 1551-6857; EISSN: 1551-6865. DOI: 10.1145/3613614. Editor: Abdulmotaleb El Saddik.

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication history: Received 18 January 2023; Revised 14 June 2023; Accepted 10 August 2023; Online AM 29 August 2023; Published 23 October 2023.
