Abstract
Vision-language pre-training is an emerging and fast-developing research topic that transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: an object encoder and a sentence encoder that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of an image can span a hierarchy of granularities, from a simple individual label through a phrase to a comprehensive natural sentence, we pre-train Uni-EDEN through four multi-granular vision-language proxy tasks: Masked Object Classification, Masked Region Phrase Generation, Image-Sentence Matching, and Masked Sentence Generation. In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it on four downstream vision-language perception and generation tasks.
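To make the two-stream design concrete, the sketch below illustrates, in PyTorch, an encoder-decoder of the kind the abstract describes: an object encoder over detected region features, a sentence encoder over word embeddings, and a sentence decoder whose cross-attention couples the two streams. This is a minimal, hypothetical sketch; all module names, feature dimensions, and layer counts are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a two-stream encoder-decoder in the spirit of
# Uni-EDEN; names, sizes, and layer counts are assumptions for illustration.
import torch
import torch.nn as nn

class TwoStreamEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, n_heads=12, n_layers=2):
        super().__init__()
        # Object encoder: contextualizes detected region features
        # (e.g., 2048-d Faster R-CNN outputs, projected to d_model).
        self.region_proj = nn.Linear(2048, d_model)
        self.object_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Sentence encoder: contextualizes word embeddings on their own stream.
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.sentence_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        # Sentence decoder: cross-attends from words to regions, supporting
        # both multi-modal reasoning and left-to-right sentence generation.
        self.sentence_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, token_ids):
        v = self.object_encoder(self.region_proj(region_feats))  # (B, R, d)
        w = self.sentence_encoder(self.word_emb(token_ids))      # (B, T, d)
        # Causal mask limits each position to earlier words, as needed for
        # generation-style proxy tasks such as Masked Sentence Generation.
        T = token_ids.size(1)
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = self.sentence_decoder(w, v, tgt_mask=causal)         # (B, T, d)
        return self.lm_head(h)  # per-position vocabulary logits
```

Under this reading, perception tasks such as Image-Sentence Matching would pool the decoder output and feed it to a classification head instead of the language-modeling head, while the multi-granular proxy tasks differ mainly in which inputs are masked and which targets (object labels, region phrases, or full sentences) supervise the outputs.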