
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Published: 16 February 2022

Abstract

Vision-language pre-training is an emerging and fast-developing research topic that transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) that facilitates both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: an object encoder and a sentence encoder that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic description of an image can span a hierarchy of granularities, from simple to comprehensive (an individual label, a phrase, and a natural sentence), we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification, Masked Region Phrase Generation, Image-Sentence Matching, and Masked Sentence Generation. In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it on four vision-language perception and generation downstream tasks.
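To make the two-stream encoder-decoder layout described above concrete, below is a minimal PyTorch-style sketch of an object encoder, a sentence encoder, and a sentence decoder that cross-attends over encoded region features, plus illustrative heads for two of the four proxy tasks. All module names, feature dimensions (e.g., 2048-d region features from an off-the-shelf detector), layer counts, and vocabulary/class sizes here are assumptions for illustration only, not the authors' released implementation; positional and segment embeddings, masking, and the full set of proxy-task heads are omitted.

```python
import torch
import torch.nn as nn

class UniEDENSketch(nn.Module):
    """Hypothetical sketch of a two-stream encoder-decoder:
    an object encoder and a sentence encoder process each modality
    separately, and a sentence decoder fuses them via cross-attention."""
    def __init__(self, d_model=512, nhead=8, num_layers=3,
                 vocab_size=30522, num_object_classes=1600):
        super().__init__()
        # Object stream: project detector region features into the model dimension.
        self.obj_proj = nn.Linear(2048, d_model)
        obj_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.object_encoder = nn.TransformerEncoder(obj_layer, num_layers)
        # Sentence stream: token embeddings followed by a Transformer encoder.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        sent_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.sentence_encoder = nn.TransformerEncoder(sent_layer, num_layers)
        # Sentence decoder: cross-attends from sentence tokens to object features.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.sentence_decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Illustrative heads for two proxy tasks (assumed, not the paper's exact heads):
        self.obj_cls_head = nn.Linear(d_model, num_object_classes)  # Masked Object Classification
        self.word_head = nn.Linear(d_model, vocab_size)             # Masked Sentence Generation

    def forward(self, region_feats, token_ids):
        obj = self.object_encoder(self.obj_proj(region_feats))   # (B, R, d)
        sent = self.sentence_encoder(self.tok_emb(token_ids))    # (B, T, d)
        fused = self.sentence_decoder(tgt=sent, memory=obj)      # (B, T, d)
        return self.obj_cls_head(obj), self.word_head(fused)

# Toy usage: a batch of 2 images with 36 region features each and 20-token sentences.
model = UniEDENSketch()
regions = torch.randn(2, 36, 2048)
tokens = torch.randint(0, 30522, (2, 20))
obj_logits, word_logits = model(regions, tokens)
```

The structural point the sketch tries to capture is that the two encoders operate on each modality independently, while all inter-modal interaction happens in the decoder; this is what lets one backbone serve both perception tasks (reasoning over fused representations) and generation tasks (producing sentences token by token).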



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2
  May 2022, 494 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3505207


Publisher

Association for Computing Machinery, New York, NY, United States

              Publication History

              • Published: 16 February 2022
              • Accepted: 1 June 2021
              • Revised: 1 May 2021
              • Received: 1 December 2020
Published in TOMM Volume 18, Issue 2


              Qualifiers

              • research-article
              • Refereed
