TEVL: Trilinear Encoder for Video-language Representation Learning

Abstract
Pre-training a model on large-scale unlabeled web videos followed by task-specific fine-tuning is a canonical approach to learning video and language representations. However, the accompanying Automatic Speech Recognition (ASR) transcripts in these videos are transcribed directly from audio, which may be inconsistent with the visual content and can impair the language modeling ability of the model. Meanwhile, previous vision-language (V-L) models fuse visual and language features with single- or dual-stream architectures, neither of which is well suited to this noisy, multi-text setting. Moreover, traditional V-L research focuses mainly on the interaction between vision and language modalities and leaves the modeling of relationships within each modality untouched. To address these issues while keeping manual annotation cost low, we add automatically extracted dense captions as supplementary text and propose a new trilinear video-language interaction framework, TEVL (Trilinear Encoder for Video-Language representation learning). TEVL contains three unimodal encoders, a TRIlinear encOder (TRIO) block, and a temporal Transformer. TRIO is specially designed to support effective text-vision-text interaction, encouraging inter-modal cooperation while maintaining intra-modal dependencies. We pre-train TEVL on the HowTo100M and TV datasets with four task objectives. Experimental results demonstrate that TEVL learns powerful video-text representations and achieves competitive performance on three downstream tasks: multimodal video captioning, video Question Answering (QA), and video-and-language inference. Implementation code is available at https://github.com/Gufrannn/TEVL.
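The text-vision-text interaction described above can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, with single-head, weight-free attention; the function names (`trio_block`, `attend`) and dimensions are hypothetical and are not taken from the paper's implementation. It only shows the structural idea: each stream self-attends to preserve intra-modal dependencies, while the visual stream additionally attends to both text streams (ASR transcript and dense captions).

```python
# Hypothetical sketch of a TRIO-style text-vision-text interaction block.
# All names and dimensions are illustrative assumptions, not the paper's code.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: queries q over keys k / values v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def trio_block(asr, vis, cap):
    """Fuse ASR-text, visual, and dense-caption features.

    Intra-modal: each stream self-attends, keeping within-modality
    dependencies. Inter-modal: the visual stream also attends to both
    text streams, giving the text-vision-text interaction.
    """
    asr_out = attend(asr, asr, asr)        # intra-modal (ASR text)
    cap_out = attend(cap, cap, cap)        # intra-modal (dense captions)
    vis_self = attend(vis, vis, vis)       # intra-modal (visual)
    vis_from_asr = attend(vis, asr, asr)   # vision attends to ASR text
    vis_from_cap = attend(vis, cap, cap)   # vision attends to captions
    vis_out = vis_self + vis_from_asr + vis_from_cap
    return asr_out, vis_out, cap_out

rng = np.random.default_rng(0)
d = 8
asr = rng.standard_normal((5, d))   # 5 ASR tokens
vis = rng.standard_normal((4, d))   # 4 video frames
cap = rng.standard_normal((6, d))   # 6 dense-caption tokens
a, v, c = trio_block(asr, vis, cap)
print(a.shape, v.shape, c.shape)    # (5, 8) (4, 8) (6, 8)
```

Each output keeps its stream's sequence length, so the fused streams could then be concatenated along the time axis and fed to a temporal Transformer, as in the pipeline the abstract outlines.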