
TEVL: Trilinear Encoder for Video-language Representation Learning

Published: 07 June 2023

Abstract

Pre-training a model on large-scale unlabeled web videos and then fine-tuning it on specific tasks is a canonical approach to learning video and language representations. However, the Automatic Speech Recognition (ASR) transcripts that accompany these videos are transcribed directly from audio; they may be inconsistent with the visual content and can impair the model's language modeling ability. Meanwhile, previous video-language (V-L) models fuse visual and language features with single- or dual-stream architectures, which are ill-suited to this setting. In addition, traditional V-L research focuses mainly on the interaction between the vision and language modalities and leaves the modeling of relationships within each modality untouched. To address these issues while keeping manual labeling cost low, we add automatically extracted dense captions as supplementary text and propose TEVL (Trilinear Encoder for Video-Language representation learning), a new trilinear video-language interaction framework. TEVL consists of three unimodal encoders, a TRIlinear encOder (TRIO) block, and a temporal Transformer. TRIO is specially designed to support effective text-vision-text interaction, encouraging inter-modal cooperation while maintaining intra-modal dependencies. We pre-train TEVL on the HowTo100M and TV datasets with four task objectives. Experimental results demonstrate that TEVL learns powerful video-text representations and achieves competitive performance on three downstream tasks: multimodal video captioning, video Question Answering (QA), and video-and-language inference. Implementation code is available at https://github.com/Gufrannn/TEVL.
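To make the text-vision-text interaction concrete, below is a minimal PyTorch sketch of a trilinear interaction block in the spirit of TRIO as described in the abstract. It is an illustration only: the class name TrilinearInteractionBlock, the layer layout, and all dimensions are assumptions rather than the authors' implementation; the official code is at https://github.com/Gufrannn/TEVL.

```python
# Hypothetical sketch of a text-vision-text interaction block, assuming
# subtitle tokens, video frame features, and dense-caption tokens share a
# common hidden size. Not the official TEVL/TRIO implementation.
import torch
import torch.nn as nn


class TrilinearInteractionBlock(nn.Module):
    """Fuses ASR-subtitle tokens, video frame features, and dense-caption tokens.

    Each modality first attends to itself (intra-modal dependencies), then the
    visual stream cross-attends to both text streams (inter-modal cooperation).
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for m in ("subtitle", "video", "caption")
        })
        self.video_to_subtitle = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_caption = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, subtitle, video, caption):
        # Intra-modal self-attention preserves within-modality dependencies.
        subtitle = subtitle + self.self_attn["subtitle"](subtitle, subtitle, subtitle)[0]
        video = video + self.self_attn["video"](video, video, video)[0]
        caption = caption + self.self_attn["caption"](caption, caption, caption)[0]

        # Inter-modal cooperation: video queries attend to both text streams.
        fused = video
        fused = fused + self.video_to_subtitle(fused, subtitle, subtitle)[0]
        fused = fused + self.video_to_caption(fused, caption, caption)[0]
        fused = self.norm(fused)
        return fused + self.ffn(fused)


if __name__ == "__main__":
    block = TrilinearInteractionBlock()
    sub = torch.randn(2, 32, 768)   # subtitle token embeddings
    vid = torch.randn(2, 16, 768)   # per-frame video features
    cap = torch.randn(2, 24, 768)   # dense-caption token embeddings
    print(block(sub, vid, cap).shape)  # torch.Size([2, 16, 768])
```

In this sketch the fused output keeps the video sequence length, so a temporal Transformer (as in TEVL) could consume it directly; the actual paper should be consulted for how the three streams are combined and which stream drives the queries.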


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 5s (October 2023), 280 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3599694
Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 7 June 2023
• Online AM: 24 February 2023
• Accepted: 21 February 2023
• Revised: 13 February 2023
• Received: 19 September 2022
