
Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning

Published: 06 February 2023

Abstract

Video captioning, which bridges vision and language, is a fundamental yet challenging task in computer vision. To generate accurate and comprehensive sentences, both visual and semantic information are important. However, most existing methods simply concatenate different types of features and ignore the interactions between them. In addition, there is a large semantic gap between the visual feature space and the semantic embedding space, which makes the task particularly challenging. To address these issues, we propose a framework named semantic embedding guided attention with Explicit visual Feature Fusion for vidEo CapTioning, EFFECT for short, in which we design an explicit visual-feature fusion (EVF) scheme that captures the pairwise interactions between multiple visual modalities and fuses the multimodal visual features of videos in an explicit way. Furthermore, we propose a novel attention mechanism called semantic embedding guided attention (SEGA), which cooperates with temporal attention to generate a joint attention map. Specifically, in SEGA, semantic word-embedding information is leveraged to guide the model to attend to the most relevant visual features at each decoding step. In this way, the semantic gap between the visual and semantic spaces is alleviated to some extent. To evaluate the proposed model, we conduct extensive experiments on two widely used datasets, i.e., MSVD and MSR-VTT. The experimental results demonstrate that our approach achieves state-of-the-art performance on four evaluation metrics.
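
The abstract describes the two components only in architectural terms: EVF models pairwise interactions between visual modalities rather than concatenating them, and SEGA combines temporal attention with guidance from the previously generated word's embedding. As a rough illustration of how such components could be wired together, here is a minimal PyTorch sketch; every module name, dimension, and formula below is an assumption made for illustration, not the paper's actual design.

```python
# Hypothetical sketch of EVF-style fusion and SEGA-style attention (PyTorch).
# All shapes and formulas are assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplicitVisualFusion(nn.Module):
    """Fuse two visual modalities (e.g., appearance and motion) by modeling
    their pairwise interaction explicitly instead of plain concatenation."""
    def __init__(self, dim):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)
        self.proj_b = nn.Linear(dim, dim)
        self.out = nn.Linear(3 * dim, dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, frames, dim) features from two modalities.
        # Elementwise product of projected features as one explicit
        # pairwise-interaction term (an assumption; the paper may differ).
        interaction = torch.tanh(self.proj_a(feat_a)) * torch.tanh(self.proj_b(feat_b))
        fused = torch.cat([feat_a, feat_b, interaction], dim=-1)
        return self.out(fused)  # (batch, frames, dim)

class SemanticEmbeddingGuidedAttention(nn.Module):
    """Combine standard temporal attention over frames with a semantic score
    derived from the previous word embedding, yielding a joint attention map."""
    def __init__(self, dim, hidden, embed_dim):
        super().__init__()
        self.temporal = nn.Linear(dim + hidden, 1)     # decoder-state branch
        self.semantic = nn.Linear(dim + embed_dim, 1)  # word-embedding branch

    def forward(self, visual, hidden_state, word_embed):
        # visual: (batch, frames, dim); hidden_state: (batch, hidden);
        # word_embed: (batch, embed_dim), embedding of the previous word.
        frames = visual.size(1)
        h = hidden_state.unsqueeze(1).expand(-1, frames, -1)
        w = word_embed.unsqueeze(1).expand(-1, frames, -1)
        temporal_logits = self.temporal(torch.cat([visual, h], dim=-1)).squeeze(-1)
        semantic_logits = self.semantic(torch.cat([visual, w], dim=-1)).squeeze(-1)
        # Joint attention map over frames from both cues.
        joint = F.softmax(temporal_logits + semantic_logits, dim=-1)
        context = torch.bmm(joint.unsqueeze(1), visual).squeeze(1)  # (batch, dim)
        return context, joint
```

In a full captioning decoder, the fused features from ExplicitVisualFusion would feed SemanticEmbeddingGuidedAttention at every decoding step, with the resulting context vector concatenated to the LSTM input, which is a common arrangement in attention-based captioning models.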



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2 (March 2023), 540 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3572860
  Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 February 2023
      • Online AM: 22 July 2022
      • Accepted: 5 July 2022
      • Revised: 23 May 2022
      • Received: 27 October 2021
Published in TOMM Volume 19, Issue 2


      Qualifiers

      • research-article
      • Refereed
