A Multi-instance Multi-label Dual Learning Approach for Video Captioning

Abstract
Video captioning is a challenging task in multimedia processing that aims to generate informative natural-language descriptions (captions) of video content. Previous video captioning approaches mainly focused on capturing the visual information in videos, using an encoder-decoder structure to generate captions. Recently, an encoder-decoder-reconstructor structure was proposed for video captioning, which exploits the information in both videos and captions. Building on this structure, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) for video captioning. Specifically, MIMLDL contains two modules: a caption generation module and a video reconstruction module. The caption generation module uses a lexical fully convolutional neural network (Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable mapping between video regions and lexical labels, from which it generates video captions. The video reconstruction module then synthesizes visual sequences to reproduce the raw videos from the outputs of the caption generation module. A dual learning mechanism fine-tunes both modules according to the gap between the raw and reproduced videos. Our approach can thus narrow the semantic gap between raw videos and generated captions by minimizing the differences between the reproduced and raw visual sequences. Experimental results on a benchmark dataset demonstrate that MIMLDL improves the accuracy of video captioning.
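To make the training objective concrete, the sketch below gives one possible reading of the encoder-decoder-reconstructor setup in PyTorch: a caption generation module produces word logits and decoder states, a reconstruction module maps those states back to frame-level features, and a joint loss penalizes both caption errors and the gap between the raw and reproduced visual sequences. All module names, dimensions, and the loss weight `lam` are illustrative assumptions rather than the authors' implementation; the Lexical FCN and the multi-instance multi-label mechanism are abstracted into a generic encoder here.

```python
# Minimal sketch of an encoder-decoder-reconstructor training objective.
# Module names (CaptionGenerator, VideoReconstructor) and all hyperparameters
# are hypothetical stand-ins, not the paper's actual implementation.
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Stand-in for the Lexical-FCN-based caption generation module."""
    def __init__(self, feat_dim=2048, hidden=512, vocab=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, max_len=20):
        _, h = self.encoder(frame_feats)            # summarize the video
        dec_in = h.transpose(0, 1).repeat(1, max_len, 1)
        dec_states, _ = self.decoder(dec_in)        # one state per word slot
        return self.classifier(dec_states), dec_states

class VideoReconstructor(nn.Module):
    """Maps decoder hidden states back to frame-level visual features."""
    def __init__(self, hidden=512, feat_dim=2048):
        super().__init__()
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, dec_states, num_frames):
        out, _ = self.rnn(dec_states)
        pooled = out.mean(dim=1, keepdim=True)      # global caption summary
        return self.proj(pooled).repeat(1, num_frames, 1)

def dual_learning_step(gen, rec, frame_feats, captions, lam=0.2):
    """One joint update: caption loss plus reconstruction loss."""
    logits, dec_states = gen(frame_feats, max_len=captions.size(1))
    cap_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
    recon = rec(dec_states, frame_feats.size(1))
    rec_loss = nn.functional.mse_loss(recon, frame_feats)  # gap between raw
    return cap_loss + lam * rec_loss                       # and reproduced video

# Toy usage: gradients flow through both modules, so minimizing the combined
# loss fine-tunes generation and reconstruction jointly.
gen, rec = CaptionGenerator(), VideoReconstructor()
feats = torch.randn(4, 30, 2048)            # 4 clips, 30 frame features each
caps = torch.randint(0, 10000, (4, 20))     # toy caption token ids
loss = dual_learning_step(gen, rec, feats, caps)
loss.backward()
```

The key design point this sketch illustrates is that the reconstruction loss acts as a regularizer on the caption decoder: decoder states that discard visual content cannot reproduce the raw frame features, so minimizing both terms pushes the generated captions to stay semantically faithful to the video.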