Research Article

A Multi-instance Multi-label Dual Learning Approach for Video Captioning

Published: 14 June 2021

Abstract

Video captioning is a challenging multimedia processing task that aims to generate informative natural language descriptions (captions) of video content. Previous approaches mainly focused on capturing visual information with an encoder-decoder structure. Recently, an encoder-decoder-reconstructor structure was proposed that exploits information in both the videos and the captions. Building on this structure, this article proposes a novel multi-instance multi-label dual learning approach (MIMLDL) for video captioning. MIMLDL consists of two modules: a caption generation module and a video reconstruction module. The caption generation module uses a lexical fully convolutional neural network (Lexical FCN) with a weakly supervised multi-instance multi-label learning mechanism to learn a translatable mapping between video regions and lexical labels and thereby generate captions. The video reconstruction module then synthesizes visual sequences that reproduce the raw videos from the outputs of the caption generation module. A dual learning mechanism fine-tunes both modules according to the gap between the raw and the reproduced videos. In this way, the approach narrows the semantic gap between raw videos and generated captions by minimizing the difference between the reproduced and the raw visual sequences. Experimental results on a benchmark dataset demonstrate that MIMLDL improves the accuracy of video captioning.
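To make the encoder-decoder-reconstructor training loop described above concrete, the sketch below shows one joint update step in PyTorch. It is a minimal illustration, not the paper's implementation: the Lexical FCN with its weakly supervised multi-instance multi-label mechanism is replaced by a plain GRU encoder-decoder, and all module names, feature dimensions, and the reconstruction weight lam are assumptions made for this example.

```python
# Minimal PyTorch sketch of an encoder-decoder-reconstructor training step.
# All names and hyperparameters are illustrative assumptions, not the paper's code;
# the Lexical-FCN/MIML caption generator is simplified to a GRU encoder-decoder.
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Encoder-decoder stand-in for the caption generation module."""
    def __init__(self, feat_dim=2048, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.decoder = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, frame_feats, captions):
        _, h = self.encoder(frame_feats)            # summarize video frames
        dec_states, _ = self.decoder(self.embed(captions), h)
        return self.out(dec_states), dec_states     # word logits + decoder states

class VideoReconstructor(nn.Module):
    """Reproduces frame features from the decoder's hidden states."""
    def __init__(self, hid_dim=512, feat_dim=2048, num_frames=8):
        super().__init__()
        self.num_frames = num_frames
        self.rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, feat_dim)

    def forward(self, dec_states):
        states, _ = self.rnn(dec_states)
        pooled = states.mean(dim=1, keepdim=True)   # caption-level summary
        return self.proj(pooled.expand(-1, self.num_frames, -1))

def dual_learning_step(gen, rec, optimizer, frame_feats, captions, lam=0.2):
    """One joint update: caption cross-entropy + video reconstruction loss."""
    logits, dec_states = gen(frame_feats, captions[:, :-1])
    cap_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
    rec_loss = nn.functional.mse_loss(rec(dec_states), frame_feats)
    loss = cap_loss + lam * rec_loss    # gap between raw and reproduced video
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return cap_loss.item(), rec_loss.item()

if __name__ == "__main__":
    gen, rec = CaptionGenerator(), VideoReconstructor()
    opt = torch.optim.Adam(list(gen.parameters()) + list(rec.parameters()), lr=1e-4)
    feats = torch.randn(4, 8, 2048)              # batch of 4 videos, 8 frame features
    caps = torch.randint(0, 10000, (4, 12))      # token ids, <bos> ... <eos>
    print(dual_learning_step(gen, rec, opt, feats, caps))
```

In this simplified view, the reconstruction loss plays the role of the dual signal: driving the reproduced frame features toward the raw ones encourages the decoder states, and hence the generated captions, to retain the video's visual semantics.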

References

  1. B. Wang, L. Ma, W. Zhang, and W. Liu. 2018. Reconstruction network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7622–7631.
  2. J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan. 2018. M3: Multimodal memory modelling for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7512–7520.
  3. C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai. 2019. STAT: Spatial-temporal attention mechanism for video captioning. IEEE Trans. Multim. 22, 1 (2019), 229–241.
  4. A. Wang, H. Hu, and L. Yang. 2018. Image captioning with affective guiding and selective attention. ACM Trans. Multim. Comput., Commun., Applic. 14, 3 (2018), 1–15.
  5. L. Yang, H. Hu, S. Xing, and X. Lu. 2020. Constrained LSTM and residual attention for image captioning. ACM Trans. Multim. Comput., Commun., Applic. 16, 3 (2020), 1–18.
  6. J. Wu, H. Hu, and L. Yang. 2019. Pseudo-3D attention transfer network with content-aware strategy for image captioning. ACM Trans. Multim. Comput., Commun., Applic. 15, 3 (2019), 1–19.
  7. A. Kojima, T. Tamura, and K. Fukunaga. 2002. Natural language description of human activities from video images based on concept hierarchy of actions. Int. J. Comput. Vis. 50, 2 (2002), 171–184.
  8. M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. 2013. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision. 433–440.
  9. R. Xu, C. Xiong, W. Chen, and J. J. Corso. 2015. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. 1–7.
  10. J. Ma, R. Wang, W. Ji, H. Zheng, E. Zhu, and J. Yin. 2019. Relational recurrent neural networks for polyphonic sound event detection. Multim. Tools Applic. 78, 20 (2019), 29509–29527.
  11. Y. Wu, X. Ji, W. Ji, Y. Tian, and H. Zhou. 2020. CASR: A context-aware residual network for single-image super-resolution. Neural Comput. Applic. 32, 6 (2020), 14533–14548.
  12. Z. Liu, Z. Li, M. Zong, W. Ji, R. Wang, and Y. Tian. 2019. Spatiotemporal saliency based multi-stream networks for action recognition. In Proceedings of the Asian Conference on Pattern Recognition. 74–84.
  13. S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1494–1504.
  14. S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. 2015. Sequence to sequence-video to text. In Proceedings of the IEEE International Conference on Computer Vision. 4534–4542.
  15. C. Zhang and Y. Tian. 2016. Automatic video description generation via LSTM with joint two-stream encoding. In Proceedings of the 23rd International Conference on Pattern Recognition. 2924–2929.
  16. L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507–4515.
  17. Z. Wu, T. Yao, Y. Fu, and Y. Jiang. 2017. Deep learning for video classification and captioning. In Frontiers of Multimedia Research. ACM, 3–29.
  18. Z. Shen, J. Li, Z. Su, M. Li, Y. Chen, Y. Jiang, and X. Xue. 2017. Weakly supervised dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1916–1924.
  19. T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89, 1-2 (1997), 31–71.
  20. P. Shamsolmoali, M. Zareapoor, H. Zhou, and J. Yang. 2020. AMIL: Adversarial multi-instance learning for human pose estimation. ACM Trans. Multim. Comput., Commun., Applic. 16, 1 (2020), 1–23.
  21. X. Zhang, H. Shi, C. Li, and P. Li. 2020. Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos. In Proceedings of the AAAI Conference on Artificial Intelligence. 12886–12893.
  22. P. Luo, G. Wang, L. Lin, and X. Wang. 2017. Deep dual learning for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision. 2718–2726.
  23. Y. Xia, J. Bian, T. Qin, N. Yu, and T. Liu. 2017. Dual inference for machine learning. In Proceedings of the International Joint Conference on Artificial Intelligence. 3112–3118.
  24. Z. Yi, H. Zhang, P. Tan, and M. Gong. 2017. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision. 2849–2857.
  25. T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning. 1857–1865.
  26. J. Zhu, T. Park, P. Isola, and A. A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223–2232.
  27. D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W. Ma. 2016. Dual learning for machine translation. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 820–828.
  28. Y. Wang, Y. Xia, L. Zhao, J. Bian, T. Qin, G. Liu, and T. Liu. 2018. Dual transfer learning for neural machine translation with marginal distribution regularization. In Proceedings of the AAAI Conference on Artificial Intelligence. 1–7.
  29. G. Lample, A. Conneau, L. Denoyer, and M. A. Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In Proceedings of the International Conference on Learning Representations. 1–14.
  30. M. Artetxe, G. Labaka, E. Agirre, and K. Cho. 2018. Unsupervised neural machine translation. In Proceedings of the International Conference on Learning Representations. 1–12.
  31. Y. Wang, Y. Xia, T. He, F. Tian, T. Qin, C. Zhai, and T. Liu. 2019. Multi-agent dual learning. In Proceedings of the International Conference on Learning Representations. 1–15.
  32. Z. Zhao, Y. Xia, T. Qin, and T. Liu. 2019. Dual learning: Theoretical study and algorithmic extensions. In Proceedings of the International Conference on Learning Representations. 1–16.
  33. Y. Xia, T. Qin, W. Chen, J. Bian, N. Yu, and T. Liu. 2017. Dual supervised learning. In Proceedings of the International Conference on Machine Learning. 3789–3798.
  34. Y. Xia, X. Tan, F. Tian, T. Qin, N. Yu, and T. Liu. 2018. Model-level dual learning. In Proceedings of the International Conference on Machine Learning. 5383–5392.
  35. W. Zhao, W. Xu, M. Yang, J. Ye, Z. Zhao, Y. Feng, and Y. Qiao. 2017. Dual learning for cross-domain image captioning. In Proceedings of the ACM on Conference on Information and Knowledge Management. 29–38.
  36. K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. 173–180.
  37. H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao et al. 2015. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1473–1482.
  38. L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. 2016. Deep compositional captioning: Describing novel object categories without paired training data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–10.
  39. D. Heckerman. 1990. A tractable inference algorithm for diagnosing multiple diseases. Mach. Intell. Pattern Recog. 10 (1990), 163–171.
  40. M. Gygli, H. Grabner, and L. V. Gool. 2015. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3090–3098.
  41. J. Xu, T. Mei, T. Yao, and Y. Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.
  42. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 5998–6008.
  43. S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
  44. K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
  45. C. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. Association for Computational Linguistics, 74–81.
  46. X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).
  47. C. Tang, X. Liu, S. An, and P. Wang. 2020. BR2Net: Defocus blur detection via bidirectional channel attention residual refining network. IEEE Trans. Multim. DOI: 10.1109/TMM.2020.2985541.
  48. C. Tang, X. Liu, P. Wang, C. Zhang, M. Li, and L. Wang. 2019. Adaptive hypergraph embedded semi-supervised multi-label image annotation. IEEE Trans. Multim. 21, 11 (2019), 2837–2849.
  49. X. Liu, L. Wang, J. Zhang, J. Yin, and H. Liu. 2013. Global and local structure preservation for feature selection. IEEE Trans. Neural Netw. Learn. Syst. 25, 6 (2013), 1083–1095.
  50. Y. Tian, X. Wang, J. Wu, R. Wang, and B. Yang. 2019. Multi-scale hierarchical residual network for dense captioning. J. Artif. Intell. Res. 64 (2019), 181–196.
