
Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning

Published: 25 April 2018

Abstract

Generating a novel and descriptive caption for an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM, where LSTM stands for Long Short-Term Memory) model to address the problem. By combining a deep convolutional neural network (CNN) with two separate LSTM networks, our model is capable of learning long-term visual-language interactions by exploiting both history and future context in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transitions in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirroring are proposed to prevent overfitting when training deep models. To understand how our models “translate” an image into a sentence, we visualize and qualitatively analyze the evolution of the Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that the Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval, even without integrating additional mechanisms (e.g., object detection or attention models). Our experiments also show that multi-task learning increases model generality and improves performance. Finally, we demonstrate that transfer learning with the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
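The abstract describes pairing a CNN encoder with two separate LSTMs that read the caption in opposite directions, so that each prediction can draw on both history and future context. As a rough illustration only, the sketch below (written in PyTorch, which the paper does not necessarily use; the class name, dimensions, and the VGG-16 backbone are assumptions, not the authors' released code) shows one way such a bidirectional captioning model could be wired up.

```python
# Minimal sketch of a bidirectional LSTM captioner: a CNN encodes the image,
# two separate LSTMs read the word sequence forward and backward, and their
# hidden states are fused at each step before predicting the next word.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class BiLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # CNN encoder (VGG-16 used here only as a stand-in backbone).
        vgg = models.vgg16(weights=None)
        self.cnn = vgg.features
        self.img_proj = nn.Linear(512 * 7 * 7, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two separate LSTMs: one reads the sequence left-to-right,
        # the other right-to-left.
        self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Encode the image and prepend it as the first "token" of the sequence.
        feats = self.cnn(images).flatten(1)                  # (B, 512*7*7)
        img_emb = self.img_proj(feats).unsqueeze(1)          # (B, 1, E)
        word_emb = self.embed(captions)                      # (B, T, E)
        seq = torch.cat([img_emb, word_emb], dim=1)          # (B, T+1, E)

        fwd_h, _ = self.fwd_lstm(seq)                        # history context
        bwd_h, _ = self.bwd_lstm(torch.flip(seq, dims=[1]))  # future context
        bwd_h = torch.flip(bwd_h, dims=[1])

        # Fuse both directions at each step and predict vocabulary logits.
        return self.out(torch.cat([fwd_h, bwd_h], dim=-1))
```

For example, `model = BiLSTMCaptioner(vocab_size=10000)` followed by `logits = model(images, captions)`, with `images` of shape `(B, 3, 224, 224)` and integer `captions` of shape `(B, T)`, yields per-step logits of shape `(B, T+1, 10000)`. The two LSTMs share no parameters; their hidden states are simply concatenated before the output layer, mirroring the abstract's description of a CNN combined with two separate LSTM networks.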

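The abstract also lists multi-crop, multi-scale, and vertical-mirror augmentation as regularizers when training deep models. Below is a minimal sketch of such a training-time pipeline using standard torchvision transforms; the crop size, scale range, and flip probability are illustrative assumptions rather than the paper's settings.

```python
# Rough approximation of the augmentation recipe named in the abstract:
# random scaled crops stand in for multi-crop/multi-scale, plus vertical mirroring.
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),   # multi-crop + multi-scale (assumed range)
    T.RandomVerticalFlip(p=0.5),                  # vertical mirror
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],       # standard ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
```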
