Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning

Abstract
Generating a novel and descriptive caption for an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM, Long Short-Term Memory) model to address the problem. By combining a deep convolutional neural network (CNN) with two separate LSTM networks, our model learns long-term visual-language interactions, exploiting both history and future context in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transitions in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent overfitting when training deep models. To understand how our models “translate” an image into a sentence, we visualize and qualitatively analyze the evolution of the Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that the Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval, even without integrating an additional mechanism such as object detection or an attention model. Our experiments also show that multi-task learning is beneficial for increasing model generality and improving performance. Finally, we demonstrate that the Bi-LSTM model, used in a transfer-learning setting, significantly outperforms previous methods on the Pascal1K dataset.
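The core architecture described above (a CNN feeding two separate LSTMs that read the caption in opposite directions) can be made concrete with a short sketch. The following is a hypothetical PyTorch reconstruction, not the authors' implementation; the class name, the 4096-d feature size (e.g., a VGG fc7 vector), and the 512-d embedding and hidden sizes are all assumptions.

    import torch
    import torch.nn as nn

    class BiLSTMCaptioner(nn.Module):  # hypothetical name, not the paper's code
        def __init__(self, vocab_size, feat_dim=4096, embed_dim=512, hidden_dim=512):
            super().__init__()
            self.img_proj = nn.Linear(feat_dim, embed_dim)  # CNN features -> embedding space
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # history context
            self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # future context
            self.out = nn.Linear(2 * hidden_dim, vocab_size)  # merge both directions

        def forward(self, cnn_feats, captions):
            # Prepend the projected image feature as the first "token" of the sequence.
            img = self.img_proj(cnn_feats).unsqueeze(1)           # (B, 1, E)
            seq = torch.cat([img, self.embed(captions)], dim=1)   # (B, T+1, E)
            h_fwd, _ = self.fwd_lstm(seq)                         # reads left to right
            h_bwd, _ = self.bwd_lstm(torch.flip(seq, dims=[1]))   # reads right to left
            h_bwd = torch.flip(h_bwd, dims=[1])                   # re-align to forward order
            return self.out(torch.cat([h_fwd, h_bwd], dim=-1))    # (B, T+1, vocab)

    # Usage: image features from a pretrained CNN, captions as word-index tensors.
    model = BiLSTMCaptioner(vocab_size=10000)
    logits = model(torch.randn(2, 4096), torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 13, 10000])

Concatenating the forward and backward hidden states at each step is one simple way to expose both past and future context to the word classifier; deeper variants would stack further nonlinear layers between the CNN features and the LSTMs.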
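The three augmentations named in the abstract can likewise be sketched. This is a minimal illustration using torchvision; the specific scales and the 224x224 crop size are assumptions, not values reported by the paper.

    import random
    from PIL import Image
    from torchvision import transforms

    def augment(img: Image.Image) -> Image.Image:
        scale = random.choice([256, 288, 320])             # multi-scale: vary the shorter side
        img = transforms.Resize(scale)(img)
        img = transforms.RandomCrop(224)(img)              # multi-crop: random fixed-size crop
        return transforms.RandomVerticalFlip(p=0.5)(img)   # vertical mirror

    out = augment(Image.new("RGB", (400, 400)))  # dummy image for demonstration
    print(out.size)  # (224, 224)

Applying such transforms at training time multiplies the effective number of training images, which helps prevent the overfitting mentioned above.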