research-article

Recurrent Attention Network with Reinforced Generator for Visual Dialog

Published: 05 July 2020

Abstract

In Visual Dialog, an agent has to parse temporal context in the dialog history and spatial context in the image to hold a meaningful dialog with humans. For example, to answer “what is the man on her left wearing?” the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as “her,” (2) parse the image to attend to “her,” and (3) uncover the spatial context to shift the attention to “her left” and check the apparel of the man. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Because the question and the image are usually complex, the question is difficult to ground in a single glimpse, so the attention processor attends to the image multiple times to better collect visual information. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm, which lacks sentence-level supervision. To address this, we propose to reinforce G at the sentence level using the discriminative model (D), which is trained to select the right answer from a set of candidates. Experimental results on the VisDial dataset demonstrate the effectiveness of our approach.
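To make the recurrent-attention idea concrete, here is a minimal sketch (in PyTorch) of a multi-glimpse attention module: it attends to spatial image features several times, conditioned on a query that fuses the question with the dialog-history state, and refines the query after each glimpse. The class name, the dimensions, and the additive scoring function are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGlimpseAttention(nn.Module):
    """Illustrative sketch only: attend to image regions several times,
    refining the query (question + dialog-history state) after each glimpse."""

    def __init__(self, img_dim, query_dim, hidden_dim, num_glimpses=2):
        super().__init__()
        self.num_glimpses = num_glimpses
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        self.refine = nn.Linear(img_dim, query_dim)  # folds the glimpse back into the query

    def forward(self, img_feats, query):
        # img_feats: (batch, num_regions, img_dim) spatial CNN features
        # query:     (batch, query_dim) fused question + dialog-history state
        for _ in range(self.num_glimpses):
            # additive attention over regions, conditioned on the current query
            joint = torch.tanh(self.img_proj(img_feats) + self.query_proj(query).unsqueeze(1))
            weights = F.softmax(self.score(joint), dim=1)   # (batch, num_regions, 1)
            glimpse = (weights * img_feats).sum(dim=1)      # (batch, img_dim)
            query = query + self.refine(glimpse)            # refine the query for the next glimpse
        return query
```

Each pass can sharpen where the model looks (e.g., first “her,” then the man on “her left”), which is why a single glimpse is often insufficient for compound questions.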
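Likewise, a hedged sketch of reinforcing G at the sentence level with D: the generator samples a complete answer, the discriminator's score on that answer serves as the reward, and a greedily decoded answer supplies a baseline (a self-critical variant of REINFORCE). The methods generator.sample, generator.greedy_decode, and discriminator.score are hypothetical interfaces assumed for illustration, not the paper's actual code.

```python
import torch

def reinforce_generator_step(generator, discriminator, question, history, image, optimizer):
    """Hypothetical sketch: reinforce the generator at the sentence level,
    rewarding sampled answers that the discriminator scores above a greedy baseline."""
    # sample an answer and keep per-token log-probabilities: (batch, T) each
    sampled_ids, log_probs = generator.sample(question, history, image)
    with torch.no_grad():
        greedy_ids, _ = generator.greedy_decode(question, history, image)      # baseline answer
        reward = discriminator.score(question, history, image, sampled_ids)    # (batch,)
        baseline = discriminator.score(question, history, image, greedy_ids)   # (batch,)
    # REINFORCE: increase the log-probability of samples that beat the baseline
    advantage = reward - baseline
    loss = -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Subtracting the greedy baseline keeps the gradient estimate low-variance, so the generator is only pushed toward sentences that D prefers over the generator's own greedy output.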




    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3
      August 2020, 364 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3409646

      Copyright © 2020 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 5 July 2020
      • Online AM: 7 May 2020
      • Accepted: 1 March 2020
      • Revised: 1 October 2019
      • Received: 1 November 2018


      Qualifiers

      • research-article
      • Research
      • Refereed
