Abstract
In Visual Dialog, an agent must parse the temporal context in the dialog history and the spatial context in the image to hold a meaningful dialog with humans. For example, to answer “what is the man on her left wearing?” the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as “her,” (2) parse the image to attend to “her,” and (3) uncover the spatial context to shift the attention to “her left” and check the apparel of the man there. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Because the question and the image are usually complex, grounding the question with a single glimpse is difficult; the attention processor therefore attends to the image multiple times to better collect visual information. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm and thus lacks sentence-level supervision. To ameliorate this problem, we propose to reinforce G at the sentence level using the discriminative model (D), which aims to select the right answer from a set of candidates. Experimental results on the VisDial dataset demonstrate the effectiveness of our approach.
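The multiple-glimpse attention described above can be sketched as follows. This is a minimal illustration under assumed details, not the paper's implementation: the projection matrix `W`, the additive query refinement, and the function name `multi_glimpse_attention` are hypothetical choices for exposition.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_glimpse_attention(V, q, W, num_glimpses=2):
    """Attend over region features V (n_regions x d) several times,
    refining the query vector q (d,) after each glimpse.
    W (d x d) is a hypothetical projection mixing query and regions."""
    for _ in range(num_glimpses):
        scores = V @ (W @ q)      # relevance of each region to the current query
        alpha = softmax(scores)   # attention weights over regions
        context = alpha @ V       # weighted sum of attended region features
        q = q + context           # refine the query with the attended context
    return q, alpha
```

Each glimpse re-scores the image regions against a query that already incorporates what the previous glimpse gathered, so later glimpses can shift attention (e.g. from “her” to “her left”) rather than re-attend the same region.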
Recurrent Attention Network with Reinforced Generator for Visual Dialog