Abstract
Visual attention in Visual Question Answering (VQA) aims to locate the image regions that are relevant to answer prediction, offering a powerful technique for promoting multi-modal understanding. However, recent studies have pointed out that the image regions highlighted by visual attention are often irrelevant to the given question and answer, confusing the model and hindering correct visual reasoning. To tackle this problem, existing methods mostly resort to aligning the visual attention weights with human attention. Nevertheless, gathering such human data is laborious and expensive, making it burdensome to adapt well-developed models across datasets. To address this issue, in this article, we devise a novel visual attention regularization approach, named AttReg, for better visual grounding in VQA. Specifically, AttReg first identifies the image regions that are essential for question answering yet unexpectedly ignored (i.e., assigned low attention weights) by the backbone model. A mask-guided learning scheme is then leveraged to regularize the visual attention to focus more on these ignored key regions. The proposed method is flexible and model-agnostic: it can be integrated into most visual attention-based VQA models and requires no human attention supervision. Extensive experiments on three benchmark datasets, i.e., VQA-CP v2, VQA-CP v1, and VQA v2, have been conducted to evaluate the effectiveness of AttReg. As a by-product, incorporating AttReg into the strong baseline LMH achieves a new state-of-the-art accuracy of 60.00%, an absolute gain of 7.01%, on the VQA-CP v2 benchmark. Beyond this effectiveness validation, we observe that the faithfulness of visual attention in VQA has not been well explored in the literature. In light of this, we empirically validate this property of visual attention and compare it with prevalent gradient-based approaches.
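To make the regularization idea concrete, below is a minimal PyTorch sketch of an AttReg-style penalty on ignored key regions. The function name, the fixed attention threshold, and the way key regions are supplied are illustrative assumptions; the paper's actual mask-guided learning scheme may differ in form.

```python
import torch
import torch.nn.functional as F

def attention_regularizer(att_weights, key_region_mask, ignore_threshold=0.1):
    """Encourage attention on essential-but-ignored regions (illustrative sketch).

    att_weights:      (B, R) softmax attention over R image regions.
    key_region_mask:  (B, R) binary mask, 1 where a region is judged essential
                      to the question (hypothetical criterion, e.g., region
                      labels matching question nouns).
    ignore_threshold: key regions with attention below this value are treated
                      as "ignored" by the backbone (assumed hyperparameter).
    """
    # Flag key regions the backbone currently ignores (low attention weight).
    ignored = key_region_mask * (att_weights < ignore_threshold).float()
    # Penalize the attention mass missing from those regions, nudging
    # subsequent updates to attend to them more strongly.
    deficit = F.relu(ignore_threshold - att_weights) * ignored
    return deficit.sum(dim=1).mean()

# Toy usage: batch of 2 samples, 4 candidate regions each.
logits = torch.randn(2, 4, requires_grad=True)
att = torch.softmax(logits, dim=1)
key = torch.tensor([[1., 0., 1., 0.],
                    [0., 1., 0., 0.]])
reg_loss = attention_regularizer(att, key)
reg_loss.backward()  # gradients flow back into the attention logits
# total_loss = vqa_loss + lambda_reg * reg_loss  # joint training objective
```

In practice, the binary key-region mask could be derived automatically, for instance by matching detected object labels against nouns in the question, and the penalty would be added to the standard VQA loss with a weighting coefficient, so no human attention annotations are required.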