
Answer Questions with Right Image Regions: A Visual Attention Regularization Approach

Published: 04 March 2022

Abstract

Visual attention in Visual Question Answering (VQA) aims to locate the image regions relevant to answer prediction, offering a powerful technique for promoting multi-modal understanding. However, recent studies have pointed out that the image regions highlighted by visual attention are often irrelevant to the given question and answer, confusing the model and hindering correct visual reasoning. To tackle this problem, existing methods mostly resort to aligning the visual attention weights with human attention. Nevertheless, gathering such human data is laborious and expensive, making it burdensome to adapt well-developed models across datasets. To address this issue, in this article we devise a novel visual attention regularization approach, namely AttReg, for better visual grounding in VQA. Specifically, AttReg first identifies the image regions that are essential for question answering yet unexpectedly ignored (i.e., assigned low attention weights) by the backbone model. A mask-guided learning scheme is then leveraged to regularize the visual attention to focus more on these ignored key regions. The proposed method is flexible and model-agnostic: it can be integrated into most visual attention-based VQA models and requires no human attention supervision. Extensive experiments over three benchmark datasets, i.e., VQA-CP v2, VQA-CP v1, and VQA v2, have been conducted to evaluate the effectiveness of AttReg. As a by-product, when incorporating AttReg into the strong baseline LMH, our approach achieves a new state-of-the-art accuracy of 60.00%, an absolute performance gain of 7.01%, on the VQA-CP v2 benchmark dataset. In addition to this effectiveness validation, we recognize that the faithfulness of visual attention in VQA has not been well explored in the literature. In light of this, we propose to empirically validate this property of visual attention and compare it with prevalent gradient-based approaches.
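The two-step idea in the abstract — first find regions that are essential for answering yet received low attention, then penalize the model for ignoring them — can be sketched as a simple regularization term. This is a minimal illustrative sketch, not the authors' implementation: the attention threshold, the loss form, and the `key_regions` mask (which the paper derives without human supervision) are all assumptions made here for clarity.

```python
import numpy as np

def attreg_loss(attention, key_regions, low_thresh=0.1):
    """Hypothetical sketch of an AttReg-style attention penalty.

    attention   -- (num_regions,) attention weights from the backbone VQA model
    key_regions -- (num_regions,) binary mask of regions deemed essential
    low_thresh  -- weight below which a key region counts as 'ignored'
    """
    attention = np.asarray(attention, dtype=float)
    key_regions = np.asarray(key_regions)
    # Step 1: essential regions the model unexpectedly ignored.
    ignored = (key_regions == 1) & (attention < low_thresh)
    if not ignored.any():
        return 0.0
    # Step 2: penalize the low attention on those regions, nudging the
    # model to shift attention mass back onto them during training.
    return float(-np.log(attention[ignored] + 1e-8).mean())

keys = np.array([1, 0, 0])          # region 0 is essential to the answer
print(attreg_loss([0.05, 0.50, 0.45], keys))  # region 0 ignored: positive penalty
print(attreg_loss([0.60, 0.25, 0.15], keys))  # region 0 attended: 0.0
```

In a full training loop, a term like this would be added to the standard VQA answer-classification loss, weighted by a regularization coefficient.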



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 4
November 2022, 497 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3514185
Editor: Abdulmotaleb El Saddik

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

Publication History

• Published: 4 March 2022
• Accepted: 1 November 2021
• Revised: 1 October 2021
• Received: 1 February 2021

Published in TOMM Volume 18, Issue 4

          Qualifiers

          • research-article
          • Refereed
