Abstract
Many recent Visual Question Answering (VQA) models rely on the correlations between questions and answers yet neglect those between the visual information and the textual information. They perform poorly when the data they encounter are distributed differently from the training data (i.e., on out-of-distribution (OOD) data). To this end, we propose a two-stage unbiased VQA approach that addresses the bias issue from a causal perspective. In the causal inference stage, we mark the spurious correlation on the causal graph, explore the counterfactual causality, and devise a causal target based on the inherent correlations between the conventional and counterfactual VQA models. In the distillation stage, we introduce the causal target into the training process and leverage distillation as well as curriculum learning to obtain the unbiased model. Because Causal Inference with Knowledge Distilling and Curriculum Learning (CKCL) reinforces the contribution of the visual information and eliminates the impact of the spurious correlation by distilling the knowledge from causal inference into the VQA model, it performs well on both standard and out-of-distribution data. Extensive experimental results on the VQA-CP v2 dataset demonstrate the superior performance of the proposed method compared to state-of-the-art (SotA) methods.
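The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption made for illustration: the causal target is taken to be the conventional (fused) logits minus the logits of a question-only counterfactual branch, the distillation term is a KL divergence, and the curriculum is a linear ramp that shifts weight from the ground-truth loss to the distillation loss. The function names (`causal_target`, `curriculum_weight`, `training_loss`) are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_target(fused_logits, question_only_logits):
    """Assumed causal target: remove the question-only (counterfactual)
    contribution from the conventional prediction, then normalize."""
    debiased = [f - q for f, q in zip(fused_logits, question_only_logits)]
    return softmax(debiased)


def kl_divergence(target, student, eps=1e-12):
    """KL(target || student), used here as the assumed distillation loss."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(target, student)
    )


def curriculum_weight(epoch, total_epochs):
    """Assumed curriculum: linear ramp from 0 (first epoch) to 1 (last)."""
    if total_epochs <= 1:
        return 1.0
    return min(1.0, epoch / (total_epochs - 1))


def training_loss(student_logits, label_index, target, epoch, total_epochs):
    """Blend cross-entropy on the label with distillation to the target."""
    probs = softmax(student_logits)
    ce = -math.log(probs[label_index] + 1e-12)
    kd = kl_divergence(target, probs)
    w = curriculum_weight(epoch, total_epochs)
    return (1.0 - w) * ce + w * kd


# Toy example with three candidate answers. The question-only branch
# strongly favors answer 0, which is the language prior to be removed.
fused = [2.0, 1.0, 0.5]
question_only = [1.5, 0.0, 0.0]
target = causal_target(fused, question_only)
loss_start = training_loss([0.0, 0.0, 0.0], 1, target, epoch=0, total_epochs=10)
```

In this toy case the conventional prediction prefers answer 0, while the assumed causal target prefers answer 1 once the question-only contribution is subtracted. Early epochs are dominated by the ground-truth loss and later epochs by the distillation term; the actual schedule and target used by CKCL may differ.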
Index Terms
Causal Inference with Knowledge Distilling and Curriculum Learning for Unbiased VQA