
Causal Inference with Knowledge Distilling and Curriculum Learning for Unbiased VQA

Published: 04 March 2022

Abstract

Many recent Visual Question Answering (VQA) models rely on correlations between questions and answers while neglecting those between the visual and the textual information. As a result, they perform poorly when the test data are distributed differently from the training data, i.e., on out-of-distribution (OOD) data. To this end, we propose a two-stage unbiased VQA approach that addresses the bias issue from a causal perspective. In the causal inference stage, we mark the spurious correlation on the causal graph, explore counterfactual causality, and devise a causal target based on the inherent correlations between the conventional and counterfactual VQA models. In the distillation stage, we introduce the causal target into the training process and leverage knowledge distillation as well as curriculum learning to obtain an unbiased model. Since Causal Inference with Knowledge Distilling and Curriculum Learning (CKCL) reinforces the contribution of the visual information and eliminates the impact of the spurious correlation by distilling the knowledge of causal inference into the VQA model, it performs well on both standard and out-of-distribution data. Extensive experiments on the VQA-CP v2 dataset demonstrate the superior performance of the proposed method compared to state-of-the-art (SotA) methods.
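As a rough illustration of the two stages described above, the sketch below builds a debiased "causal" teacher target by subtracting a question-only (language-prior) effect from the full model's effect, distills it into a student via a temperature-scaled KL loss, and applies a simple curriculum weight. The function names, the linear schedule, and the exact debiasing rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_target(logits_full, logits_question_only):
    """Hypothetical debiased teacher target: remove the question-only
    (language-prior) effect from the full model's effect before
    normalizing, in the spirit of counterfactual VQA."""
    return softmax(logits_full - logits_question_only)

def distill_loss(student_logits, teacher_probs, temperature=2.0):
    """KL divergence from the softened student distribution to the
    causal teacher target (small epsilon for numerical safety)."""
    q = softmax(student_logits / temperature)
    p = teacher_probs
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def curriculum_weight(epoch, total_epochs):
    """Illustrative linear schedule: the weight on the soft causal
    target decays, shifting training toward the hard-label loss."""
    return max(0.0, 1.0 - epoch / total_epochs)
```

In a training loop, the total loss would be something like `w * distill_loss(...) + (1 - w) * cross_entropy(...)` with `w = curriculum_weight(epoch, total_epochs)`, so early epochs emphasize the debiased teacher and later epochs the ground-truth answers.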


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3
  August 2022, 478 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3505208

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 4 March 2022
          • Accepted: 1 August 2021
          • Revised: 1 July 2021
          • Received: 1 April 2021
Published in TOMM Volume 18, Issue 3

          Qualifiers

          • research-article
          • Refereed
