DOI: 10.1145/3394171.3413998
Research Article

K-armed Bandit based Multi-Modal Network Architecture Search for Visual Question Answering

Published: 12 October 2020

ABSTRACT

In this paper, we propose a cross-modal network architecture search (NAS) algorithm for visual question answering (VQA), termed k-Armed Bandit based NAS (KAB-NAS). KAB-NAS regards the design of each layer as a k-armed bandit problem and updates the preference of each candidate operation through numerous samplings in a single-shot search framework. To establish an effective search space, we further propose a new architecture, termed the Automatic Graph Attention Network (AGAN), which extends the popular self-attention layer with three graph structures, denoted as dense-graph, co-graph, and separate-graph. These graph layers determine the direction of information propagation in the graph network, and their optimal combinations are searched by KAB-NAS. To evaluate KAB-NAS and AGAN, we conduct extensive experiments on two VQA benchmark datasets, i.e., VQA2.0 and GQA, and also test AGAN with the popular BERT-style pre-training. The experimental results show that, with the help of KAB-NAS, AGAN achieves state-of-the-art performance on both benchmarks with far fewer parameters and computations.
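To make the search procedure concrete, the following is a minimal, illustrative sketch of the k-armed bandit idea described in the abstract. It is not the authors' implementation: the LayerBandit class, the epsilon-greedy sampling rule, the running-mean preference update, and the evaluate_subnet callback are all assumptions introduced here to show how each layer's choice among the three graph-attention candidates might be sampled and its preference updated in a single-shot setting.

import random

# Illustrative sketch only (not the paper's released code). Assumes a hypothetical
# single-shot supernet in which every searchable layer picks one of three candidate
# graph-attention operations, and each layer's choice is modeled as a k-armed bandit
# whose preference is updated from the reward of the sampled sub-network.

CANDIDATES = ["dense-graph", "co-graph", "separate-graph"]

class LayerBandit:
    """One k-armed bandit per searchable layer; arms are candidate operations."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms          # how often each arm was sampled
        self.preferences = [0.0] * n_arms   # running mean reward per arm

    def sample(self):
        # Epsilon-greedy sampling: mostly exploit the currently preferred arm,
        # occasionally explore a random one.
        if random.random() < self.epsilon:
            return random.randrange(len(self.preferences))
        return max(range(len(self.preferences)), key=lambda a: self.preferences[a])

    def update(self, arm, reward):
        # Incremental running-mean update of the sampled arm's preference.
        self.counts[arm] += 1
        self.preferences[arm] += (reward - self.preferences[arm]) / self.counts[arm]

def search(num_layers, evaluate_subnet, num_samples=1000):
    """Sample sub-networks, reward the sampled arms, return the preferred op per layer."""
    bandits = [LayerBandit(len(CANDIDATES)) for _ in range(num_layers)]
    for _ in range(num_samples):
        arms = [b.sample() for b in bandits]                     # one op per layer
        reward = evaluate_subnet([CANDIDATES[a] for a in arms])  # e.g., validation accuracy
        for bandit, arm in zip(bandits, arms):
            bandit.update(arm, reward)
    return [CANDIDATES[max(range(len(CANDIDATES)), key=lambda a: b.preferences[a])]
            for b in bandits]

In this sketch, the reward would come from evaluating the sampled sub-network with weights shared in the one-shot supernet, and the final architecture is assembled from the highest-preference arm of each layer.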

Supplemental Material

3394171.3413998.mp4 (MP4, 43.9 MB)


Published in

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Copyright © 2020 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
