ABSTRACT
In this paper, we propose a cross-modal network architecture search (NAS) algorithm for VQA, termed k-Armed Bandit based NAS (KAB-NAS). KAB-NAS regards the design of each layer as a k-armed bandit problem and updates the preference of each candidate through repeated sampling in a single-shot search framework. To establish an effective search space, we further propose a new architecture termed Automatic Graph Attention Network (AGAN), and extend the popular self-attention layer with three graph structures, denoted as dense-graph, co-graph and separate-graph. These graph layers define the direction of information propagation in the graph network, and their optimal combination is searched by KAB-NAS. To evaluate KAB-NAS and AGAN, we conduct extensive experiments on two VQA benchmark datasets, i.e., VQA2.0 and GQA, and also test AGAN with the popular BERT-style pre-training. The experimental results show that, with the help of KAB-NAS, AGAN achieves state-of-the-art performance on both benchmark datasets with far fewer parameters and computations.
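To make the per-layer bandit formulation concrete, the following is a minimal, hypothetical sketch of the idea described above: each searchable layer keeps a preference value for every candidate operation, full architectures are sampled repeatedly, and preferences are updated from the observed reward. The epsilon-greedy sampling, the sample-average update rule, and the `evaluate` placeholder (standing in for the validation score of the sampled sub-network in the one-shot model) are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of a per-layer k-armed bandit search, not the authors' exact algorithm.
import random

class LayerBandit:
    """One k-armed bandit: one arm per candidate operation of a layer."""
    def __init__(self, num_candidates, epsilon=0.1):
        self.counts = [0] * num_candidates      # times each arm was pulled
        self.values = [0.0] * num_candidates    # running mean reward per arm
        self.epsilon = epsilon

    def sample(self):
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # incremental sample-average update of the arm's preference
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def evaluate(architecture):
    # placeholder: would return the validation score of the sampled sub-network
    return random.random()

# one bandit per searchable layer, e.g. 6 layers with 3 graph-structure candidates
bandits = [LayerBandit(num_candidates=3) for _ in range(6)]
for step in range(1000):
    arch = [b.sample() for b in bandits]   # sample one architecture
    reward = evaluate(arch)                # proxy reward for this sample
    for bandit, arm in zip(bandits, arch):
        bandit.update(arm, reward)

best = [max(range(3), key=lambda a: b.values[a]) for b in bandits]
print("preferred candidate per layer:", best)
```

After enough samples, each layer's highest-preference arm is taken as its chosen graph structure, which mirrors how a single-shot framework derives the final architecture from accumulated preferences.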