
Double Attention Based on Graph Attention Network for Image Multi-Label Classification

Published: 05 January 2023

Abstract

The task of image multi-label classification is to accurately recognize the multiple objects present in an input image. Most recent works rely on a label co-occurrence matrix counted from the training data to construct the graph structure, which is inflexible and may degrade model generalizability. In addition, these methods fail to capture the semantic correlation between channel feature maps, which could further improve model performance. To address these issues, we propose DA-GAT (a Double Attention framework based on the Graph Attention neTwork) to effectively learn the correlations between labels from training data. First, we devise a new channel attention mechanism that enhances the semantic correlation between channel feature maps, thereby implicitly capturing the correlations between labels. Second, we propose a new label attention mechanism that avoids the adverse impact of a manually constructed label co-occurrence matrix: it takes only the label embeddings as input and automatically constructs a label relation matrix that explicitly establishes the correlations between labels. Finally, we fuse the outputs of the two attention mechanisms to further improve model performance. Extensive experiments are conducted on three public multi-label classification benchmarks. Our DA-GAT model achieves mean average precision of 87.1%, 96.6%, and 64.3% on MS-COCO 2014, PASCAL VOC 2007, and NUS-WIDE, respectively, clearly outperforming existing state-of-the-art methods. In addition, visual analysis experiments demonstrate that each attention mechanism captures the correlations between labels well and significantly improves model performance.
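The abstract gives no implementation details, so the following is only a rough NumPy sketch of the two mechanisms it describes: a channel attention that re-weights feature maps by their pairwise affinity, and a GAT-style label attention that builds a relation matrix from label embeddings alone, without a pre-counted co-occurrence matrix. All function names, shapes, and the simplified additive scoring are our assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(F):
    """F: (C, H*W) flattened channel feature maps.
    Builds a (C, C) channel affinity and re-weights the channels with it,
    implicitly relating the semantics carried by different channels."""
    A = softmax(F @ F.T, axis=-1)   # (C, C) channel relation matrix
    return A @ F                    # (C, H*W) re-weighted features

def label_attention(E, W, a):
    """E: (L, d) label embeddings; W: (d, d') projection; a: (d',) scoring vector.
    A simplified GAT-style layer: the label relation matrix is learned from the
    embeddings themselves rather than taken from co-occurrence statistics."""
    H = E @ W                                   # (L, d') projected labels
    s = H @ a                                   # (L,) per-label scores
    R = softmax(s[None, :] + s[:, None], -1)    # (L, L) label relation matrix
    return R @ H                                # (L, d') relation-aware labels

# Toy fusion of the two branches, mirroring the abstract's final step.
rng = np.random.default_rng(0)
F = rng.standard_normal((8, 16))                        # 8 channels, 4x4 maps
E, W, a = (rng.standard_normal(s) for s in ((5, 4), (4, 4), (4,)))
fused = channel_attention(F).mean() + label_attention(E, W, a).mean()
```

Each softmax row sums to one, so both relation matrices act as convex re-weightings of the input features; the real model would learn W and a by backpropagation.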



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1
  January 2023, 505 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3572858
  • Editor: Abdulmotaleb El Saddik

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 January 2023
        • Online AM: 12 March 2022
        • Accepted: 14 February 2022
        • Revised: 29 January 2022
        • Received: 4 September 2021
Published in TOMM Volume 19, Issue 1


        Qualifiers

        • research-article
        • Refereed
