research-article

Graph Attention Transformer Network for Multi-label Image Classification

Published: 27 February 2023

Abstract

Multi-label classification aims to recognize multiple objects or attributes in an image. The key to solving this problem lies in effectively characterizing inter-label correlations or dependencies, which has motivated the widespread use of graph neural networks. However, current methods typically build the adjacency matrix from label co-occurrence probabilities computed on the training set, which ties the modeled correlations to a particular dataset and limits the model's generalization ability. This article proposes the Graph Attention Transformer Network, a general framework for multi-label image classification that mines rich and effective label correlations. First, we use the cosine similarity of pre-trained label word embeddings as the initial correlation matrix, which captures richer semantic information than a co-occurrence matrix. We then propose a graph attention transformer layer that adapts this adjacency matrix to the current domain. Extensive experiments demonstrate that the proposed method achieves highly competitive performance on three datasets.
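As a rough illustration of the first step described above (a sketch, not the authors' released code), the initial label-correlation matrix can be computed as the pairwise cosine similarity of pre-trained label word embeddings. The embedding source (e.g., GloVe vectors) and the function name below are illustrative assumptions:

```python
import numpy as np

def cosine_similarity_adjacency(label_embeddings: np.ndarray) -> np.ndarray:
    """Build an initial label-correlation matrix from word embeddings.

    Each row of `label_embeddings` is the pre-trained word vector
    (e.g., a GloVe embedding) for one class label; entry (i, j) of the
    result is the cosine similarity between labels i and j.
    """
    norms = np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    unit = label_embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Toy usage: 4 hypothetical labels with 8-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))
A = cosine_similarity_adjacency(emb)  # (4, 4), symmetric, entries in [-1, 1]
```

In the framework described in the abstract, a matrix like `A` would then serve as the initial adjacency that the graph attention transformer layer adapts to the target domain; the specifics of that adaptation are not shown here.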



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 4
  July 2023, 263 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3582888
  • Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 February 2023
      • Online AM: 29 December 2022
      • Accepted: 18 December 2022
      • Revised: 31 October 2022
      • Received: 17 May 2022
