Abstract
Multi-label classification aims to recognize multiple objects or attributes in an image. The key to this task is effectively characterizing inter-label correlations or dependencies, which has made graph neural networks a prevailing approach. However, current methods often use the co-occurrence probability of labels in the training set as the adjacency matrix to model these correlations; this choice is tightly bound to the dataset and limits the model’s generalization ability. This article proposes the Graph Attention Transformer Network, a general framework for multi-label image classification that mines rich and effective label correlations. First, we use the cosine similarity of pre-trained label word embeddings as the initial correlation matrix, which can represent richer semantic information than the co-occurrence matrix. Subsequently, we propose a graph attention transformer layer to transfer this adjacency matrix so that it adapts to the current domain. Extensive experiments demonstrate that the proposed method achieves highly competitive performance on three datasets.
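To make the contrast between the two adjacency choices concrete, the following is a minimal sketch, not the authors' implementation. It assumes toy three-dimensional vectors in place of real pre-trained word embeddings and a three-image toy training set; the function names `cosine_similarity_matrix` and `cooccurrence_matrix` are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between label embedding vectors."""
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in embeddings]
    n = len(embeddings)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            matrix[i][j] = dot / (norms[i] * norms[j])
    return matrix


def cooccurrence_matrix(label_sets, num_labels):
    """Estimate P(label j present | label i present) from training annotations."""
    counts = [[0] * num_labels for _ in range(num_labels)]
    occurrences = [0] * num_labels
    for labels in label_sets:
        for i in labels:
            occurrences[i] += 1
            for j in labels:
                if i != j:
                    counts[i][j] += 1
    return [
        [counts[i][j] / occurrences[i] if occurrences[i] else 0.0
         for j in range(num_labels)]
        for i in range(num_labels)
    ]


# Assumption: made-up vectors standing in for pre-trained word embeddings
# of three labels (say, "person", "bicycle", "dog").
label_embeddings = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
semantic_adj = cosine_similarity_matrix(label_embeddings)

# Assumption: toy training annotations, one set of label indices per image.
train_labels = [{0, 1}, {0}, {2}]
cooc_adj = cooccurrence_matrix(train_labels, 3)
```

In this sketch the cosine-based matrix is symmetric and depends only on the embeddings, whereas the co-occurrence matrix is asymmetric and changes whenever the training annotations change, which is the dataset dependence the abstract describes.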
Graph Attention Transformer Network for Multi-label Image Classification