Abstract
The image multi-label classification task aims to correctly predict the multiple object categories present in an image. To capture the correlation between labels, methods based on graph convolutional networks must manually count label co-occurrence probabilities from the training data to construct a pre-defined graph as the input of the graph network, which is inflexible and may degrade model generalizability. Moreover, most current methods cannot effectively align the learned salient object features with the label concepts, so the model's predictions may not be consistent with the image content. Therefore, learning the salient semantic features of images, capturing the correlation between labels, and then effectively aligning the two is one of the keys to improving the performance of image multi-label classification. To this end, we propose a novel image multi-label classification framework that aims to align Image Semantics with Label Concepts (ISLC). Specifically, we propose a residual encoder to learn salient object features in the images, and exploit the self-attention layer in an aligned decoder to automatically capture the correlation between labels. Then, we leverage the cross-attention layers in the aligned decoder to align image semantic features with label concepts, so that the labels predicted by the model are more consistent with the image content. Finally, the output features of the last layers of the residual encoder and the aligned decoder are fused to obtain the final output feature for classification. The proposed ISLC model achieves good performance on prevalent multi-label image datasets: 87.2% on MS-COCO 2014, 96.9% on PASCAL VOC 2007, 39.4% on VG-500, and 64.2% on NUS-WIDE.
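The decoder pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes single-head attention without learned projections, random stand-ins for the residual encoder's output tokens and for the label embeddings, and fusion by adding the mean-pooled encoder output to each label feature; the abstract does not specify the fusion operator. All names (`attention`, `islc_style_head`, `label_queries`, `classifier_weights`) are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights


def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def islc_style_head(image_tokens, label_queries, classifier_weights):
    # 1) Self-attention among label queries: label correlations come from the
    #    embeddings themselves, not from a pre-computed co-occurrence graph.
    mixed, _ = attention(label_queries, label_queries, label_queries)
    labels = add(label_queries, mixed)
    # 2) Cross-attention: each label query attends to the image tokens.
    attended, cross_weights = attention(labels, image_tokens, image_tokens)
    aligned = add(labels, attended)
    # 3) Fusion (assumed operator): add the mean-pooled encoder output.
    dim = len(image_tokens[0])
    pooled = [sum(t[j] for t in image_tokens) / len(image_tokens)
              for j in range(dim)]
    fused = [[x + p for x, p in zip(row, pooled)] for row in aligned]
    # Independent sigmoid per label: several labels can be active at once.
    scores = [1.0 / (1.0 + math.exp(-sum(f * w for f, w in zip(row, wrow))))
              for row, wrow in zip(fused, classifier_weights)]
    return scores, cross_weights


rng = random.Random(0)
num_tokens, num_labels, dim = 6, 3, 4
# Random stand-ins for encoder tokens, label embeddings, and classifier weights.
image_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
label_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_labels)]
classifier_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_labels)]

scores, cross_weights = islc_style_head(image_tokens, label_queries, classifier_weights)
print([round(s, 3) for s in scores])
```

Each row of `cross_weights` is a distribution over image tokens for one label, which is the quantity one would inspect to check whether a label attends to the image regions that support it.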
Index Terms
Aligning Image Semantics and Label Concepts for Image Multi-Label Classification