Abstract
Image multi-label classification aims to predict all the object categories present in an image. Some recent works exploit graph convolutional networks to capture the correlation between labels. Although promising results have been reported, these methods cannot learn salient object features in images and ignore the correlation between channel feature maps. In addition, existing studies learn feature information only within an individual input image and fail to mine the contextual information of various categories from the dataset to enhance the input feature representation. To address these issues, we propose an Attention-Augmented Memory Network (AAMN) model for the image multi-label classification task. First, we propose a novel categorical memory module that mines the contextual information of various categories from the dataset to augment the current input feature. Second, we design a new channel-relation exploration module to capture the inter-channel relationships of features, so as to enhance the correlation between objects in the images. Third, we develop a spatial-relation enhancement module to model second-order statistics of features and capture long-range dependencies between pixels in feature maps, so as to learn salient object features. Experimental results on standard benchmarks, including MS-COCO 2014, PASCAL VOC 2007, and VG-500, demonstrate the effectiveness of the AAMN model, which outperforms current state-of-the-art methods.
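The abstract names three kinds of operation without giving formulas: a read from a per-category memory, a relation across channels, and a relation across spatial positions. The NumPy sketch below illustrates generic versions of these ideas as they commonly appear in the attention literature. It is not the AAMN implementation: the function names, the residual connections, the scaled dot-product scoring, and the tensor shapes are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(feat, memory):
    # feat: (N, C) features at N spatial positions.
    # memory: (K, C) one hypothetical slot per category.
    # Each position attends over the K slots and adds the retrieved context.
    attn = softmax(feat @ memory.T / np.sqrt(feat.shape[1]), axis=1)  # (N, K)
    return feat + attn @ memory

def channel_relation(feat):
    # The C x C Gram matrix is a second-order statistic relating channels.
    affinity = softmax(feat.T @ feat, axis=1)  # (C, C), rows sum to 1
    # Output channel i is a weighted mix of all channels j, plus a residual.
    return feat + feat @ affinity.T

def spatial_relation(feat):
    # N x N affinity lets every position aggregate from every other one,
    # which is one way to capture long-range dependencies.
    affinity = softmax(feat @ feat.T / np.sqrt(feat.shape[1]), axis=1)  # (N, N)
    return feat + affinity @ feat

# Toy input: a 7x7 feature map flattened to 49 positions with 64 channels,
# and a memory with 20 category slots (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
feat = rng.standard_normal((49, 64))
memory = rng.standard_normal((20, 64))

out = spatial_relation(channel_relation(memory_read(feat, memory)))
print(out.shape)  # (49, 64)
```

Each step preserves the `(N, C)` shape, so the three operations can be chained in any order; the order used here is arbitrary and is not taken from the paper.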
Index Terms
Attention-Augmented Memory Network for Image Multi-Label Classification