Abstract
Fine-grained visual categorization (FGVC) aims to automatically recognize objects from different subordinate categories. Despite attracting considerable attention from both academia and industry, it remains a challenging task because the visual differences among classes are subtle. Cross-layer feature aggregation and cross-image pairwise learning have become prevalent ways to improve FGVC performance by extracting discriminative class-specific features. However, existing aggregation methods rely on simple aggregation strategies and so do not fully exploit cross-layer information, while existing pairwise learning methods fail to explore long-range interactions between different images. To address these problems, we propose a novel Alignment Enhancement Network (AENet) that performs alignment at two levels: Cross-layer Alignment (CLA) and Cross-image Alignment (CIA). The CLA module exploits the relationship between low-level spatial information and high-level semantic information, which supports cross-layer feature aggregation and improves the feature representation of input images. The CIA module then produces an aligned feature map that enhances relevant information and suppresses irrelevant information across the whole spatial region. Our method rests on the assumption that the aligned feature map should be closer to the inputs of CIA when they belong to the same category. Accordingly, we introduce a Semantic Affinity Loss to supervise the feature alignment within each CIA block. Experimental results on four challenging datasets show that the proposed AENet achieves state-of-the-art results compared with prior work.
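The abstract does not give the CIA formulation, so the following is only a minimal sketch of the general idea of aligning one image's features to another across all spatial positions. It assumes scaled dot-product attention as the alignment mechanism and a margin-based distance as a stand-in for the Semantic Affinity Loss; the function names (`cross_image_align`, `affinity_loss`), the `margin` parameter, and the treatment of different-class pairs are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_image_align(query_feats, ref_feats):
    """Align ref_feats to query_feats (assumed mechanism: dot-product attention).

    Each argument is a list of feature vectors, one per spatial position.
    Every query position attends over all reference positions, so the
    interaction is long-range rather than limited to matching locations.
    """
    scale = math.sqrt(len(query_feats[0]))
    aligned = []
    for q in query_feats:
        weights = softmax([dot(q, r) / scale for r in ref_feats])
        aligned.append(
            [sum(w * r[c] for w, r in zip(weights, ref_feats))
             for c in range(len(q))]
        )
    return aligned


def mean_sq_distance(a, b):
    """Mean squared difference between two feature maps of equal shape."""
    count = len(a) * len(a[0])
    return sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v)) / count


def affinity_loss(query_feats, aligned, same_class, margin=1.0):
    """Hypothetical stand-in for the Semantic Affinity Loss.

    Same-class pairs are pulled together (the assumption stated in the
    abstract). Pushing different-class pairs apart up to a margin is an
    added assumption of this sketch.
    """
    distance = mean_sq_distance(query_feats, aligned)
    return distance if same_class else max(0.0, margin - distance)


# Two toy "images", each with two spatial positions and two channels.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.9, 0.1], [0.1, 0.9]]
aligned = cross_image_align(feats_a, feats_b)
loss_same = affinity_loss(feats_a, aligned, same_class=True)
```

In this toy case each aligned vector is a convex combination of the reference vectors, weighted toward the reference position most similar to the query position. A practical version would operate on batched tensors with learned projections.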
Alignment Enhancement Network for Fine-grained Visual Categorization