Abstract
Automatic product image classification is a task of crucial importance for the management of online retailers. Motivated by recent advances of deep Convolutional Neural Networks (CNNs) in image classification, this work revisits the problem for product images that come with a predefined categorical hierarchy and attributes, aiming to leverage both to improve classification accuracy. With these structure-aware clues, we argue that more advanced deep models can be developed beyond the flat one-versus-all classification performed by conventional CNNs. To this end, the contributions of this work include a salient-sensitive CNN that focuses on the product foreground by inserting a dedicated spatial attention module; a multiclass regression-based refinement that aims to predict more accurately by merging prediction scores from multiple preceding CNNs, each corresponding to a distinct classifier in the hierarchy; and a multitask deep learning architecture that explores correlations among categories and attributes for categorical label prediction. Experimental results on nearly 1 million real-world product images validate the effectiveness of the proposed components both individually and jointly, with performance gains observed.
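To make the regression-based refinement more concrete, the following is a minimal sketch of how prediction scores from several classifiers in a hierarchy could be merged by a single softmax-regression layer. It is an illustration under stated assumptions, not the architecture used in this work: the function names (`softmax`, `merge_scores`), the two-level toy hierarchy, the score values, and the hand-set weights are all hypothetical, and in practice the weights would be learned from data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_scores(score_vectors, weights, bias):
    """Concatenate per-classifier scores and apply one softmax-regression layer.

    score_vectors: list of score lists, one per preceding classifier.
    weights: one row per output category, as long as the concatenated scores.
    bias: one value per output category.
    """
    features = [s for vec in score_vectors for s in vec]
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy hierarchy (made up): 2 coarse categories, 3 leaf categories.
# Leaves 0 and 2 belong to coarse category 0; leaf 1 belongs to coarse category 1.
coarse_scores = [0.9, 0.1]
leaf_scores = [0.40, 0.45, 0.15]

# Hand-set weights for illustration only; each leaf looks at its own score
# and at the score of its parent coarse category.
weights = [
    [2.0, 0.0, 4.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 4.0, 0.0],
    [2.0, 0.0, 0.0, 0.0, 4.0],
]
bias = [0.0, 0.0, 0.0]

refined = merge_scores([coarse_scores, leaf_scores], weights, bias)
leaf_only_best = max(range(len(leaf_scores)), key=leaf_scores.__getitem__)
refined_best = max(range(len(refined)), key=refined.__getitem__)
```

In this toy case the leaf classifier alone slightly prefers leaf 1, while the merged prediction prefers leaf 0 because the coarse classifier strongly supports its parent category. This is the kind of correction a refinement over hierarchical scores is meant to enable.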
Structure-Aware Deep Learning for Product Image Classification