Abstract
With the rapid growth of online commerce and fashion-related applications, visual clothing analysis and recognition have become an active research topic in computer vision. In this paper, we propose AABLSTM, a novel network based on a deep CNN-RNN architecture, to jointly address three visual fashion analysis tasks: clothing category classification, attribute detection, and landmark localization. The model is driven by a multi-task mechanism with three components. First, a bidirectional LSTM (Bi-LSTM) branch mines the semantic associations between related attributes to improve the precision of clothing category classification and attribute detection. Second, an hourglass-style sub-network with “down-up sampling” is constructed to boost the accuracy of fashion landmark localization. Finally, a specially designed multi-loss function is used to better optimize network training. Extensive experiments on large-scale fashion datasets demonstrate the superior performance of our approach.
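To make the idea of a multi-loss objective concrete, the following is a minimal sketch of how per-task losses for the three tasks could be combined into one training objective. The abstract does not specify the individual loss terms or their weights, so everything here is an assumption made for illustration: the choice of softmax cross-entropy for categories, binary cross-entropy for attributes, mean squared error for landmarks, and the names `multi_task_loss`, `w_cat`, `w_attr`, and `w_lm` are all hypothetical and are not taken from the paper.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for single-label category classification (assumed loss)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy for multi-label attribute detection (assumed loss)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def landmark_mse(pred, target):
    """Mean squared distance over (x, y) landmark coordinates (assumed loss)."""
    sq = [(px - tx) ** 2 + (py - ty) ** 2
          for (px, py), (tx, ty) in zip(pred, target)]
    return sum(sq) / len(sq)


def multi_task_loss(cat_logits, cat_target, attr_probs, attr_labels,
                    lm_pred, lm_target, w_cat=1.0, w_attr=1.0, w_lm=1.0):
    """Weighted sum of the three task losses; the weights are hypothetical."""
    return (w_cat * softmax_cross_entropy(cat_logits, cat_target)
            + w_attr * binary_cross_entropy(attr_probs, attr_labels)
            + w_lm * landmark_mse(lm_pred, lm_target))


# Toy inputs: 2 categories, 2 attributes, 1 landmark.
loss = multi_task_loss(
    cat_logits=[0.0, 0.0], cat_target=0,
    attr_probs=[0.5, 0.5], attr_labels=[1, 0],
    lm_pred=[(0.0, 0.0)], lm_target=[(3.0, 4.0)],
    w_lm=0.1,
)
print(loss)
```

A weighted sum is only one common way to combine task losses. It keeps each task's contribution explicit, and the weights give a simple knob for balancing classification against localization during training.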
AABLSTM: A Novel Multi-task Based CNN-RNN Deep Model for Fashion Analysis