research-article

AABLSTM: A Novel Multi-task Based CNN-RNN Deep Model for Fashion Analysis

Published: 05 January 2023

Abstract

With the rapid growth of online commerce and fashion-related applications, visual clothing analysis and recognition have become an active research topic in computer vision. In this paper, we propose AABLSTM, a novel network based on a deep CNN-RNN, to address three visual fashion analysis tasks: clothing category classification, attribute detection, and landmark localization. The model is driven by a multi-task mechanism with three components. First, a bidirectional LSTM (Bi-LSTM) branch mines the semantic associations between related attributes to improve the precision of clothing category classification and attribute detection. Second, an hourglass-style "down-up sampling" sub-network boosts the accuracy of fashion landmark localization. Finally, a specially designed multi-loss function better optimizes network training. Extensive experiments on large-scale fashion datasets demonstrate the superior performance of our approach.
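
The abstract does not state the form of the multi-loss function. As a minimal sketch only, assuming a weighted sum of one term per task (softmax cross-entropy for the category, binary cross-entropy for the attributes, mean squared error for the landmarks), the three objectives could be combined as below. The function names, the choice of loss terms, and the weights are hypothetical illustrations and are not taken from the paper.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of a single-label category prediction (assumed term)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy over multi-label attributes (assumed term)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def landmark_mse(pred, target):
    """Mean squared distance between predicted and true (x, y) landmarks."""
    squared = [
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(pred, target)
    ]
    return sum(squared) / len(squared)


def multi_task_loss(cat_logits, cat_target, attr_probs, attr_labels,
                    lm_pred, lm_target, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses; the weights are hypothetical."""
    w_cat, w_attr, w_lm = weights
    return (
        w_cat * softmax_cross_entropy(cat_logits, cat_target)
        + w_attr * binary_cross_entropy(attr_probs, attr_labels)
        + w_lm * landmark_mse(lm_pred, lm_target)
    )


if __name__ == "__main__":
    loss = multi_task_loss(
        cat_logits=[2.0, 0.5, -1.0], cat_target=0,
        attr_probs=[0.9, 0.2, 0.7], attr_labels=[1, 0, 1],
        lm_pred=[(0.30, 0.40), (0.60, 0.55)],
        lm_target=[(0.32, 0.41), (0.58, 0.50)],
    )
    print(f"combined loss: {loss:.4f}")
```

In a sketch like this, the weights control how strongly each task steers the shared backbone, so they would normally be tuned on validation data rather than fixed at 1.0.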

    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1
      January 2023
      505 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3572858
      • Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 5 January 2023
      • Online AM: 12 March 2022
      • Accepted: 14 February 2022
      • Revised: 28 December 2021
      • Received: 15 June 2021

      Qualifiers

      • research-article
      • Refereed