An End-to-End Attention-Based Neural Model for Complementary Clothing Matching

Published: 16 December 2019

Abstract

In modern society, people tend to prefer fashionable and decent outfits that meet more than basic physiological needs. A proper outfit usually relies on good matching among the complementary fashion items (e.g., the top, bottom, and shoes) that compose it, which motivates us to investigate automatic complementary clothing matching. This task is non-trivial due to the following challenges. First, the main challenge lies in how to accurately model the compatibility between complementary fashion items (e.g., a top and a bottom) that come from heterogeneous spaces with multiple modalities (e.g., the visual and textual modalities). Second, since different features (e.g., color, style, and pattern) of fashion items may contribute differently to compatibility, how to encode the confidence of different pairwise features presents a tough challenge. Third, how to jointly learn the latent representation of the multi-modal data and the compatibility between complementary fashion items constitutes the last challenge. Toward this end, in this work, we present an end-to-end attention-based neural framework for compatibility modeling, in which a feature-level attention model adaptively learns the confidence for different pairwise features. Extensive experiments on a publicly available real-world dataset show the superiority of our model over state-of-the-art methods.
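To make the abstract's idea concrete, here is a minimal NumPy sketch of feature-level attention for scoring top-bottom compatibility. This is not the authors' implementation: the dimensions, the tanh-MLP attention parameterization, and all names are illustrative assumptions. Each of the `k` feature groups (e.g., visual and textual embeddings projected to a shared space) produces one pairwise interaction, and a learned attention weight encodes the confidence of that pairwise feature before the weighted scores are pooled into a single compatibility value.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def compatibility(top_feats, bottom_feats, W, v):
    """Score one (top, bottom) pair with feature-level attention.

    top_feats, bottom_feats: (k, d) arrays, one row per feature group
    (e.g., color, style, pattern cues from visual/textual modalities).
    W: (2d, h) and v: (h,) parameterize a small attention MLP.
    All shapes are illustrative assumptions, not the paper's settings.
    """
    # Concatenate each top/bottom feature pair: (k, 2d).
    pair = np.concatenate([top_feats, bottom_feats], axis=1)
    # Attention confidence per pairwise feature: (k,).
    alpha = softmax(np.tanh(pair @ W) @ v)
    # Per-group interaction via inner products: (k,).
    inner = (top_feats * bottom_feats).sum(axis=1)
    # Compatibility = attention-weighted sum of interactions.
    return float(alpha @ inner)


# Toy example with random features standing in for learned embeddings.
k, d, h = 4, 8, 16
top = rng.standard_normal((k, d))
bottom = rng.standard_normal((k, d))
W = rng.standard_normal((2 * d, h)) * 0.1
v = rng.standard_normal(h)
print(compatibility(top, bottom, W, v))
```

In an end-to-end model, `W`, `v`, and the projections producing `top_feats` and `bottom_feats` would all be trained jointly, typically with a pairwise ranking loss over compatible versus incompatible outfits.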



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 4
November 2019
322 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3376119

Copyright © 2019 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 16 December 2019
• Accepted: 1 September 2019
• Revised: 1 March 2019
• Received: 1 November 2018

        Qualifiers

        • research-article
        • Research
        • Refereed
