Abstract
Features extracted by deep networks are widely used in visual search tasks. This article studies deep network structures and training schemes for mobile visual search. The goal is to learn an effective yet portable feature representation that bridges the domain gap between mobile user photos and (mostly) professionally taken product images while keeping the computational cost acceptable for mobile applications. The technical contributions are twofold. First, we propose an alternative to the contrastive loss commonly used for training deep Siamese networks, namely the robust contrastive loss, which relaxes the penalty on some positive and negative pairs to alleviate overfitting. Second, we train the network with a simple multitask fine-tuning scheme that not only uses the provided training photo pairs but also draws on additional information from the large ImageNet dataset to regularize fine-tuning. Extensive experiments on challenging real-world datasets demonstrate that both the robust contrastive loss and the multitask fine-tuning scheme are effective, leading to very promising results at a time cost suitable for mobile product search scenarios.
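The abstract does not give the form of the robust contrastive loss, so the following is only a minimal sketch of the general idea under stated assumptions. It contrasts the standard pairwise contrastive loss with a hypothetical relaxed variant in which matching pairs that are already closer than a small margin incur no penalty. The function names, the margins `pos_margin` and `neg_margin`, and the weighting in `multitask_loss` are all illustrative assumptions, not the authors' definitions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(d, is_match, margin=1.0):
    """Standard contrastive loss for one pair at feature distance d.

    Matching pairs are always pulled together; non-matching pairs are
    pushed apart only while they are closer than `margin`.
    """
    if is_match:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def relaxed_contrastive_loss(d, is_match, pos_margin=0.2, neg_margin=1.0):
    """Hypothetical relaxed variant (an assumption, not the paper's formula).

    Matching pairs closer than `pos_margin` and non-matching pairs farther
    than `neg_margin` contribute zero loss, so the network is not forced
    to collapse every positive pair to distance zero.
    """
    if is_match:
        return 0.5 * max(0.0, d - pos_margin) ** 2
    return 0.5 * max(0.0, neg_margin - d) ** 2


def multitask_loss(pair_loss, classification_loss, weight=0.5):
    """Assumed multitask objective: pair loss plus a weighted auxiliary
    classification loss (for example, one computed on ImageNet samples)."""
    return pair_loss + weight * classification_loss


if __name__ == "__main__":
    user_photo = [0.1, 0.2]      # toy feature of a mobile photo
    product_image = [0.1, 0.3]   # toy feature of the matching product image
    d = euclidean(user_photo, product_image)
    print("standard:", contrastive_loss(d, True))
    print("relaxed: ", relaxed_contrastive_loss(d, True))
```

In this toy pair the standard loss still pulls the two features together, while the relaxed variant treats them as close enough and contributes nothing, which is one plausible way that relaxing the penalty could reduce overfitting to noisy pairs.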
DeepProduct: Mobile Product Search With Portable Deep Features