Abstract
One fundamental problem in image search is to learn the ranking function (i.e., the similarity between query and image). Recent progress on this topic has evolved through two paradigms: the text-based model and image ranker learning. The former relies on the text surrounding an image, making the similarity sensitive to the quality of textual descriptions. The latter may suffer from a robustness problem when human-labeled query-image pairs cannot represent user search intent precisely. We demonstrate in this article that the preceding two limitations can be well mitigated by learning a cross-view embedding that leverages click data. Specifically, a novel click-based Deep Structure-Preserving Embeddings with visual Attention (DSPEA) model is presented, which consists of two components: deep convolutional neural networks followed by image embedding layers for learning the visual embedding, and a deep neural network for generating the query semantic embedding. Meanwhile, visual attention is incorporated at the top of the convolutional neural network to reflect the regions of the image that are relevant to the query. Furthermore, considering the high dimension of the query space, a new click-based representation on a query set is proposed for alleviating this sparsity problem. The whole network is trained end-to-end by optimizing a large margin objective that combines cross-view ranking constraints with in-view neighborhood structure preservation constraints. On a large-scale click-based image dataset with 11.7 million queries and 1 million images, our model is shown to be powerful for keyword-based image search with superior performance over several state-of-the-art methods and achieves, to date, the best reported NDCG@25 of 52.21%.
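As a rough illustration of the large margin objective described above, the following sketch combines a cross-view ranking term (a query should be closer to a clicked image than to an unclicked one) with an in-view structure term (an image should stay closer to a neighbor in its own view than to a non-neighbor). This is not the authors' formulation: the triplet form, the squared Euclidean distance, the `margin` and `structure_weight` values, and all function names are assumptions made only for this example, and the embeddings are hand-written vectors rather than network outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hinge(pos_dist, neg_dist, margin):
    """Large margin term: zero once the negative is farther than the
    positive by at least `margin`, positive otherwise."""
    return max(0.0, margin + pos_dist - neg_dist)


def cross_view_ranking_loss(query, clicked_img, unclicked_img, margin=0.1):
    """Cross-view constraint: query closer to clicked than to unclicked image."""
    return hinge(sq_dist(query, clicked_img),
                 sq_dist(query, unclicked_img), margin)


def in_view_structure_loss(anchor, neighbor, non_neighbor, margin=0.1):
    """In-view constraint: anchor closer to its neighbor than to a non-neighbor,
    where all three embeddings come from the same view."""
    return hinge(sq_dist(anchor, neighbor),
                 sq_dist(anchor, non_neighbor), margin)


def combined_loss(query, clicked_img, unclicked_img,
                  img_neighbor, img_non_neighbor,
                  margin=0.1, structure_weight=0.5):
    """Weighted sum of the two terms (the weighting scheme is an assumption)."""
    rank = cross_view_ranking_loss(query, clicked_img, unclicked_img, margin)
    struct = in_view_structure_loss(clicked_img, img_neighbor,
                                    img_non_neighbor, margin)
    return rank + structure_weight * struct


# Toy embeddings in a shared 2-D space (hypothetical values).
q = l2_normalize([1.0, 0.0])
clicked = l2_normalize([0.9, 0.1])
unclicked = l2_normalize([0.0, 1.0])

satisfied = cross_view_ranking_loss(q, clicked, unclicked)  # constraint holds
violated = cross_view_ranking_loss(q, unclicked, clicked)   # roles swapped
total = combined_loss(q, clicked, unclicked, clicked, unclicked)
```

In this toy setup the first triplet already satisfies the margin, so its loss is zero, while swapping the clicked and unclicked images produces a positive penalty. In an end-to-end model, a gradient-based optimizer would use such penalties to update both embedding networks.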
Index Terms
Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention