research-article

Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention

Published: 08 August 2019

Abstract

A fundamental problem in image search is learning the ranking function, that is, the similarity between a query and an image. Recent progress on this topic has followed two paradigms: text-based models and image ranker learning. The former relies on the text surrounding an image, which makes the similarity sensitive to the quality of the textual descriptions. The latter may lack robustness when human-labeled query-image pairs do not represent user search intent precisely. We demonstrate in this article that both limitations can be mitigated by learning a cross-view embedding that leverages click data. Specifically, we present a novel click-based Deep Structure-Preserving Embeddings with visual Attention (DSPEA) model, which consists of two components: deep convolutional neural networks followed by image embedding layers for learning the visual embedding, and a deep neural network for generating the query semantic embedding. Visual attention is incorporated at the top of the convolutional neural network to highlight the regions of the image that are relevant to the query. Furthermore, because the query space is high-dimensional and therefore sparse, a new click-based representation on a query set is proposed to alleviate this sparsity problem. The whole network is trained end to end by optimizing a large-margin objective that combines cross-view ranking constraints with in-view neighborhood structure preservation constraints. On a large-scale click-based image dataset with 11.7 million queries and 1 million images, our model outperforms several state-of-the-art methods on keyword-based image search and achieves, to date, the best reported NDCG@25 of 52.21%.
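The abstract names the ingredients of the training objective but not its exact form. As a rough illustration only, the sketch below shows one common way to combine a cross-view ranking constraint with in-view structure preservation constraints using triplet hinge losses. The squared Euclidean distance, the margin, the weights `lam_q` and `lam_v`, and the function names `triplet_hinge` and `combined_loss` are all assumptions made for this example, not details taken from the article.

```python
import numpy as np

def triplet_hinge(anchor, positive, negative, margin):
    """Per-row hinge: max(0, margin + d(anchor, positive) - d(anchor, negative)).

    Uses squared Euclidean distance (an assumption for this sketch).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg)

def combined_loss(q, img_pos, img_neg, q_nbr, q_far, img_nbr, img_far,
                  margin=0.1, lam_q=0.1, lam_v=0.1):
    """Mean large-margin loss over a batch of embedded triplets.

    q                : query embeddings
    img_pos, img_neg : clicked and non-clicked image embeddings for q
    q_nbr, q_far     : a neighboring and a non-neighboring query of q
    img_nbr, img_far : a neighboring and a non-neighboring image of img_pos
    lam_q, lam_v     : hypothetical weights on the in-view terms
    """
    # Cross-view ranking: the clicked image should be closer to the query
    # than the non-clicked image by at least the margin.
    cross_view = triplet_hinge(q, img_pos, img_neg, margin)
    # In-view structure preservation in the query view.
    query_structure = triplet_hinge(q, q_nbr, q_far, margin)
    # In-view structure preservation in the image view.
    image_structure = triplet_hinge(img_pos, img_nbr, img_far, margin)
    total = cross_view + lam_q * query_structure + lam_v * image_structure
    return float(np.mean(total))
```

In this form, each term is zero once its constraint is satisfied with the required margin, so only violated constraints contribute gradient during training.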

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 3
August 2019
331 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3352586

Copyright © 2019 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 8 August 2019
• Accepted: 1 April 2019
• Revised: 1 March 2019
• Received: 1 September 2018

Qualifiers

• research-article
• Research
• Refereed
