research-article

Learning a Multi-Concept Video Retrieval Model with Multiple Latent Variables

Published: 25 April 2018

Abstract

Effective and efficient video retrieval has become a pressing need in the “big video” era. The objective of this work is to provide a principled model for computing the ranking scores of a video in response to one or more concepts, where the concepts may be supplied directly by users or inferred by the system from user queries. Handling multi-concept queries has become a central component of modern video retrieval systems that accept text queries. However, it has long been overlooked and is typically implemented as a simple weighted average of the corresponding concept detectors’ scores. Our approach, which can be viewed as a latent ranking SVM, integrates the advantages of several recent works in text and image retrieval, such as choosing ranking over structured prediction and modeling inter-dependencies between the queried concepts. Videos consist of shots, and we use latent variables to account for the mutually complementary cues within and across shots. Because concept labels of shots are scarce and noisy, we introduce a simple and effective technique that makes our model robust to outliers. Our approach gives superior performance not only on queries seen during training but also on novel queries, some of which consist of more concepts than the queries used for training.
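
To make the contrast in the abstract concrete, the following is a minimal sketch and not the authors' model. It compares the weighted-averaging baseline with a simplified latent-style scorer in which each queried concept selects its best-matching shot, and shows the pairwise hinge loss commonly used in ranking SVMs. All function names, the max-over-shots selection rule, and the toy scores are assumptions made for illustration only.

```python
import numpy as np


def _normalized_weights(num_concepts, weights=None):
    """Return per-concept weights that sum to 1 (uniform if none are given)."""
    if weights is None:
        return np.full(num_concepts, 1.0 / num_concepts)
    w = np.asarray(weights, dtype=float)
    return w / w.sum()


def weighted_average_score(shot_scores, query_concepts, weights=None):
    """Baseline: average each queried concept's detector score over all shots,
    then combine the concepts with a weighted average.

    shot_scores: array of shape (num_shots, num_concepts).
    query_concepts: indices of the concepts in the query.
    """
    scores = np.asarray(shot_scores, dtype=float)
    idx = list(query_concepts)
    per_concept = scores[:, idx].mean(axis=0)
    return float(per_concept @ _normalized_weights(len(idx), weights))


def latent_max_score(shot_scores, query_concepts, weights=None):
    """Illustrative latent-style variant (an assumption, not the paper's exact
    formulation): each queried concept picks the shot where it scores highest,
    so different concepts may be explained by different shots."""
    scores = np.asarray(shot_scores, dtype=float)
    idx = list(query_concepts)
    per_concept = scores[:, idx].max(axis=0)
    return float(per_concept @ _normalized_weights(len(idx), weights))


def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """Ranking hinge loss: zero once the relevant video outscores the
    irrelevant one by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))


# Toy video with 3 shots and 2 concepts; the two concepts appear in different shots.
video = [[0.9, 0.1],
         [0.1, 0.8],
         [0.0, 0.1]]
baseline = weighted_average_score(video, [0, 1])
latent = latent_max_score(video, [0, 1])
```

In this toy example the two concepts peak in different shots, so averaging over shots dilutes both, while per-concept shot selection keeps the strong evidence for each. This only illustrates the intuition of complementary cues across shots.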


    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 14, Issue 2
      May 2018
      208 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3210458

      Copyright © 2018 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 April 2018
      • Accepted: 1 December 2017
      • Revised: 1 October 2017
      • Received: 1 May 2017


      Qualifiers

      • research-article
      • Research
      • Refereed
