Abstract
Effective and efficient video retrieval has become a pressing need in the “big video” era. The objective of this work is to provide a principled model for computing the ranking score of a video in response to one or more concepts, where the concepts may be supplied directly by users or inferred by the system from user queries. Handling multi-concept queries has become a central component of modern video retrieval systems that accept text queries. However, this component has long been overlooked and is simply implemented as a weighted average of the corresponding concept detectors’ scores. Our approach, which can be viewed as a latent ranking SVM, integrates the advantages of several recent works in text and image retrieval, such as choosing ranking over structured prediction and modeling the inter-dependencies between query concepts. Videos consist of shots, and we use latent variables to account for the mutually complementary cues within and across shots. Concept labels of shots are scarce and noisy, so we introduce a simple and effective technique to make our model robust to outliers. Our approach gives superior performance not only on the queries seen during training but also on novel queries, some of which consist of more concepts than the queries used for training.
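For orientation, the weighted-averaging baseline that the abstract contrasts against, and the kind of pairwise margin loss that ranking SVMs generally optimize, can be sketched as follows. This is an illustrative sketch only, not the model proposed in this work: the max pooling over shots, the uniform default weights, the margin value, and all function names are assumptions introduced here.

```python
from typing import Dict, List, Optional

# One shot: concept name -> detector score (hypothetical representation).
Shot = Dict[str, float]


def video_concept_score(shots: List[Shot], concept: str) -> float:
    """Pool shot-level detector scores into one video-level score.

    Max pooling over shots is an assumption made for this sketch.
    """
    return max((shot.get(concept, 0.0) for shot in shots), default=0.0)


def weighted_average_score(
    shots: List[Shot],
    query: List[str],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Baseline: weighted average of the query concepts' detector scores."""
    if weights is None:
        weights = {concept: 1.0 for concept in query}  # uniform weights
    total = sum(weights[c] for c in query)
    return sum(weights[c] * video_concept_score(shots, c) for c in query) / total


def rank_videos(videos: Dict[str, List[Shot]], query: List[str]) -> List[str]:
    """Return video ids sorted by descending baseline score."""
    return sorted(
        videos,
        key=lambda vid: weighted_average_score(videos[vid], query),
        reverse=True,
    )


def pairwise_hinge_loss(score_pos: float, score_neg: float, margin: float = 1.0) -> float:
    """Generic pairwise ranking loss: penalize a relevant video that does not
    outscore an irrelevant one by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))


videos = {
    "a": [{"dog": 0.9, "beach": 0.1}, {"dog": 0.2, "beach": 0.8}],
    "b": [{"dog": 0.5, "beach": 0.3}],
}
query = ["dog", "beach"]
ranking = rank_videos(videos, query)
```

In this toy example, video "a" shows each query concept strongly in a different shot, which is the kind of complementary cue across shots that the abstract refers to; the baseline captures it only through the pooling step and ignores any dependency between the two concepts.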
Learning a Multi-Concept Video Retrieval Model with Multiple Latent Variables