Abstract
Recent years have witnessed an explosion of user-generated videos on the Web. To search these videos effectively and efficiently, modern video search engines must associate videos with semantic keywords automatically. Most existing video tagging methods can hardly achieve reliable performance because training data are deficient. Abundant well-tagged data, however, are available in other relevant types of media (e.g., images). In this article, we propose a novel video tagging framework, termed Cross-Media Tag Transfer (CMTT), which utilizes the abundance of well-tagged images to facilitate video tagging. Specifically, we build a “cross-media tunnel” to transfer knowledge from images to videos. To this end, we find an optimal kernel space in which the distribution distance between images and videos is minimized, which tackles the domain-shift problem. A novel cross-media video tagging model is then proposed to infer tags by exploring the intrinsic local structures of both labeled and unlabeled data and to learn reliable video classifiers. An efficient algorithm is designed to optimize the proposed model in an iterative, alternating manner. Extensive experiments illustrate the superiority of our proposal over state-of-the-art algorithms.
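As an illustration of what a "distribution distance" in a kernel space can look like, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) between two feature samples under a Gaussian kernel. The abstract does not say which distance CMTT uses, so the choice of MMD, the fixed kernel width `gamma`, and the synthetic two-dimensional "image" and "video" features are all assumptions made for this example. It is not the kernel-learning procedure of the article.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(images, videos, gamma=1.0):
    """Biased estimate of the squared MMD between two samples.

    Assumption: MMD stands in for the unspecified distribution distance.
    """
    return (mean_kernel(images, images, gamma)
            - 2.0 * mean_kernel(images, videos, gamma)
            + mean_kernel(videos, videos, gamma))


def sample(center, n, rng):
    """Hypothetical 2-D features drawn around `center` (synthetic data)."""
    return [[rng.gauss(center, 1.0) for _ in range(2)] for _ in range(n)]


rng = random.Random(0)
image_feats = sample(0.0, 50, rng)       # stand-in for image features
near_video_feats = sample(0.2, 50, rng)  # video domain with a small shift
far_video_feats = sample(3.0, 50, rng)   # video domain with a large shift

mmd_near = mmd_squared(image_feats, near_video_feats, gamma=0.5)
mmd_far = mmd_squared(image_feats, far_video_feats, gamma=0.5)
print("small shift:", mmd_near)
print("large shift:", mmd_far)
```

A larger domain shift yields a larger value, so a quantity of this kind can serve as the objective when searching for a kernel under which image and video features look alike.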