Abstract
We investigate the challenging problem of joint audio-visual analysis of generic videos for concept detection. We extract a novel local representation, the Audio-Visual Atom (AVA), defined as a region track associated with regional visual features and audio onset features. We develop a hierarchical algorithm to extract visual atoms from generic videos, and we locate energy onsets in the corresponding soundtrack by time-frequency analysis. Audio atoms are then extracted around these energy onsets. Together, visual and audio atoms form AVAs, from which discriminative audio-visual codebooks are constructed for concept detection. Experiments on Kodak's consumer benchmark videos confirm the effectiveness of our approach.
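To make the idea of locating energy onsets by time-frequency analysis concrete, the following is a minimal sketch and not the authors' method. It assumes a simple short-time Fourier transform, a log-energy difference between consecutive frames, and a fixed threshold. The function name `energy_onsets` and the parameters `frame_len`, `hop`, and `threshold` are illustrative assumptions that do not come from the abstract.

```python
import numpy as np

def energy_onsets(signal, sr, frame_len=512, hop=256, threshold=1.5):
    """Return times (seconds) where short-time spectral energy rises sharply.

    Illustrative sketch only: frame_len, hop and threshold are assumed values.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Windowed frames, one row per frame.
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Time-frequency representation: magnitude spectrum of each frame.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    energy = (spectrum ** 2).sum(axis=1)
    # Positive change in log energy between consecutive frames.
    log_energy = np.log(energy + 1e-10)
    flux = np.maximum(np.diff(log_energy), 0.0)
    # Keep local maxima of the change that exceed the assumed threshold.
    onsets = []
    for t in range(1, len(flux) - 1):
        if flux[t] > threshold and flux[t] >= flux[t - 1] and flux[t] > flux[t + 1]:
            onsets.append((t + 1) * hop / sr)
    return onsets

if __name__ == "__main__":
    # Synthetic soundtrack: half a second of silence, then a 440 Hz tone.
    sr = 8000
    sig = np.zeros(sr)
    sig[4000:] = np.sin(2 * np.pi * 440 * np.arange(4000) / sr)
    print(energy_onsets(sig, sr))
```

In a pipeline like the one the abstract describes, a short audio segment around each returned time would then be cut out and described by audio features. How that segment is chosen and described is not specified in the abstract, so it is left out of this sketch.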