research-article

Audio-visual atoms for generic video concept classification

Published: 27 August 2010

Abstract

We investigate the challenging problem of joint audio-visual analysis of generic videos for concept detection. We extract a novel local representation, the Audio-Visual Atom (AVA), defined as a region track associated with regional visual features and audio onset features. We develop a hierarchical algorithm to extract visual atoms from generic videos, and we locate energy onsets in the corresponding soundtrack by time-frequency analysis. Audio atoms are extracted around these energy onsets. Visual and audio atoms together form AVAs, from which discriminative audio-visual codebooks are constructed for concept detection. Experiments on Kodak's consumer benchmark videos confirm the effectiveness of our approach.
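
To make the audio side of the pipeline concrete, the following is a minimal sketch of locating energy onsets with a short-time Fourier analysis and cutting a fixed-length window around each one. It is an illustration under stated assumptions, not the authors' implementation: the function names (`energy_onsets`, `audio_atoms`), the frame length, hop size, log-energy threshold, and window duration are all hypothetical choices, and the abstract does not specify how onsets are scored or how audio atoms are featurized.

```python
import numpy as np

def energy_onsets(signal, sample_rate, frame_len=1024, hop=512, threshold=2.0):
    """Return onset times (seconds) where short-time spectral energy jumps.

    All parameter defaults are illustrative assumptions.
    """
    # Slice the signal into overlapping, Hann-windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Per-frame energy from the magnitude spectrum (a basic
    # time-frequency analysis); the small constant avoids log(0).
    energy = np.sum(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=1)
    log_e = np.log(energy + 1e-10)
    # Keep only increases in log energy between consecutive frames.
    rise = np.maximum(np.diff(log_e), 0.0)
    onsets = []
    for i in range(len(rise)):
        left = rise[i - 1] if i > 0 else 0.0
        right = rise[i + 1] if i + 1 < len(rise) else 0.0
        # An onset is a local peak of the rise that exceeds the threshold.
        if rise[i] > threshold and rise[i] >= left and rise[i] > right:
            # Report the center of the frame in which the energy rose.
            onsets.append(((i + 1) * hop + frame_len / 2) / sample_rate)
    return onsets

def audio_atoms(signal, sample_rate, onsets, atom_sec=0.25):
    """Cut a window of roughly atom_sec seconds around each onset."""
    half = int(atom_sec * sample_rate / 2)
    atoms = []
    for t in onsets:
        center = int(round(t * sample_rate))
        atoms.append(signal[max(0, center - half): center + half])
    return atoms

# Usage: half a second of silence followed by a 440 Hz tone.
sr = 8000
t = np.arange(sr) / sr
demo = np.where(t >= 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
demo_onsets = energy_onsets(demo, sr)
demo_atoms = audio_atoms(demo, sr, demo_onsets)
```

In this toy signal the detector should flag a single onset near the 0.5 s mark, with a time resolution limited by the hop size. A real soundtrack would need a threshold tuned to its noise floor.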
