ABSTRACT
This paper presents a novel approach for video representation, called bag-of-scenes. The proposed method is based on dictionaries of scenes, which provide a high-level representation for videos. Scenes carry much more semantic information than local features, especially for geotagging videos using visual content. Thus, each component of the representation model has self-contained semantics and, hence, can be directly related to a specific place of interest. Experiments were conducted in the context of the MediaEval 2011 Placing Task. The reported results compare our strategy with those of other participants that used only visual content to accomplish this task. Despite our very simple way of generating the visual dictionary, which selects photos at random, the results show that our approach presents high accuracy relative to the state-of-the-art solutions.
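The idea of a scene dictionary can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states only that the dictionary is built from photos taken at random, so the feature vectors, the Euclidean distance, the hard assignment of each frame to its nearest scene, and the normalized histogram are all assumptions made for illustration, as are the function names.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def build_scene_dictionary(photo_features: Sequence[Vector],
                           size: int, seed: int = 0) -> List[Vector]:
    """Pick `size` photo feature vectors at random to act as scenes.

    Random selection follows the abstract; the feature vectors
    themselves are a hypothetical input.
    """
    rng = random.Random(seed)
    return rng.sample(list(photo_features), size)


def nearest_scene(frame: Vector, dictionary: Sequence[Vector]) -> int:
    """Index of the scene closest to a frame (assumed Euclidean distance)."""
    return min(range(len(dictionary)),
               key=lambda i: math.dist(frame, dictionary[i]))


def bag_of_scenes(frames: Sequence[Vector],
                  dictionary: Sequence[Vector]) -> List[float]:
    """Normalized histogram of scene assignments over a video's frames.

    Hard assignment is an assumption; each bin corresponds to one scene.
    """
    counts = [0] * len(dictionary)
    for frame in frames:
        counts[nearest_scene(frame, dictionary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(dictionary)


# Toy usage: two scenes and four frames described by 2-D features.
scenes = [[0.0, 0.0], [1.0, 1.0]]
video_frames = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]]
signature = bag_of_scenes(video_frames, scenes)
```

In this sketch each histogram bin corresponds to one whole scene, which is what would let a component of the representation be tied to a specific place of interest.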
Index Terms
A visual approach for video geocoding using bag-of-scenes



