DOI: 10.1145/2502081.2502123
research-article

Human vs machine: establishing a human baseline for multimodal location estimation

Published: 21 October 2013

ABSTRACT

In recent years, the problem of video location estimation (i.e., estimating the longitude/latitude coordinates of a video without GPS information) has been approached with diverse methods and ideas in the research community, and significant improvements have been made. So far, however, systems have only been compared against each other, and no systematic study of human performance has been conducted. Based on a human-subject study with 11,900 experiments, this article presents a human baseline for location estimation for different combinations of modalities (audio, audio/video, audio/video/text). Furthermore, this article compares state-of-the-art location estimation systems with the human baseline. Although the overall performance of humans on multimodal video location estimation is better than that of current machine learning approaches, the difference is quite small: for 41% of the test set, the machine's accuracy was superior to the humans'. We present case studies and discuss why machines did better for some videos and not for others. Our analysis suggests new directions and priorities for future work on the improvement of location inference algorithms.
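As an illustration of the kind of per-video comparison behind a figure such as "41% of the test set", the following is a minimal sketch only. It assumes each estimate is scored by its great-circle (haversine) distance to the ground-truth coordinates and that the machine "wins" a video when its distance is strictly smaller. The abstract does not state the exact metric, so both the scoring rule and the names `haversine_km` and `machine_win_rate` are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def machine_win_rate(truth, human, machine):
    """Fraction of videos where the machine estimate is strictly closer
    to the ground truth than the human estimate (hypothetical scoring rule)."""
    wins = 0
    for t, h, m in zip(truth, human, machine):
        if haversine_km(*t, *m) < haversine_km(*t, *h):
            wins += 1
    return wins / len(truth)


# Made-up coordinates for two videos, purely for illustration.
truth = [(37.87, -122.27), (52.52, 13.40)]
human = [(37.80, -122.40), (48.85, 2.35)]
machine = [(40.71, -74.01), (52.50, 13.45)]
rate = machine_win_rate(truth, human, machine)
```

In this toy input the human is closer on the first video and the machine on the second, so `rate` is 0.5. The coordinates are invented and do not come from the study.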


Published in

MM '13: Proceedings of the 21st ACM international conference on Multimedia
October 2013, 1166 pages
ISBN: 9781450324045
DOI: 10.1145/2502081
Copyright © 2013 ACM
Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

MM '13 Paper Acceptance Rate: 47 of 235 submissions, 20%
Overall Acceptance Rate: 2,077 of 8,139 submissions, 26%
