research-article

Opinion Question Answering by Sentiment Clip Localization

Published: 02 November 2015
Abstract

This article considers multimedia question answering beyond factoid and how-to questions. We are interested in searching videos to answer opinion-oriented questions that are controversial and hotly debated. Examples include “Should Edward Snowden be pardoned?” and “Obamacare—unconstitutional or not?”. These questions often invoke emotional responses, positive or negative, and hence are likely to be better answered by videos than by text, because of the vivid display of emotional signals visible through facial expression and speaking tone. Nevertheless, a potential answer of duration 60s may be embedded in a video of 10min, resulting in a degraded user experience compared to reading the answer in text only. Furthermore, a text-based opinion question may be short and vague, while the video answers could be verbal, less structured grammatically, and noisy because of errors in speech transcription. Direct matching of words or syntactic analysis of sentence structure, as adopted in factoid and how-to question answering, is unlikely to find video answers. The first problem, answer localization, is addressed by audiovisual analysis of the emotional signals in videos to locate video segments likely to express opinions. The second problem, matching questions to answers, is tackled by a deep architecture that nonlinearly matches text words in questions and speech in videos. Experiments are conducted on eight controversial topics based on questions crawled from Yahoo! Answers and Internet videos from YouTube.
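To make the localization problem concrete, here is a minimal sketch of finding a short, emotionally salient clip inside a longer video. It is not the authors' method: the abstract does not say how emotional signals are scored, so the per-second `scores` list and the `best_window` helper are hypothetical, and a plain sliding-window search stands in for the audiovisual analysis.

```python
def best_window(scores, window):
    """Return (start, mean_score) of the contiguous window with the highest mean.

    `scores` is a hypothetical per-second emotion score for one video;
    `window` is the clip length in seconds (for example 60).
    """
    if window <= 0 or window > len(scores):
        raise ValueError("window must be between 1 and len(scores)")

    # Sum of the first window, then slide one second at a time.
    current = sum(scores[:window])
    best_sum, best_start = current, 0
    for start in range(1, len(scores) - window + 1):
        # Add the second entering the window, drop the one leaving it.
        current += scores[start + window - 1] - scores[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_sum / window
```

For a 10-minute video scored once per second, `best_window(scores, 60)` would return the start offset of the 60-second stretch with the strongest average signal under these assumptions.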

    Reviews

    Christoph F. Strnadl

    This application of artificial intelligence (AI) demonstrates two things: the astonishing types of questions algorithms can answer today, and (implicitly) what challenges and limitations we currently still face despite the current renewed hype in AI and deep learning. The authors develop and implement a three-phase algorithm for answering opinion questions (for example, "Is Obamacare unconstitutional or not?") based on the analysis of large sets of video clips. Phase one detects sentiment-oriented (that is, positive, negative, or neutral) speeches in video streams. Phase two then locates (within these speeches) segments where opinion holders (as opposed to other types of speakers like moderators) express their views. The third phase then tries to match potential answers (that is, spoken words of identified opinion holders) to the original question by treating this as a form of translation problem from a "question" vocabulary to an "answer" vocabulary. Whereas steps one and two use existing algorithms, the third step extends a published two-layer neural network to a four-layer neural network endowed with an increased number of neurons in and more connections between the layers. Despite the authors' claim, however, this is not really a deep learning neural network because intra-layer connections or advanced features (for example, neurons possessing memory) are missing altogether. In total, the model contains 506,840 parameters, which have been learned from various (pre-analyzed) data sets. A comparison with four existing models demonstrates that the performance of the new model is approximately 20 percent better than the second best. The study also finds that users prefer the identified video answers to the extracted text answers with statistical significance.

    The work is geared toward the professional or academic fluent in Bayesian statistics as applied to sentiment analysis and neural networks because prior models used or even crucial acronyms (for example, discounted cumulative gain (DCG)) are not explained.

    Online Computing Reviews Service
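Since the review notes that discounted cumulative gain (DCG) is left unexplained, a short sketch of the metric may help. This uses one common textbook formulation (linear gain, log2 position discount); which variant the article actually uses is not stated here, so treat the function names and the discount form as assumptions.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain for graded relevance scores in ranked order."""
    if k is not None:
        relevances = relevances[:k]
    # The item at 0-based position i is discounted by log2(i + 2),
    # so the top-ranked item counts in full and lower ranks count less.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the most relevant answers first scores higher than the same answers in a worse order, and `ndcg` rescales the result so that a perfect ordering scores 1.0.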
