Abstract
Sentiments and opinions expressed in Web pages towards objects, entities, and products constitute an important portion of the textual content available on the Web. In the last decade, the analysis of such content has gained importance due to its high potential for monetization. Despite the vast interest in sentiment analysis, the discovery of sentimental or opinionated Web content has, somewhat surprisingly, been mostly ignored. This work aims to fill this gap by addressing the problem of quickly discovering and fetching the sentimental content present on the Web. To this end, we design a sentiment-focused Web crawling framework. In particular, we propose several sentiment-focused Web crawling strategies that prioritize discovered URLs based on their predicted sentiment scores. Simulations show that these strategies achieve a considerable performance improvement over general-purpose Web crawling strategies in the discovery of sentimental Web content.
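The idea of prioritizing discovered URLs by predicted sentiment score can be illustrated with a small simulation. The following is a minimal sketch, not the strategies proposed in this work: it assumes the crawl frontier is a priority queue, that some predictor assigns each discovered URL a score, and that the Web is a tiny in-memory link graph. The names `crawl_order`, `LINKS`, `PREDICTED`, and `toy_score` are hypothetical, and the scores are made up for illustration.

```python
import heapq
from itertools import count

def crawl_order(seeds, links, score, budget):
    """Simulate a crawl that always fetches the discovered URL with the
    highest predicted sentiment score (hypothetical helper)."""
    frontier = []      # min-heap; scores are negated so the highest pops first
    tie = count()      # insertion counter breaks ties between equal scores
    seen = set()
    for url in seeds:
        seen.add(url)
        heapq.heappush(frontier, (-score(url), next(tie), url))

    fetched = []
    while frontier and len(fetched) < budget:
        _, _, url = heapq.heappop(frontier)
        fetched.append(url)
        # "Fetching" a page reveals its out-links, which join the frontier.
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-score(out), next(tie), out))
    return fetched

# Toy link graph and made-up predicted scores (assumptions for illustration).
LINKS = {"seed": ["news", "review"], "review": ["forum"], "news": ["sports"]}
PREDICTED = {"seed": 0.1, "news": 0.2, "review": 0.9, "forum": 0.7, "sports": 0.3}

def toy_score(url):
    # Stand-in for a real sentiment predictor; unknown URLs score 0.0.
    return PREDICTED.get(url, 0.0)

order = crawl_order(["seed"], LINKS, toy_score, budget=4)
print(order)
```

In this toy graph the crawl reaches the high-scoring `review` and `forum` pages before `news`, whereas a breadth-first crawl from the same seed would fetch `news` and `review` before `forum`.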