Abstract
There has been a recent swell of interest in the analysis of blog comments. However, much of this work focuses on detecting comment spam in the blogosphere. An important issue that has been neglected so far is the identification of diversionary comments, defined as comments that divert the topic away from that of the original post. A possible purpose is to distract readers from the original topic and draw attention to a new one. We categorize diversionary comments into five types based on our observations and propose an effective framework to identify and flag them. To the best of our knowledge, the problem of detecting diversionary comments has not been studied so far. We approach the problem in two ways: (i) as a ranking problem, ordering all comments in descending order of how likely they are to be diversionary, and (ii) as a classification problem. Our evaluation on 4,179 comments under 40 different blog posts from Digg and Reddit shows that the proposed method achieves a mean average precision of 91.9% when the problem is treated as a ranking problem and an F-measure of 84.9% when it is treated as a classification problem. Sensitivity analysis indicates that the effectiveness of the method is stable under different parameter settings.
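For readers less familiar with the two evaluation measures named above, the following is a minimal sketch of how mean average precision and F-measure are commonly computed. It is not the authors' evaluation code. The function names and the toy labels are illustrative assumptions, and the balanced F-measure (F1) is assumed because the abstract does not say which variant was used.

```python
from typing import Sequence


def average_precision(ranked_labels: Sequence[bool]) -> float:
    """Average precision for one post's ranked comment list.

    ranked_labels[i] is True when the comment at rank i + 1 is
    diversionary according to the ground truth (hypothetical input).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_diversionary in enumerate(ranked_labels, start=1):
        if is_diversionary:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-post average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


def f_measure(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Balanced F-measure (F1) for binary labels; an assumed variant."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum(a and not p for p, a in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only: two posts with ranked comments.
toy_rankings = [[True, False, True], [False, True]]
toy_map = mean_average_precision(toy_rankings)

# Toy classification output for four comments.
toy_f = f_measure([True, True, False, False], [True, False, True, False])
```

In the ranking view, a diversionary comment placed near the top of the list contributes more to the score than one placed further down. The classification view weighs every comment equally.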