Diversionary Comments under Blog Posts

Published: 24 September 2015
Abstract

There has been a recent swell of interest in the analysis of blog comments. However, much of the work focuses on detecting comment spam in the blogosphere. An important issue that has been neglected so far is the identification of diversionary comments. Diversionary comments are defined as comments that divert the topic from the original post. A possible purpose is to distract readers from the original topic and draw attention to a new topic. We categorize diversionary comments into five types based on our observations and propose an effective framework to identify and flag them. To the best of our knowledge, the problem of detecting diversionary comments has not been studied so far. We solve the problem in two different ways: (i) ranking all comments in descending order of how diversionary they are and (ii) treating it as a classification problem. Our evaluation on 4,179 comments under 40 different blog posts from Digg and Reddit shows that the proposed method achieves a high mean average precision of 91.9% when the problem is treated as a ranking problem and an F-measure of 84.9% when it is treated as a classification problem. Sensitivity analysis indicates that the effectiveness of the method is stable under different parameter settings.
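
The two evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names and the toy labels are hypothetical, and the sketch assumes the standard definitions of average precision (averaged over posts) and of the balanced F-measure, with label 1 meaning "diversionary" and 0 meaning "not diversionary".

```python
from typing import List, Sequence


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Average precision for one post.

    ranked_labels holds the gold labels (1 = diversionary, 0 = not) in the
    order in which a system ranked the comments, most diversionary first.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank of each diversionary comment.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: List[Sequence[int]]) -> float:
    """Mean of the per-post average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


def f_measure(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy data: two posts with ranked gold labels.
    toy_rankings = [[1, 0, 1, 0], [0, 1]]
    print(mean_average_precision(toy_rankings))
    # Hypothetical toy predictions against gold labels.
    print(f_measure([1, 1, 0, 0], [1, 0, 1, 0]))
```

In the ranking setting each post contributes one average precision value, so a post with few comments weighs as much as a post with many; the F-measure, by contrast, is computed here over individual comment decisions.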
