A Graph Analytical Approach for Topic Detection

Published: 01 December 2013

Abstract

Topic detection in large and noisy data collections such as social media must address both scalability and accuracy challenges. KeyGraph is an efficient method that improves on current solutions by considering keyword co-occurrence. We show that KeyGraph achieves accuracy similar to state-of-the-art approaches on small, well-annotated collections, and that it can successfully filter irrelevant documents and identify events in large and noisy social media collections. An extensive evaluation using Amazon’s Mechanical Turk demonstrated the increased accuracy and high precision of KeyGraph, as well as superior runtime performance compared to other solutions.
