Abstract
Topic detection in large and noisy data collections such as social media must address both scalability and accuracy challenges. KeyGraph is an efficient method that improves on current solutions by considering keyword co-occurrence. We show that KeyGraph achieves accuracy similar to that of state-of-the-art approaches on small, well-annotated collections, and that it can successfully filter irrelevant documents and identify events in large and noisy social media collections. An extensive evaluation using Amazon’s Mechanical Turk demonstrated the increased accuracy and high precision of KeyGraph, as well as superior runtime performance compared to other solutions.
A Graph Analytical Approach for Topic Detection