ABSTRACT
Clusters in document streams, such as online news articles, can be induced by their textual contents as well as by the temporal dynamics of their arrival patterns. Can we leverage both sources of information to obtain a better clustering of the documents, and distill information that cannot be extracted from contents alone? In this paper, we propose a novel random process, referred to as the Dirichlet-Hawkes process, which accounts for both sources of information in a unified framework. A distinctive feature of the proposed model is that the preferential attachment of items to clusters according to cluster sizes, present in Dirichlet processes, is now driven by the intensities of cluster-wise self-exciting temporal point processes, the Hawkes processes. This new model establishes a previously unexplored connection between Bayesian nonparametrics and temporal point processes, which lets the number of clusters grow to accommodate the increasing complexity of online streaming contents while adapting to the ever-changing dynamics of the respective continuous arrival times. We conduct large-scale experiments on both synthetic and real-world news articles, and show that Dirichlet-Hawkes processes can recover both meaningful topics and temporal dynamics, which leads to better predictive performance in terms of content perplexity and arrival time of future documents.
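To make the preferential-attachment idea concrete, the following is a minimal sketch of the temporal part only, not the model as specified in the paper. It assumes an exponentially decaying triggering kernel and a constant rate for opening a new cluster; the names `cluster_intensity`, `assignment_probs`, `assign`, `alpha`, `beta`, and `base_rate` are hypothetical and chosen for illustration. Document contents are ignored here.

```python
import math
import random


def cluster_intensity(event_times, t, alpha=1.0, beta=1.0):
    """Self-exciting intensity of one cluster at time t.

    Assumption: each past event adds alpha * exp(-beta * (t - s)),
    an exponential kernel picked only for illustration.
    """
    return sum(alpha * math.exp(-beta * (t - s)) for s in event_times if s < t)


def assignment_probs(clusters, t, base_rate=0.5, alpha=1.0, beta=1.0):
    """Probabilities of joining each existing cluster at time t.

    The last entry is the probability of opening a new cluster.
    Assumption: new clusters open at a constant rate base_rate > 0.
    """
    weights = [cluster_intensity(times, t, alpha, beta) for times in clusters]
    weights.append(base_rate)
    total = sum(weights)
    return [w / total for w in weights]


def assign(clusters, t, rng, **kernel_params):
    """Sample a cluster for a document arriving at time t and record it."""
    probs = assignment_probs(clusters, t, **kernel_params)
    k = rng.choices(range(len(probs)), weights=probs)[0]
    if k == len(clusters):
        clusters.append([])  # the sampled index refers to a brand-new cluster
    clusters[k].append(t)
    return k


# Cluster 0 is larger but stale; cluster 1 is smaller but recently active.
clusters = [[0.0, 0.1, 0.2], [9.8, 9.9]]
probs = assignment_probs(clusters, t=10.0)
```

In this toy setup the smaller but recently active cluster receives more probability mass than the larger stale one, which is the contrast with size-based attachment in a plain Dirichlet process that the abstract describes.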
Index Terms
Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams