ABSTRACT
Searching for information about people in search engines is a common and seemingly straightforward task, but it is often hampered by name ambiguity. While users are typically interested in a single person, result pages usually mix several people who share the same name. Several approaches to personal name disambiguation exist; however, it remains difficult to understand the impact of each approach in isolation. In this paper, we present a plugin-based framework for comparing name disambiguation approaches and identifying the most promising ones. The framework allowed us to merge different approaches in search of good combinations for this task and to compare state-of-the-art solutions on a common dataset. Preliminary results indicate that biographical information has a greater impact on clustering, that comprehensive texts work better than metadata alone, and that TF-IDF works better than more complex approaches.
Towards a fair comparison between name disambiguation approaches




