ABSTRACT
This paper describes how to automatically cross-reference documents with Wikipedia: the largest knowledge base ever known. It explains how machine learning can be used to identify significant terms within unstructured text, and enrich it with links to the appropriate Wikipedia articles. The resulting link detector and disambiguator performs very well, with recall and precision of almost 75%. This performance is constant whether the system is evaluated on Wikipedia articles or "real world" documents.
This work has implications far beyond enriching documents with explanatory links. It can provide structured knowledge about any unstructured fragment of text. Any task that is currently addressed with bags of words (indexing, clustering, retrieval, and summarization, to name a few) could use the techniques described here to draw on a vast network of concepts and semantics.
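As a rough illustration of how the recall and precision figures quoted above are typically computed for a link detector, the sketch below compares a set of predicted links against a set of gold-standard links. It is a minimal sketch, not the paper's evaluation code: the `(anchor text, article title)` representation, the function name `precision_recall`, and the toy data are all assumptions made for this example.

```python
from typing import Set, Tuple

# Assumption: a link is represented as (anchor text, target article title).
Link = Tuple[str, str]


def precision_recall(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted links against gold-standard links."""
    # A predicted link counts as correct only if both the anchor and the
    # chosen article match, so a wrong disambiguation is an error.
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold = {
    ("tree", "Tree (data structure)"),
    ("heap", "Heap (data structure)"),
    ("graph", "Graph (abstract data type)"),
    ("Java", "Java (programming language)"),
}
predicted = {
    ("tree", "Tree (data structure)"),
    ("heap", "Heap (data structure)"),
    ("Java", "Java (programming language)"),
    ("stack", "Stack (abstract data type)"),  # not in the gold standard
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four predicted links are correct and three of the four gold links are found, so both measures come out at 0.75 on this toy input.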