Abstract
Throughout the world, people can post information about their local area in their own languages using social networking services. Multilingual short text clustering is an important task to organize such information, and it can be applied to various applications, such as event detection and summarization. However, measuring the relatedness between short texts written in various languages is a challenging problem. In addition to handling multiple languages, the semantic gaps among all languages must be considered. In this article, we propose two Wikipedia-based semantic relatedness measurement methods for multilingual short text clustering. The proposed methods solve the semantic gap problem by incorporating the inter-language links of Wikipedia into Extended Naive Bayes (ENB), a probabilistic method that can be applied to measure semantic relatedness among monolingual short texts. The proposed methods represent a multilingual short text as a vector of the English version of Wikipedia articles (entities). By transferring texts to a unified vector space, the relatedness between texts in different languages with similar meanings can be increased. We also propose an approach that can improve clustering performance and reduce the processing time by eliminating language-specific entities in the unified vector space. Experimental results on multilingual Twitter message clustering revealed that the proposed methods outperformed cross-lingual explicit semantic analysis, a previously proposed method to measure relatedness between texts in different languages. Moreover, the proposed methods were comparable to ENB applied to texts translated into English using a proprietary translation service. The proposed methods enabled relatedness measurements for multilingual short text clustering without requiring machine translation processes.
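The unified vector space described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' implementation: the link table `INTERLANGUAGE_LINKS`, the function names, and the entity weights are hypothetical toy values (in the article, entity weights would come from ENB). The abstract does not say how relatedness is computed in the unified space or how language-specific entities are identified, so the sketch assumes cosine similarity and treats any entity without an English inter-language link as language-specific.

```python
import math

# Hypothetical inter-language link table (toy data, not taken from Wikipedia dumps):
# (language code, local article title) -> English article title, or None if no link exists.
INTERLANGUAGE_LINKS = {
    ("ja", "地震"): "Earthquake",
    ("ja", "東京"): "Tokyo",
    ("ja", "花見"): None,  # assumed here to have no English counterpart
    ("es", "Terremoto"): "Earthquake",
    ("es", "Tokio"): "Tokyo",
}


def to_english_vector(entity_weights, lang, links=INTERLANGUAGE_LINKS):
    """Map {local entity: weight} to {English entity: weight}.

    Entities with no English link are dropped; this is one simple reading of
    "eliminating language-specific entities" and is an assumption of this sketch.
    """
    vector = {}
    for entity, weight in entity_weights.items():
        english = entity if lang == "en" else links.get((lang, entity))
        if english is None:
            continue  # language-specific entity: not representable in the unified space
        vector[english] = vector.get(english, 0.0) + weight
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(entity, 0.0) for entity, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy usage: a Japanese and a Spanish text about the same event.
japanese_text = to_english_vector({"地震": 0.8, "東京": 0.6}, "ja")
spanish_text = to_english_vector({"Terremoto": 0.8, "Tokio": 0.6}, "es")
print(cosine(japanese_text, spanish_text))
```

Because both texts are mapped onto the same English entities before comparison, their relatedness no longer depends on surface vocabulary, which is the effect the abstract attributes to the unified vector space. Dropping unlinked entities also shortens the vectors, which is consistent with the reported reduction in processing time.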
Index Terms
Wikipedia-Based Relatedness Measurements for Multilingual Short Text Clustering