
Wikipedia-Based Relatedness Measurements for Multilingual Short Text Clustering

Published: 14 December 2018

Abstract

Throughout the world, people post information about their local area in their own languages using social networking services. Multilingual short text clustering is an important task for organizing such information, and it supports applications such as event detection and summarization. However, measuring the relatedness between short texts written in different languages is challenging: in addition to handling multiple languages, the semantic gaps among languages must be considered. In this article, we propose two Wikipedia-based semantic relatedness measurement methods for multilingual short text clustering. The proposed methods address the semantic gap problem by incorporating the inter-language links of Wikipedia into Extended Naive Bayes (ENB), a probabilistic method for measuring semantic relatedness among monolingual short texts. The proposed methods represent a multilingual short text as a vector of articles (entities) in the English version of Wikipedia. By transferring texts to a unified vector space, the relatedness between texts in different languages with similar meanings can be increased. We also propose an approach that improves clustering performance and reduces processing time by eliminating language-specific entities from the unified vector space. Experimental results on multilingual Twitter message clustering showed that the proposed methods outperformed cross-lingual explicit semantic analysis, a previously proposed method for measuring relatedness between texts in different languages. Moreover, the proposed methods were comparable to ENB applied to texts translated into English using a proprietary translation service. The proposed methods thus enable relatedness measurement for multilingual short text clustering without requiring machine translation.
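
The unified vector space described in the abstract can be illustrated with a minimal sketch. This is not the ENB-based method itself: the entity weights below are hand-set placeholders rather than ENB probabilities, the `INTERLANGUAGE_LINKS` table and the function names `to_english_space` and `cosine` are hypothetical, and treating "language-specific" as "has no inter-language link to English" is an assumption made for illustration. The sketch only shows how texts in different languages might become comparable once their entities are mapped to English Wikipedia articles.

```python
import math

# Hypothetical inter-language links: (language, local article title) -> English title.
# Real links would come from Wikipedia data; these entries are made up for illustration.
INTERLANGUAGE_LINKS = {
    ("ja", "地震"): "Earthquake",
    ("ja", "東京"): "Tokyo",
    ("es", "Terremoto"): "Earthquake",
    ("es", "Tokio"): "Tokyo",
}


def to_english_space(lang, entity_weights):
    """Map a {local entity: weight} vector into the English entity space.

    Assumption: an entity with no English counterpart is treated as
    language-specific and dropped from the unified vector.
    """
    if lang == "en":
        return dict(entity_weights)
    unified = {}
    for entity, weight in entity_weights.items():
        english = INTERLANGUAGE_LINKS.get((lang, entity))
        if english is not None:
            unified[english] = unified.get(english, 0.0) + weight
    return unified


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Placeholder weights; in the article these would come from the ENB-based methods.
ja_text = to_english_space("ja", {"地震": 0.7, "東京": 0.3, "ローカル記事": 0.5})
es_text = to_english_space("es", {"Terremoto": 0.6, "Tokio": 0.4})
print(cosine(ja_text, es_text))
```

In this toy example the Japanese and Spanish texts share no surface vocabulary, yet both map to the same two English entities, so their cosine similarity is high. The unlinked Japanese entity is discarded, which also shrinks the vector.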



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 2
June 2019
208 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3300146

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 14 December 2018
• Accepted: 1 September 2018
• Received: 1 November 2017


          Qualifiers

          • research-article
          • Research
          • Refereed
