research-article

Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement

Published: 02 April 2018

Abstract

Word embedding-based methods have received increasing attention for their flexibility and effectiveness in many natural language processing (NLP) tasks, including word similarity (WS) measurement. However, these approaches rely on a high-quality corpus and neglect prior knowledge. Lexicon-based methods draw on the human knowledge encoded in semantic resources, e.g., Tongyici Cilin, HowNet, and Chinese WordNet, but they cannot handle unknown words. This article proposes a three-stage framework for measuring Chinese word similarity by incorporating prior knowledge obtained from lexicons and statistics into word embedding. In the first stage, we use retrieval techniques to crawl the contexts of word pairs from web resources to extend the context corpus. In the second stage, we investigate three types of single similarity measurements: lexicon similarities, statistical similarities, and embedding-based similarities. In the final stage, we exploit simple combination strategies based on math operations and a counter-fitting combination strategy based on an optimization method. To demonstrate our system’s effectiveness, comparative experiments are conducted on the PKU-500 dataset. Our final results are Spearman and Pearson correlation coefficients of 0.561 and 0.516, respectively, which, to the best of our knowledge, outperform the state of the art. Experimental results on the Chinese MC-30 and SemEval-2012 datasets show that our system also performs well on other Chinese datasets, which indicates its transferability. In addition, our system is not language-specific and can be applied to other languages, e.g., English.
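As a rough illustration of the last stage and of the evaluation, the sketch below combines several single similarity scores per word pair with a simple math operation and then scores the result against human ratings using Spearman and Pearson correlation. It is a minimal sketch under stated assumptions: the abstract does not say which math operations are used, so the `mean`, `max`, and `min` strategies, the function names, and all numeric values are hypothetical and are not taken from the article or from PKU-500. The counter-fitting strategy is not shown.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def combine(scores, strategy="mean"):
    """Merge one word pair's single similarity scores into one value.

    `scores` maps a measurement name to a similarity in [0, 1].
    The strategies here are assumed examples of "simple math operations".
    """
    values = list(scores.values())
    if strategy == "mean":
        return mean(values)
    if strategy == "max":
        return max(values)
    if strategy == "min":
        return min(values)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    # Toy scores for four word pairs (invented for illustration only).
    single_scores = [
        {"lexicon": 0.90, "statistical": 0.70, "embedding": 0.80},
        {"lexicon": 0.20, "statistical": 0.40, "embedding": 0.30},
        {"lexicon": 0.60, "statistical": 0.50, "embedding": 0.75},
        {"lexicon": 0.10, "statistical": 0.05, "embedding": 0.20},
    ]
    human_ratings = [9.1, 3.5, 6.0, 1.2]  # invented gold ratings

    for strategy in ("mean", "max", "min"):
        predicted = [combine(s, strategy) for s in single_scores]
        print(strategy,
              round(spearman(predicted, human_ratings), 3),
              round(pearson(predicted, human_ratings), 3))
```

Spearman correlation compares only the ordering of the predicted and human scores, whereas Pearson correlation is sensitive to their linear relationship, which is why the two figures reported for a system can differ.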


    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 17, Issue 3
      September 2018
      196 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3184403

      Copyright © 2018 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 2 April 2018
      • Accepted: 1 January 2018
      • Revised: 1 November 2017
      • Received: 1 January 2017

      Qualifiers

      • research-article
      • Research
      • Refereed
