Abstract
Word embedding-based methods have received increasing attention for their flexibility and effectiveness in many natural language processing (NLP) tasks, including word similarity (WS). However, these approaches rely on high-quality corpora and neglect prior knowledge. Lexicon-based methods draw on the human knowledge encoded in semantic resources, e.g., Tongyici Cilin, HowNet, and Chinese WordNet, but they cannot handle unknown words. This article proposes a three-stage framework for measuring Chinese word similarity by incorporating prior knowledge obtained from lexicons and statistics into word embedding. In the first stage, we use retrieval techniques to crawl the contexts of word pairs from web resources to extend the context corpus. In the second stage, we investigate three types of single similarity measurements: lexicon similarities, statistical similarities, and embedding-based similarities. In the final stage, we exploit simple combination strategies based on mathematical operations and a counter-fitting combination strategy based on optimization. To demonstrate our system’s effectiveness, comparative experiments are conducted on the PKU-500 dataset. Our final results are 0.561/0.516 in Spearman/Pearson correlation coefficients, which, to the best of our knowledge, outperform the state of the art. Experimental results on the Chinese MC-30 and SemEval-2012 datasets show that our system also performs well on other Chinese datasets, which indicates its transferability. Moreover, our system is not language-specific and can be applied to other languages, e.g., English.
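To make the combination and evaluation steps concrete, the following is a minimal sketch rather than the article's implementation. The abstract does not say which mathematical operations are used, so the `mean`, `max`, and `weighted` strategies here are assumptions, as are the function names and the toy scores, which are invented for illustration only. The counter-fitting strategy is not shown. The sketch combines three single similarity scores per word pair and compares the result against human ratings with the Spearman and Pearson correlation coefficients.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def combine(scores, strategy="mean", weights=None):
    """Combine single similarity scores for one word pair.

    The strategies are hypothetical examples of "simple combination
    strategies"; the source does not list the operations it uses.
    """
    if strategy == "mean":
        return mean(scores)
    if strategy == "max":
        return max(scores)
    if strategy == "weighted":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weighted strategy needs one weight per score")
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    raise ValueError(f"unknown strategy: {strategy}")


# Invented toy scores per word pair: (lexicon, statistical, embedding).
single_scores = [
    (0.90, 0.70, 0.80),
    (0.20, 0.40, 0.30),
    (0.60, 0.50, 0.70),
    (0.10, 0.20, 0.15),
]
# Invented human ratings for the same four pairs.
gold = [9.0, 3.5, 6.0, 1.0]

combined = [combine(s, strategy="mean") for s in single_scores]
print("Spearman:", spearman(combined, gold))
print("Pearson:", pearson(combined, gold))
```

Swapping the `strategy` argument lets the same evaluation loop compare combination strategies on a fixed set of word pairs, which mirrors the kind of comparison the abstract reports on PKU-500.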
Index Terms
Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement