Abstract
We propose a novel framework for Chinese Spelling Check (CSC), the task of automatically detecting and correcting Chinese spelling errors. The framework has two key components: candidate generation and candidate ranking. Unlike previous approaches, which rely on either a Statistical Machine Translation (SMT) model or a Language Model (LM) alone, we use both SMT and LM models to generate correction candidates, in order to maximize recall; to improve precision, we then employ a Support Vector Machine (SVM) classifier to rank the candidates generated by the SMT and LM models. Experiments show that our framework outperforms the other systems in the SIGHAN 7 shared task that used the same or similar resources as ours; even compared with the state-of-the-art systems, which used more resources, such as a considerably large dictionary, an idiom dictionary, and other semantic information, our framework still obtains competitive results. Furthermore, to address the scarcity of resources for training the SMT model, we generate around 2 million artificial training sentences using the Chinese character confusion sets provided by the SIGHAN 7 shared task, which group Chinese characters with similar shapes and similar pronunciations.
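The abstract does not spell out how the artificial training sentences are built, so the following is only a minimal sketch of one plausible reading: take a correct sentence, replace a single character with a confusable one, and keep the (erroneous, correct) pair as SMT training data. The toy `CONFUSION_SET`, the function name `make_artificial_pairs`, and the one-substitution-per-sentence policy are all assumptions for illustration, not the authors' procedure or the actual SIGHAN 7 data.

```python
import random

# Hypothetical toy confusion set: each character maps to characters that are
# similar in shape or pronunciation. The real SIGHAN 7 sets are far larger.
CONFUSION_SET = {
    "在": ["再"],
    "的": ["得", "地"],
}


def make_artificial_pairs(sentence, confusion_set, rng, max_pairs=3):
    """Return (erroneous, correct) pairs, each with one substituted character."""
    # Positions whose character has at least one confusable alternative.
    positions = [i for i, ch in enumerate(sentence) if ch in confusion_set]
    rng.shuffle(positions)
    pairs = []
    for i in positions[:max_pairs]:
        wrong = rng.choice(confusion_set[sentence[i]])
        noisy = sentence[:i] + wrong + sentence[i + 1:]
        pairs.append((noisy, sentence))
    return pairs


rng = random.Random(0)  # fixed seed so repeated runs give the same pairs
pairs = make_artificial_pairs("我在家的时候", CONFUSION_SET, rng)
for noisy, clean in pairs:
    print(noisy, "->", clean)
```

Each pair can then be treated as a parallel "source/target" sentence pair, with the erroneous sentence as the source side and the correct sentence as the target side.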
Index Terms
A Hybrid Ranking Approach to Chinese Spelling Check