Abstract
Spelling check is more challenging for Chinese than for other languages. This article presents a hybrid model for Chinese spelling check. The hybrid model consists of three components: one graph-based model for generic errors and two independently trained models for specific errors. In the graph model, a directed acyclic graph is generated for each sentence, and a single-source shortest-path algorithm is run on the graph to detect and correct general spelling errors simultaneously. Before that, two types of errors over function words (characters) are resolved by conditional random fields: confusion between “在” (at; pinyin: zai) and “再” (again, more, then; pinyin: zai), and confusion among “的” (of; pinyin: de), “地” (-ly, adverb-forming particle; pinyin: de), and “得” (so that, have to; pinyin: de). Finally, a rule-based model is used to resolve pronoun confusion between “她” (she; pinyin: ta) and “他” (he; pinyin: ta), along with some other common collocation errors. The proposed model is evaluated on the standard datasets released by the SIGHAN Bake-off shared tasks and achieves state-of-the-art results.
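The graph component described above can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon, its costs, the confusion set, the substitution penalty, and the function names are all hypothetical values chosen for illustration. Nodes are character positions, each edge is a candidate word (possibly with a confusable character substituted), and because every edge points forward, the single-source shortest path can be found with one left-to-right pass.

```python
from itertools import product

# Hypothetical toy resources; a real system would derive costs from a
# language model and use a large confusion set of similar characters.
LEXICON = {"今天": 1.0, "天气": 1.0, "天": 2.0, "汽": 4.0,
           "气": 3.0, "很": 1.0, "好": 1.0, "很好": 1.2}
CONFUSION = {"汽": ["气"], "气": ["汽"]}
SUB_PENALTY = 1.5    # assumed extra cost per substituted character
UNKNOWN_COST = 10.0  # assumed cost of keeping an out-of-lexicon character
MAX_WORD_LEN = 2


def candidate_edges(sentence):
    """Yield (start, end, word, cost) edges of the sentence DAG."""
    n = len(sentence)
    for i in range(n):
        for j in range(i + 1, min(n, i + MAX_WORD_LEN) + 1):
            span = sentence[i:j]
            # Each character may stay as is or be swapped for a confusable one.
            options = [[ch] + CONFUSION.get(ch, []) for ch in span]
            for chars in product(*options):
                word = "".join(chars)
                if word in LEXICON:
                    changed = sum(a != b for a, b in zip(word, span))
                    yield i, j, word, LEXICON[word] + SUB_PENALTY * changed
        if sentence[i] not in LEXICON:
            # Fallback edge so that every position stays reachable.
            yield i, i + 1, sentence[i], UNKNOWN_COST


def correct(sentence):
    """Return (corrected sentence, path cost) via shortest path on the DAG."""
    n = len(sentence)
    if n == 0:
        return "", 0.0
    edges_from = {}
    for edge in candidate_edges(sentence):
        edges_from.setdefault(edge[0], []).append(edge)
    best = [float("inf")] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    # Positions 0..n are already in topological order.
    for i in range(n):
        for _, j, word, cost in edges_from.get(i, []):
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = (i, word)
    words = []
    pos = n
    while pos > 0:
        pos, word = back[pos]
        words.append(word)
    return "".join(reversed(words)), best[n]


print(correct("今天天汽很好")[0])  # → 今天天气很好
```

Detection and correction happen in the same pass: any position where the chosen path differs from the input is both flagged and fixed. With these toy costs, the path through “天气” (one substitution) is cheaper than keeping “天” and “汽” as separate words.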
A Hybrid Model for Chinese Spelling Check