Abstract
Chinese spell checkers are difficult to develop because of two features of the language: 1) there are no word boundaries, and a character may function as a word or as a morpheme within a word; and 2) the Chinese character set contains more than ten thousand characters. The former makes it difficult for a spell checker to detect spelling errors, and the latter makes it difficult to construct error models. We develop a word lattice decoding model for a Chinese spell checker that addresses both difficulties. The model performs word segmentation and error correction simultaneously, thereby solving the word boundary problem, and it corrects nonword errors as well as real-word errors. To better estimate the error distribution over a large character set, we also propose a methodology for automatically extracting spelling error samples from the Google Web 1T corpus. Because the corpus is so large, many spelling error samples can be extracted, which better reflects the distribution of spelling errors in the real world. Finally, to make the spell checker more useful in real applications, we produce n-best suggestions for spelling error corrections. We test the proposed approach on the Bakeoff 2013 CSC Datasets; the results show that the proposed methods with the error model significantly outperform Chinese spell checkers that do not use error models.
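The idea of segmenting and correcting in one pass can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon, its probabilities, the confusion set, and the error-model constants below are all hypothetical toy values, and a unigram language model stands in for whatever language model the paper uses. Each lattice edge is a lexicon word that could have produced a span of the input, scored by language-model probability times error-model probability, and an n-best dynamic program picks the highest-scoring paths.

```python
import heapq
import math

# Toy unigram language model: word -> probability (hypothetical values).
LEXICON = {
    "今天": 0.05, "天氣": 0.04, "很": 0.06, "好": 0.06,
    "今": 0.005, "天": 0.01, "氣": 0.004, "汽": 0.003,
}
# Toy confusion set: observed character -> characters it may be a typo for.
CONFUSION = {"汽": {"氣"}, "氣": {"汽"}}
P_KEEP = 0.95              # assumed P(character was typed correctly)
P_ERR = 0.05               # assumed P(character is a confusable typo)
OOV_LOGP = math.log(1e-6)  # fallback score for unknown single characters
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def channel_logp(observed, intended):
    """Log P(observed | intended) under the toy error model, or None if impossible."""
    total = 0.0
    for o, c in zip(observed, intended):
        if o == c:
            total += math.log(P_KEEP)
        elif c in CONFUSION.get(o, ()):
            total += math.log(P_ERR)
        else:
            return None
    return total


def build_lattice(sentence):
    """Return edges[end] = list of (start, word, score) covering sentence[start:end]."""
    edges = {end: [] for end in range(1, len(sentence) + 1)}
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + MAX_WORD_LEN)
        for end in range(start + 1, last_end + 1):
            observed = sentence[start:end]
            for word, prob in LEXICON.items():
                if len(word) != len(observed):
                    continue
                ch = channel_logp(observed, word)
                if ch is not None:
                    edges[end].append((start, word, math.log(prob) + ch))
        # Keep the lattice connected: pass unknown characters through unchanged.
        if sentence[start] not in LEXICON:
            edges[start + 1].append((start, sentence[start], OOV_LOGP))
    return edges


def decode(sentence, n_best=3):
    """Return up to n_best (log score, word list) paths, best first."""
    edges = build_lattice(sentence)
    best = {0: [(0.0, [])]}
    for end in range(1, len(sentence) + 1):
        candidates = [
            (score + edge_score, words + [word])
            for start, word, edge_score in edges[end]
            for score, words in best.get(start, [])
        ]
        best[end] = heapq.nlargest(n_best, candidates, key=lambda c: c[0])
    return best[len(sentence)]


if __name__ == "__main__":
    for score, words in decode("今天天汽很好"):
        print(round(score, 2), " / ".join(words))
```

In this toy setting the input 今天天汽很好 contains 汽 in place of 氣. Because 天氣 is a far more probable word than the sequence 天, 汽, the top path both segments the sentence and substitutes the corrected character, while the lower-ranked paths serve as alternative suggestions.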
Correcting Chinese Spelling Errors with Word Lattice Decoding