Correcting Chinese Spelling Errors with Word Lattice Decoding

Published: 11 November 2015

Abstract

Chinese spell checkers are difficult to develop because of two language features: 1) written Chinese has no word boundaries, and a character may function as either a word or a morpheme within a word; and 2) the Chinese character set contains more than ten thousand characters. The former makes it difficult to detect spelling errors, and the latter makes it difficult to construct error models. We develop a word lattice decoding model for a Chinese spell checker that addresses both difficulties. The model performs word segmentation and error correction simultaneously, thereby solving the word boundary problem, and it corrects nonword errors as well as real-word errors. To better estimate the error distribution over a large character set, we also propose a methodology for automatically extracting spelling error samples from the Google Web 1T corpus. Because the corpus is so large, many spelling error samples can be extracted, and they better reflect real-world spelling error distributions. Finally, to make the spell checker more useful in real applications, we produce n-best suggestions for spelling error corrections. We test the proposed approach on the Bakeoff 2013 CSC Datasets; the results show that the proposed methods with the error model significantly outperform Chinese spell checkers that do not use error models.
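
The idea of segmenting and correcting in one pass can be illustrated with a minimal sketch. It is not the authors' implementation: the unigram probabilities, the confusion set, the per-character error probability, and the names `candidates` and `decode` are all assumptions made for illustration, and a unigram language model stands in for whatever language model the paper uses. Every span of the input contributes lattice edges for the observed string and for dictionary words reachable by one confusable-character substitution, and a Viterbi search picks the best path.

```python
import math

# Toy unigram word probabilities (illustrative values, not from the paper).
UNIGRAM = {
    "我": 0.05, "今天": 0.02, "今": 0.005, "天": 0.008, "很": 0.03,
    "高興": 0.01, "高": 0.004, "興": 0.001, "新": 0.006,
}
# Toy confusion set: observed character -> characters the writer may have intended.
CONFUSION = {"新": ["興"]}
P_ERROR = 0.01      # assumed probability that any one character is misspelled
P_UNKNOWN = 1e-8    # assumed floor for out-of-vocabulary single characters
MAX_WORD_LEN = 4


def candidates(observed):
    """Yield (intended_word, log_channel_prob) lattice edges for an observed span."""
    keep = len(observed) * math.log(1.0 - P_ERROR)
    if observed in UNIGRAM:
        yield observed, keep
    elif len(observed) == 1:
        yield observed, keep  # unknown single character keeps the lattice connected
    # Allow at most one substituted character per candidate word.
    sub = (len(observed) - 1) * math.log(1.0 - P_ERROR) + math.log(P_ERROR)
    for k, ch in enumerate(observed):
        for alt in CONFUSION.get(ch, ()):
            word = observed[:k] + alt + observed[k + 1:]
            if word in UNIGRAM:
                yield word, sub


def decode(sentence):
    """Viterbi search over the lattice; returns the best intended word sequence."""
    n = len(sentence)
    best = [(-math.inf, 0, "")] * (n + 1)  # (score, back-pointer, word)
    best[0] = (0.0, 0, "")
    for i in range(n):
        if best[i][0] == -math.inf:
            continue
        for j in range(i + 1, min(n, i + MAX_WORD_LEN) + 1):
            for word, channel in candidates(sentence[i:j]):
                lm = math.log(UNIGRAM.get(word, P_UNKNOWN))
                score = best[i][0] + lm + channel
                if score > best[j][0]:
                    best[j] = (score, i, word)
    # Follow back-pointers from the end of the sentence.
    words, j = [], n
    while j > 0:
        _, i, word = best[j]
        words.append(word)
        j = i
    return words[::-1]


print(decode("我今天很高新"))
```

With these toy numbers, the path that reads the last two characters as the dictionary word 高興 (one substitution) scores higher than the path that keeps 高 and 新 as two separate single-character words, so the segmentation and the correction fall out of the same search. Keeping the k best scores at each lattice position instead of one would be one way to produce n-best suggestions.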
