research-article

A Hybrid Model for Chinese Spelling Check

Published: 30 March 2017

Abstract

Spelling check is more challenging for Chinese than for other languages. This article presents a hybrid model for Chinese spelling check. The model consists of three components: one graph-based model for generic errors and two independently trained models for specific errors. In the graph model, a directed acyclic graph is generated for each sentence, and a single-source shortest-path algorithm is run on the graph to detect and correct general spelling errors simultaneously. Before that, two types of errors over function words (characters) are handled by conditional random fields: confusion between “在” (at; pinyin: zai) and “再” (again, more, then; pinyin: zai), and confusion among “的” (of; pinyin: de), “地” (-ly, adverb-forming particle; pinyin: de), and “得” (so that, have to; pinyin: de). Finally, a rule-based model is used to resolve pronoun confusion between “她” (she; pinyin: ta) and “他” (he; pinyin: ta), as well as some other common collocation errors. The proposed model is evaluated on the standard datasets released by the SIGHAN Bake-off shared tasks, where it achieves state-of-the-art results.
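
The graph component described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how edges are weighted, so the lexicon, the costs, the confusion set, and the names `LEXICON`, `CONFUSION`, `SUBSTITUTION_PENALTY`, `UNKNOWN_COST`, and `correct` below are all hypothetical. The sketch treats character positions as nodes, adds an edge for every lexicon word that matches a span either exactly or after one confusable-character substitution, and then finds the single-source shortest path by relaxing edges in left-to-right (topological) order.

```python
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical toy resources: every word and cost here is invented for illustration.
LEXICON: Dict[str, float] = {
    "明天": 1.0, "再见": 1.0,                     # multi-character words are cheap
    "在": 2.0, "再": 2.5, "见": 3.0, "明": 3.0, "天": 3.0,
}
CONFUSION: Dict[str, List[str]] = {"在": ["再"], "再": ["在"]}  # assumed confusion set
SUBSTITUTION_PENALTY = 1.5   # assumed extra cost for replacing one character
UNKNOWN_COST = 10.0          # assumed fallback cost for an out-of-lexicon character


def candidates(segment: str) -> Iterator[Tuple[str, float]]:
    """Yield the segment itself and every variant with one confusable character swapped."""
    yield segment, 0.0
    for k, ch in enumerate(segment):
        for alt in CONFUSION.get(ch, []):
            yield segment[:k] + alt + segment[k + 1:], SUBSTITUTION_PENALTY


def correct(sentence: str) -> str:
    """Return the word sequence on the cheapest path from position 0 to len(sentence)."""
    n = len(sentence)
    max_len = max(len(w) for w in LEXICON)
    dist: List[float] = [0.0] + [float("inf")] * n
    back: List[Optional[Tuple[int, str]]] = [None] * (n + 1)

    # Edges only go from a smaller position to a larger one, so visiting
    # positions in increasing order is a valid topological order of the DAG.
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            for cand, penalty in candidates(sentence[i:j]):
                if cand in LEXICON:
                    cost = LEXICON[cand] + penalty
                elif j == i + 1 and penalty == 0.0:
                    cost = UNKNOWN_COST  # keep the graph connected
                else:
                    continue
                if dist[i] + cost < dist[j]:
                    dist[j] = dist[i] + cost
                    back[j] = (i, cand)

    # Walk the back-pointers from the end node to recover the chosen words.
    words: List[str] = []
    j = n
    while j > 0:
        i, word = back[j]
        words.append(word)
        j = i
    return "".join(reversed(words))
```

With these invented costs, `correct("明天在见")` takes the edge that rewrites “在见” as the lexicon word “再见”, because that path is cheaper than segmenting the span into two single characters. Detection and correction therefore fall out of the same shortest-path search, which is the property the abstract attributes to the graph model.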

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 16, Issue 3
      September 2017
      167 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3041821

      Copyright © 2017 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 30 March 2017
      • Accepted: 1 January 2017
      • Revised: 1 November 2016
      • Received: 1 July 2016

      Qualifiers

      • research-article
      • Research
      • Refereed
