A Hybrid Ranking Approach to Chinese Spelling Check

Published: 11 November 2015

Abstract

We propose a novel framework for Chinese Spelling Check (CSC), the task of automatically detecting and correcting Chinese spelling errors. The framework contains two key components: candidate generation and candidate ranking. Previous research has relied on either a Statistical Machine Translation (SMT) based model or a Language Model (LM) based model. In contrast, we use both SMT and LM models as components for generating correction candidates, in order to maximize recall; to improve precision, we further employ a Support Vector Machine (SVM) classifier to rank the candidates generated by the SMT and LM models. Experiments show that our framework outperforms the other systems in the SIGHAN 7 shared task that used the same or similar resources as ours. Even compared with the state-of-the-art systems, which used more resources, such as a considerably large dictionary, an idiom dictionary, and other semantic information, our framework still obtains competitive results. Furthermore, to address the scarcity of resources for training the SMT model, we generate around 2 million artificial training sentences using the Chinese character confusion sets provided by the SIGHAN 7 shared task, which group Chinese characters with similar shapes and similar pronunciations.
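
The artificial training data described in the abstract can be illustrated with a minimal sketch. It assumes that each artificial sentence is made by replacing a single character with a member of its confusion set; the abstract does not state the exact substitution procedure, so that choice, the function name `make_noisy_pairs`, and the toy confusion entries are all hypothetical and are not taken from the SIGHAN 7 resources.

```python
import random

# Toy confusion sets (hypothetical entries, for illustration only).
# Each key maps a character to characters of similar shape or pronunciation.
CONFUSION_SETS = {
    "在": ["再"],
    "的": ["得", "地"],
    "他": ["她", "它"],
}

def make_noisy_pairs(sentence, confusion_sets, rng, max_pairs=None):
    """Return (erroneous, correct) sentence pairs.

    Assumption: every erroneous sentence differs from the correct one
    in exactly one character, drawn from that character's confusion set.
    """
    pairs = []
    for i, ch in enumerate(sentence):
        for wrong in confusion_sets.get(ch, []):
            noisy = sentence[:i] + wrong + sentence[i + 1:]
            pairs.append((noisy, sentence))
    # Optionally subsample so one long sentence does not dominate the corpus.
    if max_pairs is not None and len(pairs) > max_pairs:
        pairs = rng.sample(pairs, max_pairs)
    return pairs

pairs = make_noisy_pairs("他在家", CONFUSION_SETS, random.Random(0))
for noisy, correct in pairs:
    print(noisy, "->", correct)
```

Pairs of this shape could serve as parallel source and target text for training a translation-style correction model, with the erroneous sentence as the source side and the original sentence as the target side.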

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 4
      Special Issue on Chinese Spell Checking
      October 2015
      92 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/2845556

      Copyright © 2015 Owner/Author

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 11 November 2015
      • Accepted: 1 March 2015
      • Revised: 1 December 2014
      • Received: 1 August 2014

      Qualifiers

      • research-article
      • Research
      • Refereed
